### Description

With a single GPU, `optimize_model` can be used successfully to accelerate inference. With 2 GPUs and pipeline parallelism (`device_map="auto"`), however, the errors below are raised.

Code:
```python
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto", low_cpu_mem_usage=True
).eval()
optimize_model(model)
tokenizer = load_tokenizer(tokenizer_path, config)

with torch.inference_mode(), torch.autocast(dtype=torch.float16, cache_enabled=True, device_type="cuda"):
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    generation_config = GenerationConfig(**config)
    output_ids = model.generate(inputs=input_ids, generation_config=generation_config)
    gen_texts = tokenizer.batch_decode(output_ids, skip_special_tokens=config["skip_special_tokens"])
```
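For comparison, the single-GPU setup works with `optimize_model`. A minimal sketch of that configuration is below; `device_map={"": 0}` is my assumption for keeping every layer on one device, since the original single-GPU loading code is not shown above.

```python
# Hedged sketch of the working single-GPU variant. device_map={"": 0} is an
# assumption: it pins the whole model to GPU 0, so accelerate inserts no
# cross-device hooks and the CUDA graph capture stays on a single device.
import torch
from transformers import AutoModelForCausalLM, GenerationConfig
from kernl.model_optimization import optimize_model

model = AutoModelForCausalLM.from_pretrained(
    model_path,                      # same llama-7b-hf checkpoint as above
    torch_dtype=torch.float16,
    device_map={"": 0},              # everything on a single GPU
    low_cpu_mem_usage=True,
).eval()
optimize_model(model)

tokenizer = load_tokenizer(tokenizer_path, config)
with torch.inference_mode(), torch.autocast(dtype=torch.float16, cache_enabled=True, device_type="cuda"):
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(0)
    output_ids = model.generate(inputs=input_ids, generation_config=GenerationConfig(**config))
```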
### Steps to reproduce

Model: `llama-7b-hf`

Code:
```python
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto", low_cpu_mem_usage=True
).eval()
optimize_model(model)
tokenizer = load_tokenizer(tokenizer_path, config)

with torch.inference_mode(), torch.autocast(dtype=torch.float16, cache_enabled=True, device_type="cuda"):
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    generation_config = GenerationConfig(**config)
    output_ids = model.generate(inputs=input_ids, generation_config=generation_config)
    gen_texts = tokenizer.batch_decode(output_ids, skip_special_tokens=config["skip_special_tokens"])
```

Shell: `CUDA_VISIBLE_DEVICES=2,3 python kernl.py $args`
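With two visible devices, `device_map="auto"` shards the LLaMA layers across both GPUs, and accelerate's hooks move activations between them (the `send_to_device` frames in the traceback below). A quick sketch to inspect the split; `hf_device_map` is set by `from_pretrained` whenever a `device_map` is passed:

```python
# Sketch: print how accelerate split the model across the two GPUs.
# With CUDA_VISIBLE_DEVICES=2,3 they appear here as devices 0 and 1.
print(model.hf_device_map)
# Illustrative output (layer boundaries are an assumption, not measured):
# {'model.embed_tokens': 0, 'model.layers.0': 0, ..., 'model.layers.31': 1,
#  'model.norm': 1, 'lm_head': 1}
```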
### Expected Behavior

Successful inference.

### Actual Behavior

Call using an FX-traced Module, line 5 of the traced Module's generated forward function:
```python
def forward(self, t : torch.Tensor):
    to = t.to(1, non_blocking = False);  t = None
    return (to,)
```
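The `t.to(1, non_blocking=False)` in the traced forward is accelerate's `pre_forward` hook copying activations to the second GPU, and kernl wraps the compiled module in CUDA graph capture (`cudagraphify_impl`). A cross-device copy inside an active capture seems to be what trips the error. A minimal sketch, assuming at least two visible CUDA devices, that appears to hit the same limitation outside kernl:

```python
import torch

# Minimal sketch (assumes >= 2 visible CUDA devices): copying a tensor to
# another device while a CUDA graph is being captured on device 0 depends on
# a stream that is not part of the capture, which appears to raise the same
# "dependency created on uncaptured work in another stream" error.
x = torch.randn(8, device="cuda:0")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = x.to(1, non_blocking=False)  # cross-device copy during capture
```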
```
  0%|          | 0/15 [00:23<?, ?it/s]
Traceback (most recent call last):
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 328, in cudagraphify_impl
static_outputs = model(list(static_inputs))
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/kernl/optimizer/cuda_graph.py", line 130, in <lambda>
model=lambda args: model(*args), inputs=new_inputs, static_input_idxs=tuple(range(len(inputs)))
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/fx/graph_module.py", line 662, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/fx/graph_module.py", line 279, in __call__
raise e.with_traceback(None)
RuntimeError: CUDA error: dependency created on uncaptured work in another stream
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/cognitive_comp/renxiaoqin/workspace/ccnl_chatgpt_tob/dev_liuhan_sft/gpt-neox/workspace/inference/model_inference_kernl.py", line 216, in <module>
main()
File "/cognitive_comp/renxiaoqin/workspace/ccnl_chatgpt_tob/dev_liuhan_sft/gpt-neox/workspace/inference/model_inference_kernl.py", line 212, in main
do_evaluate(args)
File "/cognitive_comp/renxiaoqin/workspace/ccnl_chatgpt_tob/dev_liuhan_sft/gpt-neox/workspace/inference/model_inference_kernl.py", line 39, in wrapper
res = func(*args, **kwargs)
File "/cognitive_comp/renxiaoqin/workspace/ccnl_chatgpt_tob/dev_liuhan_sft/gpt-neox/workspace/inference/model_inference_kernl.py", line 180, in do_evaluate
gen_texts = generate(text, model, tokenizer, config, args.seed)
File "/cognitive_comp/renxiaoqin/workspace/ccnl_chatgpt_tob/dev_liuhan_sft/gpt-neox/workspace/inference/model_inference_kernl.py", line 148, in generate
output_ids = model.generate(inputs=input_ids, generation_config=generation_config)
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/transformers/generation/utils.py", line 1563, in generate
return self.sample(
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/transformers/generation/utils.py", line 2610, in sample
outputs = self(
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 209, in _fn
return fn(*args, **kwargs)
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/kernl/model_optimization.py", line 64, in run
return model.forward_original(*args, **kwargs)
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/accelerate/hooks.py", line 160, in new_forward
args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in <graph break in new_forward>
output = old_forward(*args, **kwargs)
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 688, in forward
outputs = self.model(
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 578, in forward
layer_outputs = decoder_layer(
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/accelerate/hooks.py", line 160, in new_forward
args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/accelerate/hooks.py", line 282, in pre_forward
return send_to_device(args, self.execution_device), send_to_device(kwargs, self.execution_device)
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/accelerate/utils/operations.py", line 133, in send_to_device
return recursively_apply(_send_to_device, tensor, device, non_blocking, test_type=_has_to_method)
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/accelerate/utils/operations.py", line 82, in recursively_apply
return honor_type(
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/accelerate/utils/operations.py", line 53, in honor_type
return type(obj)(generator)
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/accelerate/utils/operations.py", line 85, in <genexpr>
recursively_apply(
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/accelerate/utils/operations.py", line 101, in recursively_apply
return func(data, *args, **kwargs)
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/accelerate/utils/operations.py", line 124, in _send_to_device
def _send_to_device(t, device, non_blocking):
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 209, in _fn
return fn(*args, **kwargs)
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/kernl/optimizer/cuda_graph.py", line 129, in run
f = cudagraphify_impl(
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 328, in cudagraphify_impl
static_outputs = model(list(static_inputs))
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/cuda/graphs.py", line 173, in __exit__
self.cuda_graph.capture_end()
File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/cuda/graphs.py", line 79, in capture_end
super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
### Your environment
- Operating system and version (e.g. Ubuntu 20.04.2 LTS):
- Python version (e.g. Python 3.9.16):
- torch: 2.0.0 (same as requirements.txt)
### Self-service
- [ ] I would be willing to help fix this bug myself.
### Code of Conduct
- [X] I agree to follow this project's Code of Conduct