
bug: does kernl support pipeline parallel? #323

Open

ninisy opened this issue Apr 27, 2023 · 0 comments

ninisy commented Apr 27, 2023

Description

Under a single-GPU setup, optimize_model works and accelerates inference. With 2 GPUs and pipeline parallelism (device_map="auto"), the same code raises an error; the code and shell command are given in the reproduction steps below.
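For reference, a minimal sketch of the single-GPU path of the kind the report says works. This is an assumption about that setup, not the reporter's exact code: `model_path`, `text`, and `config` are placeholders, and the reporter's `load_tokenizer` helper is replaced by AutoTokenizer so the snippet is self-contained.

```python
# Hedged single-GPU sketch: the whole model lives on one device, so no
# accelerate hook moves tensors between GPUs mid-forward.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
from kernl.model_optimization import optimize_model

model = (
    AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
    .eval()
    .cuda()  # single device instead of device_map="auto"
)
optimize_model(model)
tokenizer = AutoTokenizer.from_pretrained(model_path)

with torch.inference_mode(), torch.autocast(dtype=torch.float16, cache_enabled=True, device_type="cuda"):
    input_ids = tokenizer(text, return_tensors="pt").input_ids.cuda()
    output_ids = model.generate(inputs=input_ids, generation_config=GenerationConfig(**config))
    gen_texts = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```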

Steps to reproduce

model: llama-7b-hf
code:

model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto", low_cpu_mem_usage=True).eval()
optimize_model(model)
tokenizer = load_tokenizer(tokenizer_path, config)
with torch.inference_mode(), torch.autocast(dtype=torch.float16, cache_enabled=True, device_type="cuda"):
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    generation_config = GenerationConfig(**config)
    output_ids = model.generate(inputs=input_ids, generation_config=generation_config)
    gen_texts = [t for t in tokenizer.batch_decode(output_ids, skip_special_tokens=config["skip_special_tokens"])]

shell: CUDA_VISIBLE_DEVICES=2,3 python kernl.py $args

Expected Behavior

Successful inference.

Actual Behavior

Call using an FX-traced Module, line 5 of the traced Module's generated forward function:

def forward(self, t : torch.Tensor):
    to = t.to(1, non_blocking = False);  t = None
    return (to,)

  0%|          | 0/15 [00:23<?, ?it/s]
Traceback (most recent call last):
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 328, in cudagraphify_impl
    static_outputs = model(list(static_inputs))
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/kernl/optimizer/cuda_graph.py", line 130, in <lambda>
    model=lambda args: model(*args), inputs=new_inputs, static_input_idxs=tuple(range(len(inputs)))
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/fx/graph_module.py", line 662, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/fx/graph_module.py", line 279, in __call__
    raise e.with_traceback(None)
RuntimeError: CUDA error: dependency created on uncaptured work in another stream
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cognitive_comp/renxiaoqin/workspace/ccnl_chatgpt_tob/dev_liuhan_sft/gpt-neox/workspace/inference/model_inference_kernl.py", line 216, in <module>
    main()
  File "/cognitive_comp/renxiaoqin/workspace/ccnl_chatgpt_tob/dev_liuhan_sft/gpt-neox/workspace/inference/model_inference_kernl.py", line 212, in main
    do_evaluate(args)
  File "/cognitive_comp/renxiaoqin/workspace/ccnl_chatgpt_tob/dev_liuhan_sft/gpt-neox/workspace/inference/model_inference_kernl.py", line 39, in wrapper
    res = func(*args, **kwargs)
  File "/cognitive_comp/renxiaoqin/workspace/ccnl_chatgpt_tob/dev_liuhan_sft/gpt-neox/workspace/inference/model_inference_kernl.py", line 180, in do_evaluate
    gen_texts = generate(text, model, tokenizer, config, args.seed)
  File "/cognitive_comp/renxiaoqin/workspace/ccnl_chatgpt_tob/dev_liuhan_sft/gpt-neox/workspace/inference/model_inference_kernl.py", line 148, in generate
    output_ids = model.generate(inputs=input_ids, generation_config=generation_config)
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/transformers/generation/utils.py", line 1563, in generate
    return self.sample(
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/transformers/generation/utils.py", line 2610, in sample
    outputs = self(
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 209, in _fn
    return fn(*args, **kwargs)
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/kernl/model_optimization.py", line 64, in run
    return model.forward_original(*args, **kwargs)
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/accelerate/hooks.py", line 160, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in <graph break in new_forward>
    output = old_forward(*args, **kwargs)
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 688, in forward
    outputs = self.model(
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 578, in forward
    layer_outputs = decoder_layer(
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/accelerate/hooks.py", line 160, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/accelerate/hooks.py", line 282, in pre_forward
    return send_to_device(args, self.execution_device), send_to_device(kwargs, self.execution_device)
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/accelerate/utils/operations.py", line 133, in send_to_device
    return recursively_apply(_send_to_device, tensor, device, non_blocking, test_type=_has_to_method)
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/accelerate/utils/operations.py", line 82, in recursively_apply
    return honor_type(
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/accelerate/utils/operations.py", line 53, in honor_type
    return type(obj)(generator)
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/accelerate/utils/operations.py", line 85, in <genexpr>
    recursively_apply(
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/accelerate/utils/operations.py", line 101, in recursively_apply
    return func(data, *args, **kwargs)
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/accelerate/utils/operations.py", line 124, in _send_to_device
    def _send_to_device(t, device, non_blocking):
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 209, in _fn
    return fn(*args, **kwargs)
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/kernl/optimizer/cuda_graph.py", line 129, in run
    f = cudagraphify_impl(
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 328, in cudagraphify_impl
    static_outputs = model(list(static_inputs))
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/cuda/graphs.py", line 173, in __exit__
    self.cuda_graph.capture_end()
  File "/home/renxiaoqin/miniconda3/envs/py310_belle/lib/python3.9/site-packages/torch/cuda/graphs.py", line 79, in capture_end
    super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
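
The failure pattern matches the CUDA-graph capture constraint: all work recorded during capture has to stay on the capturing device and stream, while the accelerate pre_forward hook visible in the trace moves tensors to another GPU (the `t.to(1, ...)` in the traced module above). Below is a minimal sketch of that constraint, assuming at least one CUDA device; the names are illustrative and not from kernl.

```python
import torch

# Standard CUDA-graph pattern (torch >= 1.10): warm up on a side stream,
# capture on the current stream, then replay with updated static inputs.
static_in = torch.zeros(8, device="cuda")

s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    _ = static_in * 2  # warm-up run outside the graph
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = static_in * 2      # same device and stream: capturable
    # static_out = static_in.to(1)  # a cross-device copy during capture, as the
    #                               # accelerate hook does under pipeline
    #                               # parallelism, triggers the "dependency
    #                               # created on uncaptured work" error

static_in.copy_(torch.full((8,), 3.0, device="cuda"))
g.replay()  # static_out now holds 6.0
```

If that reading is right, the failure comes from the interaction between kernl's CUDA-graph replay and accelerate's layer-to-layer device moves, not from the model code itself.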



### Your environment

- Operating system and version (e.g. Ubuntu 20.04.2 LTS):
- Python version (e.g. Python 3.9.16):
- torch: 2.0.0 (same as requirements.txt)


### Self-service

- [ ] I would be willing to help fix this bug myself.

### Code of Conduct

- [X] I agree to follow this project's Code of Conduct