Hello, I'm trying to run a self-converted version of Qwen2-VL 2B on an Intel GPU, but I keep getting the exception below, while the same code works totally fine on the CPU. I've tried tracing the inputs at each step on the Python side but cannot figure out what differs between CPU and GPU inference. Sometimes it works after I recreate my venv, and then a different failure occurs the next time.
I'm running this on Python 3.13.9 with the packages below. I also tried different combinations of installed packages (including torch with different backends) and a different model, but nothing changed.
(BTW, openvino 2025.2.0 sometimes throws another error that occurs on both CPU and GPU.)
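For context, the load-and-generate pattern is roughly the following. This is a minimal sketch, not my exact code: the model directory, prompt, image, and ov_config values are placeholders, and it assumes optimum-intel's `OVModelForVisualCausalLM` API (the class the traceback below goes through).

```python
# Minimal sketch (placeholders, not my exact code): load the self-converted
# OpenVINO IR with optimum-intel and run generation on the GPU device.
from optimum.intel import OVModelForVisualCausalLM
from transformers import AutoProcessor
from PIL import Image

model_dir = "qwen2-vl-2b-ov"  # placeholder: directory with the converted IR
processor = AutoProcessor.from_pretrained(model_dir)
model = OVModelForVisualCausalLM.from_pretrained(
    model_dir,
    device="GPU",  # "CPU" works fine; "GPU" raises the exception below
)

prompt = "Describe the image."        # placeholder prompt (chat template applied upstream)
image = Image.open("example.jpg")     # placeholder image
inputs = processor(text=[prompt], images=[image], return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```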
src/vlm_server/internal/Model.py:285: in _response
generated_ids = self.model.generate(**inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.13/site-packages/torch/utils/_contextlib.py:123: in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.13/site-packages/transformers/generation/utils.py:2618: in generate
result = self._sample(
.venv/lib/python3.13/site-packages/transformers/generation/utils.py:3602: in _sample
outputs = self(**model_inputs, return_dict=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.13/site-packages/optimum/modeling_base.py:113: in __call__
return self.forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.13/site-packages/optimum/intel/openvino/modeling_visual_language.py:2914: in forward
result = super().forward(
.venv/lib/python3.13/site-packages/optimum/intel/openvino/modeling_visual_language.py:791: in forward
return self.language_model.forward(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <optimum.intel.openvino.modeling_visual_language.OVModelWithEmbedForCausalLM object at 0x7caf96699fd0>, input_ids = None
attention_mask = tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1... 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1]])
past_key_values = None
position_ids = tensor([[[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24...9, 40, 41, 36, 37, 38, 39, 40, 41, 36, 37, 38, 39, 40, 41, 36, 37,
38, 39, 40, 41, 42, 43, 44, 45, 46, 47]]])
inputs_embeds = tensor([[[-0.0075, 0.0098, 0.0053, ..., -0.0015, 0.0090, -0.0060],
[ 0.0023, 0.0172, 0.0163, ..., 0.0...42, 0.0131, ..., 0.0146, 0.0284, -0.0102],
[ 0.0338, -0.0124, 0.0132, ..., 0.0008, -0.0107, 0.0165]]])
kwargs = {'cache_position': tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20,..., 67, 68, 69, 70, 71,
72, 73, 74, 75, 76, 77]), 'return_dict': True, 'token_type_ids': None, 'use_cache': True}
inputs = {'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1,...9, 40, 41, 36, 37, 38, 39, 40, 41, 36, 37, 38, 39,
40, 41, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]]])}
def forward(
self,
input_ids: torch.LongTensor,
attention_mask: Optional[torch.LongTensor] = None,
past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
position_ids: Optional[torch.LongTensor] = None,
inputs_embeds: Optional[torch.LongTensor] = None,
**kwargs,
):
self.compile()
inputs = self.prepare_inputs(
input_ids=input_ids,
attention_mask=attention_mask,
past_key_values=past_key_values,
position_ids=position_ids,
inputs_embeds=inputs_embeds,
**kwargs,
)
# Run inference
self.request.start_async(inputs, share_inputs=True)
> self.request.wait()
E RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:245:
E Exception from src/bindings/python/src/pyopenvino/core/infer_request.hpp:54:
E Caught exception: bad_function_call
.venv/lib/python3.13/site-packages/optimum/intel/openvino/modeling_visual_language.py:223: RuntimeError
Here's my code (part of the class that handles inference):
with this ov_config:
And the error:
Caught exception: Exception from src/plugins/intel_gpu/src/runtime/ocl/ocl_stream.cpp:272:
[GPU] [CL_EXT] setArgUsm in KernelIntel failed, error code: -49 CL_INVALID_ARG_INDEX