
Conversation

AlexanderKalistratov
Contributor

  • Fixed the logic for switching between the prefill and generate stages for Gemma3
  • Added support for a new input, token_type_ids (a minimal usage sketch follows below)
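For context, a minimal sketch of how a caller could feed the new input alongside the usual LLM tensors. The model path, device, shapes, and fill values are assumptions for illustration; only the tensor names (input_ids, attention_mask, position_ids, token_type_ids) come from the discussion below, and this is not code from the PR.

#include <openvino/openvino.hpp>
#include <algorithm>
#include <numeric>

int main() {
    ov::Core core;
    // Model path, device, and sequence length are placeholders for illustration.
    auto compiled = core.compile_model("gemma3_llm.xml", "NPU");
    auto request = compiled.create_infer_request();

    const size_t seq_len = 8;
    ov::Tensor input_ids{ov::element::i64, ov::Shape{1, seq_len}};
    ov::Tensor attention_mask{ov::element::i64, ov::Shape{1, seq_len}};
    ov::Tensor position_ids{ov::element::i64, ov::Shape{1, seq_len}};
    ov::Tensor token_type_ids{ov::element::i64, ov::Shape{1, seq_len}};

    // Real values would come from the tokenizer; plain defaults keep the sketch short.
    std::fill_n(input_ids.data<int64_t>(), seq_len, int64_t{0});
    std::fill_n(attention_mask.data<int64_t>(), seq_len, int64_t{1});
    std::iota(position_ids.data<int64_t>(), position_ids.data<int64_t>() + seq_len, int64_t{0});
    // Assumption: all-zero token types for a text-only prompt.
    std::fill_n(token_type_ids.data<int64_t>(), seq_len, int64_t{0});

    request.set_tensor("input_ids", input_ids);
    request.set_tensor("attention_mask", attention_mask);
    request.set_tensor("position_ids", position_ids);
    request.set_tensor("token_type_ids", token_type_ids);  // the new input
    request.infer();
    return 0;
}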

@github-actions github-actions bot added the category: NPU (OpenVINO NPU plugin) and category: NPUW (NPUW plugin) labels Sep 17, 2025
@AlexanderKalistratov AlexanderKalistratov changed the title from "[NPUW] Initial support for Gemma3" to "[NPUW] Initial support for Gemma3 on NPU" Sep 17, 2025

void ov::npuw::LLMInferRequest::prepare_for_new_conversation() {
    fill_tensor_bytes(m_prefill_request->get_tensor(m_prefill_in_ports.at(m_input_ids_name)), 0u);
    if (auto totyids_port = m_prefill_in_ports.find(layer_names::token_type_ids);
        totyids_port != m_prefill_in_ports.end()) {
Contributor

Nitpick: suggest renaming this to type_ids_port

-std::unordered_map<std::string, ov::Output<const ov::Node>> in_ports,
-std::unordered_map<std::string, ov::Output<const ov::Node>> out_ports,
+const std::unordered_map<std::string, ov::Output<const ov::Node>>& in_ports,
+const std::unordered_map<std::string, ov::Output<const ov::Node>>& out_ports,
Contributor

Thanks!

auto attn_mask_in_tensor = m_prefill_request->get_tensor(m_prefill_in_ports.at(layer_names::attention_mask));
auto pos_ids_in_tensor = m_prefill_request->get_tensor(m_prefill_in_ports.at(layer_names::position_ids));

auto to_ty_ids_in_tensor = ov::npuw::util::TensorPtr();
Contributor

Nitpick: suggest renaming this to just types_ids_in_tensor

Contributor

Nitpick: the naming is inconsistent; in some places it's type and in others types


auto to_ty_ids_in_tensor = ov::npuw::util::TensorPtr();

if (auto ttis_port = m_prefill_in_ports.find(layer_names::token_type_ids); ttis_port != m_prefill_in_ports.end()) {
Contributor

Nitpick: suggest renaming this to type_ids_port

std::copy_n(attention_mask->data<int64_t>() + kvcache_desc.num_stored_tokens,
current_prompts_len,
attn_mask_in_tensor->data<int64_t>() + attn_mask_in_tensor->get_size() - current_prompts_len);
if (to_ty_ids_in_tensor) {
Contributor

We also update the attention mask after each chunk, at the end of the loop, before the next iteration:

        // Update attention mask for the next iteration
        std::copy_n(attn_mask_in_tensor->data<int64_t>() + attn_mask_in_tensor->get_size() - current_prompts_len,
                    current_prompts_len,
                    attn_mask_in_tensor->data<int64_t>() + kvcache_desc.num_stored_tokens - current_prompts_len);

Do we need to do this for token_type_ids as well?
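For reference, a hedged sketch of what the analogous end-of-chunk update could look like if token_type_ids is laid out the same way as the attention mask; it simply mirrors the copy quoted above and is not code from this PR:

        // Hypothetical: shift the token type ids for the next chunk the same
        // way the attention mask is shifted above.
        if (to_ty_ids_in_tensor) {
            std::copy_n(to_ty_ids_in_tensor->data<int64_t>() + to_ty_ids_in_tensor->get_size() - current_prompts_len,
                        current_prompts_len,
                        to_ty_ids_in_tensor->data<int64_t>() + kvcache_desc.num_stored_tokens - current_prompts_len);
        }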

 ov::SoPtr<ov::ITensor> attention_mask,
-ov::SoPtr<ov::ITensor> position_ids) {
+ov::SoPtr<ov::ITensor> position_ids,
+ov::SoPtr<ov::ITensor> token_types_ids) {
Contributor

Nitpick: in some places it is type and not types

input_ids->get_byte_size(),
reinterpret_cast<uint8_t*>(kv_input_ids->data()) + kv_input_ids->get_byte_size() - input_ids->get_byte_size());

if (token_types_ids) {
Contributor

I think we additionally need to fill that tensor with 0s under the if (!m_generate_initialized) condition above (as is done for the other inputs), if token_type_ids behaves like attention_mask and contains data for the whole context.

Contributor

Do they behave like attention_mask? (I thought they do, because of the code in infer_chunked_prefill().)
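If token_type_ids does turn out to behave like attention_mask, the zero-fill suggested above could look roughly like this; a sketch reusing the helpers already visible in this diff, with the surrounding condition assumed rather than taken from the PR:

if (!m_generate_initialized) {
    // ... existing zero-fills for the other generate-stage inputs ...
    if (auto port = m_kvcache_in_ports.find(layer_names::token_type_ids);
        port != m_kvcache_in_ports.end()) {
        // Reset the whole-context token type ids, as is done for the other inputs.
        fill_tensor_bytes(m_kvcache_request->get_tensor(port->second), 0u);
    }
}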

reinterpret_cast<uint8_t*>(kv_input_ids->data()) + kv_input_ids->get_byte_size() - input_ids->get_byte_size());

if (token_types_ids) {
auto r_token_type_ids = m_kvcache_request->get_tensor(m_kvcache_in_ports.at(layer_names::token_type_ids));
Contributor

Why the r prefix?


if (m_first_run) {
// Most of the models have position_ids->data<int64_t>()[0] == 0 for the first infer
// But gemma3 has it's == 1
Contributor

Nitpick: do we need 's here?

if (m_first_run) {
// Most of the models have position_ids->data<int64_t>()[0] == 0 for the first infer
// But gemma3 has it's == 1
// We need to store original zero position id in order to distinguish between prefill and generate stage
Contributor

Maybe we could call it the first position id, but feel free to skip this comment.
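A sketch of the idea being discussed: remember whatever position id arrives on the very first infer instead of assuming 0, and compare against it later to tell prefill from generate. m_first_position_id is a hypothetical member name used only for illustration; the actual PR may store and use it differently.

if (m_first_run) {
    // Most models report position_ids[0] == 0 on the first infer, but Gemma3
    // reports 1, so store the actual first value rather than hard-coding 0.
    m_first_position_id = position_ids->data<int64_t>()[0];
    m_first_run = false;
}
// Later calls can compare position_ids[0] with m_first_position_id to decide
// whether this is still the prefill stage or already the generate stage.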
