
Conversation

AlexanderKalistratov
Contributor

  • Fixed the logic for switching between the prefill and generate stages for Gemma3
  • Added support for a new input, token_type_ids (a minimal usage sketch follows below)
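For context, a minimal sketch of how a caller could feed the new input alongside the usual LLM tensors. The model path, device, shapes, and fill values are assumptions for illustration; only the tensor names (input_ids, attention_mask, position_ids, token_type_ids) come from the discussion below, and this is not code from the PR.

#include <openvino/openvino.hpp>
#include <algorithm>
#include <numeric>

int main() {
    ov::Core core;
    // Model path, device, and sequence length are placeholders for illustration.
    auto compiled = core.compile_model("gemma3_llm.xml", "NPU");
    auto request = compiled.create_infer_request();

    const size_t seq_len = 8;
    ov::Tensor input_ids{ov::element::i64, ov::Shape{1, seq_len}};
    ov::Tensor attention_mask{ov::element::i64, ov::Shape{1, seq_len}};
    ov::Tensor position_ids{ov::element::i64, ov::Shape{1, seq_len}};
    ov::Tensor token_type_ids{ov::element::i64, ov::Shape{1, seq_len}};

    // Real values would come from the tokenizer; plain defaults keep the sketch short.
    std::fill_n(input_ids.data<int64_t>(), seq_len, int64_t{0});
    std::fill_n(attention_mask.data<int64_t>(), seq_len, int64_t{1});
    std::iota(position_ids.data<int64_t>(), position_ids.data<int64_t>() + seq_len, int64_t{0});
    // Assumption: all-zero token types for a text-only prompt.
    std::fill_n(token_type_ids.data<int64_t>(), seq_len, int64_t{0});

    request.set_tensor("input_ids", input_ids);
    request.set_tensor("attention_mask", attention_mask);
    request.set_tensor("position_ids", position_ids);
    request.set_tensor("token_type_ids", token_type_ids);  // the new input
    request.infer();
    return 0;
}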

@github-actions github-actions bot added the category: NPU (OpenVINO NPU plugin) and category: NPUW (NPUW plugin) labels Sep 17, 2025
@AlexanderKalistratov AlexanderKalistratov changed the title from "[NPUW] Initial support for Gemma3" to "[NPUW] Initial support for Gemma3 on NPU" Sep 17, 2025

void ov::npuw::LLMInferRequest::prepare_for_new_conversation() {
    fill_tensor_bytes(m_prefill_request->get_tensor(m_prefill_in_ports.at(m_input_ids_name)), 0u);
    if (auto totyids_port = m_prefill_in_ports.find(layer_names::token_type_ids);
        totyids_port != m_prefill_in_ports.end()) {
Contributor

Nitpick: suggest renaming this to type_ids_port

-std::unordered_map<std::string, ov::Output<const ov::Node>> in_ports,
-std::unordered_map<std::string, ov::Output<const ov::Node>> out_ports,
+const std::unordered_map<std::string, ov::Output<const ov::Node>>& in_ports,
+const std::unordered_map<std::string, ov::Output<const ov::Node>>& out_ports,
Contributor

Thanks!

auto attn_mask_in_tensor = m_prefill_request->get_tensor(m_prefill_in_ports.at(layer_names::attention_mask));
auto pos_ids_in_tensor = m_prefill_request->get_tensor(m_prefill_in_ports.at(layer_names::position_ids));

auto to_ty_ids_in_tensor = ov::npuw::util::TensorPtr();
Contributor

Nitpick: suggest renaming this to just types_ids_in_tensor

Contributor

Nitpick: the naming is inconsistent; in some places it's type and in others types


auto to_ty_ids_in_tensor = ov::npuw::util::TensorPtr();

if (auto ttis_port = m_prefill_in_ports.find(layer_names::token_type_ids); ttis_port != m_prefill_in_ports.end()) {
Contributor

Nitpick: suggest renaming this to type_ids_port

std::copy_n(attention_mask->data<int64_t>() + kvcache_desc.num_stored_tokens,
current_prompts_len,
attn_mask_in_tensor->data<int64_t>() + attn_mask_in_tensor->get_size() - current_prompts_len);
if (to_ty_ids_in_tensor) {
Contributor

We also update the attention mask after each chunk, at the end of the loop, before the next iteration:

        // Update attention mask for the next iteration
        std::copy_n(attn_mask_in_tensor->data<int64_t>() + attn_mask_in_tensor->get_size() - current_prompts_len,
                    current_prompts_len,
                    attn_mask_in_tensor->data<int64_t>() + kvcache_desc.num_stored_tokens - current_prompts_len);

Do we need to do this for token_type_ids as well?
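For reference, a hedged sketch of what the analogous end-of-chunk update could look like if token_type_ids is laid out the same way as the attention mask; it simply mirrors the copy quoted above and is not code from this PR:

        // Hypothetical: shift the token type ids for the next chunk the same
        // way the attention mask is shifted above.
        if (to_ty_ids_in_tensor) {
            std::copy_n(to_ty_ids_in_tensor->data<int64_t>() + to_ty_ids_in_tensor->get_size() - current_prompts_len,
                        current_prompts_len,
                        to_ty_ids_in_tensor->data<int64_t>() + kvcache_desc.num_stored_tokens - current_prompts_len);
        }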

 ov::SoPtr<ov::ITensor> attention_mask,
-ov::SoPtr<ov::ITensor> position_ids) {
+ov::SoPtr<ov::ITensor> position_ids,
+ov::SoPtr<ov::ITensor> token_types_ids) {
Contributor

Nitpick: in some places it is type and not types

input_ids->get_byte_size(),
reinterpret_cast<uint8_t*>(kv_input_ids->data()) + kv_input_ids->get_byte_size() - input_ids->get_byte_size());

if (token_types_ids) {
Contributor

I think we additionally need to fill that tensor with 0s under the if (!m_generate_initialized) condition above (as is done for the other inputs), if token_type_ids behaves like attention_mask and contains data for the whole context.

Contributor

Do they behave like attention_mask? (I thought they do, because of the code in infer_chunked_prefill().)
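If token_type_ids does turn out to behave like attention_mask, the zero-fill suggested above could look roughly like this; a sketch reusing the helpers already visible in this diff, with the surrounding condition assumed rather than taken from the PR:

if (!m_generate_initialized) {
    // ... existing zero-fills for the other generate-stage inputs ...
    if (auto port = m_kvcache_in_ports.find(layer_names::token_type_ids);
        port != m_kvcache_in_ports.end()) {
        // Reset the whole-context token type ids, as is done for the other inputs.
        fill_tensor_bytes(m_kvcache_request->get_tensor(port->second), 0u);
    }
}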

reinterpret_cast<uint8_t*>(kv_input_ids->data()) + kv_input_ids->get_byte_size() - input_ids->get_byte_size());

if (token_types_ids) {
auto r_token_type_ids = m_kvcache_request->get_tensor(m_kvcache_in_ports.at(layer_names::token_type_ids));
Contributor

Why the r prefix?


if (m_first_run) {
// Most of the models have position_ids->data<int64_t>()[0] == 0 for the first infer
// But gemma3 has it's == 1
Contributor

Nitpick: do we need 's here?

if (m_first_run) {
// Most of the models have position_ids->data<int64_t>()[0] == 0 for the first infer
// But gemma3 has it's == 1
// We need to store original zero position id in order to distinguish between prefill and generate stage
Contributor

Maybe we could call it the first position id, but feel free to skip this comment.
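A sketch of the idea being discussed: remember whatever position id arrives on the very first infer instead of assuming 0, and compare against it later to tell prefill from generate. m_first_position_id is a hypothetical member name used only for illustration; the actual PR may store and use it differently.

if (m_first_run) {
    // Most models report position_ids[0] == 0 on the first infer, but Gemma3
    // reports 1, so store the actual first value rather than hard-coding 0.
    m_first_position_id = position_ids->data<int64_t>()[0];
    m_first_run = false;
}
// Later calls can compare position_ids[0] with m_first_position_id to decide
// whether this is still the prefill stage or already the generate stage.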
