[NPUW] Initial support for Gemma3 on NPU #32102
base: master
Conversation
…emma3 & handling of token_types_ids input
void ov::npuw::LLMInferRequest::prepare_for_new_conversation() {
    fill_tensor_bytes(m_prefill_request->get_tensor(m_prefill_in_ports.at(m_input_ids_name)), 0u);
    if (auto totyids_port = m_prefill_in_ports.find(layer_names::token_type_ids);
Nitpick: suggest renaming to type_ids_port
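For context, a standalone sketch of the reset pattern this rename applies to: before a new conversation the mandatory prefill input is zeroed, and the optional token_type_ids input is only touched if its port exists. The plain std containers below are illustrative stand-ins, not the actual NPUW request or fill_tensor_bytes helper.

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    // Stand-in for the prefill request's input buffers, keyed by layer name
    // the same way m_prefill_in_ports is keyed in the PR.
    std::unordered_map<std::string, std::vector<int64_t>> prefill_inputs{
        {"input_ids", std::vector<int64_t>(8, 42)},
        {"token_type_ids", std::vector<int64_t>(8, 1)},
    };

    // The mandatory input is always reset.
    auto& input_ids = prefill_inputs.at("input_ids");
    std::fill(input_ids.begin(), input_ids.end(), 0);

    // token_type_ids is optional: reset it only if the model exposes it.
    // (The rename suggested above would call this iterator type_ids_port.)
    if (auto type_ids_port = prefill_inputs.find("token_type_ids");
        type_ids_port != prefill_inputs.end()) {
        std::fill(type_ids_port->second.begin(), type_ids_port->second.end(), 0);
    }

    std::cout << "token_type_ids[0] after reset: "
              << prefill_inputs.at("token_type_ids")[0] << "\n";  // prints 0
}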
-    std::unordered_map<std::string, ov::Output<const ov::Node>> in_ports,
-    std::unordered_map<std::string, ov::Output<const ov::Node>> out_ports,
+    const std::unordered_map<std::string, ov::Output<const ov::Node>>& in_ports,
+    const std::unordered_map<std::string, ov::Output<const ov::Node>>& out_ports,
Thanks!
auto attn_mask_in_tensor = m_prefill_request->get_tensor(m_prefill_in_ports.at(layer_names::attention_mask));
auto pos_ids_in_tensor = m_prefill_request->get_tensor(m_prefill_in_ports.at(layer_names::position_ids));
auto to_ty_ids_in_tensor = ov::npuw::util::TensorPtr();
Nitpick: suggest renaming to just types_ids_in_tensor
Nitpick: in some places it is type and not types
auto to_ty_ids_in_tensor = ov::npuw::util::TensorPtr();

if (auto ttis_port = m_prefill_in_ports.find(layer_names::token_type_ids); ttis_port != m_prefill_in_ports.end()) {
Nitpick: suggest renaming to type_ids_port
std::copy_n(attention_mask->data<int64_t>() + kvcache_desc.num_stored_tokens,
            current_prompts_len,
            attn_mask_in_tensor->data<int64_t>() + attn_mask_in_tensor->get_size() - current_prompts_len);
if (to_ty_ids_in_tensor) {
We also update the attention mask after each chunk, before the next iteration, at the end of the loop:

// Update attention mask for the next iteration
std::copy_n(attn_mask_in_tensor->data<int64_t>() + attn_mask_in_tensor->get_size() - current_prompts_len,
            current_prompts_len,
            attn_mask_in_tensor->data<int64_t>() + kvcache_desc.num_stored_tokens - current_prompts_len);

Do we need to do this for token_type_ids as well?
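To make the question concrete, below is a standalone sketch of that rolling-window update applied to a plain buffer: each chunk's values are first placed at the tail of the fixed-size window, then shifted next to the already-stored prefix once the chunk has been processed. Whether a token_type_ids buffer of the same layout needs the same post-chunk shift is exactly what is being asked; the names, sizes, and buffers here are illustrative, not the actual NPUW tensors.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    // Fixed-size prefill window and a full-context mask, standing in for
    // attn_mask_in_tensor and the user-provided attention_mask.
    constexpr std::size_t window = 8;
    std::vector<int64_t> full_mask = {1, 1, 1, 1, 1, 1};  // whole prompt
    std::vector<int64_t> mask_in(window, 0);

    std::size_t num_stored_tokens = 0;
    const std::size_t chunk = 3;

    for (std::size_t off = 0; off < full_mask.size(); off += chunk) {
        const std::size_t cur_len = std::min(chunk, full_mask.size() - off);

        // 1. Place the current chunk's values at the tail of the window
        //    (mirrors the first copy_n in infer_chunked_prefill()).
        std::copy_n(full_mask.data() + num_stored_tokens, cur_len,
                    mask_in.data() + mask_in.size() - cur_len);

        // ... run the chunk's inference here ...
        num_stored_tokens += cur_len;

        // 2. After the call, move those values next to the already-stored
        //    prefix so the tail is free for the next chunk (the second
        //    copy_n quoted in the review comment above).
        std::copy_n(mask_in.data() + mask_in.size() - cur_len, cur_len,
                    mask_in.data() + num_stored_tokens - cur_len);
    }

    for (int64_t v : mask_in) std::cout << v << ' ';
    std::cout << '\n';  // prints: 1 1 1 1 1 1 1 1
}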
     ov::SoPtr<ov::ITensor> attention_mask,
-    ov::SoPtr<ov::ITensor> position_ids) {
+    ov::SoPtr<ov::ITensor> position_ids,
+    ov::SoPtr<ov::ITensor> token_types_ids) {
Nitpick: in some places it is type and not types
            input_ids->get_byte_size(),
            reinterpret_cast<uint8_t*>(kv_input_ids->data()) + kv_input_ids->get_byte_size() - input_ids->get_byte_size());

if (token_types_ids) {
I think we also need to fill that tensor with 0s under the if (!m_generate_initialized) condition above (as done for the other inputs), if token_type_ids behaves like attention_mask and contains data for the whole context.
Do they behave like attention_mask? (I thought they do, because of the code in infer_chunked_prefill().)
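A standalone sketch of the handling being suggested, assuming token_type_ids really does behave like attention_mask: the whole buffer is cleared once before the first generate call, and afterwards only the slot of the newest token is written. The buffer and helper below are illustrative stand-ins, not the real kv-cache request or LLMInferRequest members.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    // Generate-stage input buffer spanning the whole context, standing in
    // for the kv-cache request's token_type_ids tensor.
    constexpr std::size_t max_context = 8;
    std::vector<int64_t> token_type_ids_in(max_context, -1);  // stale data

    bool generate_initialized = false;
    std::size_t num_stored_tokens = 5;  // tokens already in the kv-cache

    auto generate_step = [&](int64_t new_token_type) {
        if (!generate_initialized) {
            // As with the other inputs, clear the whole tensor once before
            // the first generate call so no stale data is left over.
            std::fill(token_type_ids_in.begin(), token_type_ids_in.end(), 0);
            generate_initialized = true;
        }
        // Then only the slot of the newly generated token is updated.
        token_type_ids_in[num_stored_tokens] = new_token_type;
        ++num_stored_tokens;
    };

    generate_step(0);
    generate_step(0);

    for (int64_t v : token_type_ids_in) std::cout << v << ' ';
    std::cout << '\n';  // prints: 0 0 0 0 0 0 0 0
}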
            reinterpret_cast<uint8_t*>(kv_input_ids->data()) + kv_input_ids->get_byte_size() - input_ids->get_byte_size());

if (token_types_ids) {
    auto r_token_type_ids = m_kvcache_request->get_tensor(m_kvcache_in_ports.at(layer_names::token_type_ids));
Why r?
if (m_first_run) {
    // Most of the models have position_ids->data<int64_t>()[0] == 0 for the first infer
    // But gemma3 has it's == 1
Nitpick: do we need 's here?
if (m_first_run) {
    // Most of the models have position_ids->data<int64_t>()[0] == 0 for the first infer
    // But gemma3 has it's == 1
    // We need to store original zero position id in order to distinguish between prefill and generate stage
Maybe we can call it first position id, but feel free to skip this comment.
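One plausible reading of the comment, sketched standalone: remember whatever position id shows up on the first run (0 for most models, 1 for gemma3) and treat a request that starts from that stored baseline as prefill. The class below is an illustrative interpretation, not the actual LLMInferRequest logic.

#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

// Hypothetical helper: stores the position id seen on the very first infer
// and uses it as the baseline for telling prefill from generate.
class StageDetector {
public:
    bool is_prefill(const std::vector<int64_t>& position_ids) {
        if (!m_first_position_id) {
            // First run: remember the model's original "zero" position id.
            m_first_position_id = position_ids.front();
            return true;
        }
        // A prefill restarts from the stored baseline; a generate step
        // continues from wherever the previous call left off.
        return position_ids.front() == *m_first_position_id;
    }

private:
    std::optional<int64_t> m_first_position_id;
};

int main() {
    StageDetector detector;
    std::cout << detector.is_prefill({1, 2, 3, 4}) << '\n';  // 1: first prefill (gemma3-style start at 1)
    std::cout << detector.is_prefill({5}) << '\n';           // 0: generate step
    std::cout << detector.is_prefill({1, 2}) << '\n';        // 1: new prompt, prefill again
}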