Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CPU] Change kvcache default type of PagedAttention to u8 for CPU plugin #1206

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

luo-cheng2021
Copy link
Contributor

Change kvcache default type of PagedAttention to u8 for CPU plugin to aligned SDPA behaviour.

@github-actions github-actions bot added category: continuous batching Continuous batching category: sampling Sampling / Decoding algorithms labels Nov 13, 2024
@ilya-lavrenov ilya-lavrenov added this to the 2025.0 milestone Nov 14, 2024
@ilya-lavrenov
Copy link
Contributor

@luo-cheng2021 could you please also include reverting of #1212 ?
I have hardcoded OpenVINO commit before u8 KV cache migration on CPU to unlock GenAI development.

@github-actions github-actions bot added the category: GHA CI based on Github actions label Nov 14, 2024
@ilya-lavrenov ilya-lavrenov removed the category: sampling Sampling / Decoding algorithms label Nov 20, 2024
@ilya-lavrenov ilya-lavrenov self-assigned this Nov 26, 2024
@ilya-lavrenov
Copy link
Contributor

@luo-cheng2021
Could you please rebase this PR on top of current master?
BTW, comparison of CB and Stateful is now passing on GenAI master. Has something changed on CPU / stateful / PA side that improves accuracy?

@github-actions github-actions bot removed the category: GHA CI based on Github actions label Dec 31, 2024
@luo-cheng2021
Copy link
Contributor Author

luo-cheng2021 commented Dec 31, 2024

@luo-cheng2021 Could you please rebase this PR on top of current master? BTW, comparison of CB and Stateful is now passing on GenAI master. Has something changed on CPU / stateful / PA side that improves accuracy?

The PR openvinotoolkit/openvino#27847 may use different splitting strategy which may slightly affect the float error. I think it just happened to delay the occurrence of the error.

@ilya-lavrenov
Copy link
Contributor

Please, fix such places in tests as well

ov::element::Type inference_precision = core.get_property("CPU", ov::hint::inference_precision);
ov::element::Type kv_cache_type = inference_precision == ov::element::bf16 ? ov::element::bf16 : ov::element::f16;

and
https://github.com/openvinotoolkit/openvino.genai/blob/master/tests/cpp/scheduler.cpp#L21-L43

I think we can introduce a single function with dummy model to have the same code in a single place.

@ilya-lavrenov
Copy link
Contributor

@luo-cheng2021 Could you please rebase this PR on top of current master? BTW, comparison of CB and Stateful is now passing on GenAI master. Has something changed on CPU / stateful / PA side that improves accuracy?

The PR openvinotoolkit/openvino#27847 may use different splitting strategy which may slightly affect the float error. I think it just happened to delay the occurrence of the error.

It's still strange that:

  • PA with fp16 kv_cache compares will with SPDA with int8 kv_cache - current master
  • both PA and SPDA with int8 kv_cache shows differences on multiple models - current PR.

Don't we have some bugs?

@luo-cheng2021
Copy link
Contributor Author

@luo-cheng2021 Could you please rebase this PR on top of current master? BTW, comparison of CB and Stateful is now passing on GenAI master. Has something changed on CPU / stateful / PA side that improves accuracy?

The PR openvinotoolkit/openvino#27847 may use different splitting strategy which may slightly affect the float error. I think it just happened to delay the occurrence of the error.

It's still strange that:

  • PA with fp16 kv_cache compares will with SPDA with int8 kv_cache - current master
  • both PA and SPDA with int8 kv_cache shows differences on multiple models - current PR.

Don't we have some bugs?

According to the 157863, the error caused by the accumulated float errors and the diverge would appear sooner or later. In the ticket there were no bugs found.

@ilya-lavrenov
Copy link
Contributor

ilya-lavrenov commented Jan 6, 2025

Merging changes from #1485 to check whether tests will pass

BTW, I expect that speculative decoding will fail anyway, because it compares SDPA vs PA and they are not matching.

ilya-lavrenov added a commit that referenced this pull request Jan 6, 2025
…elines (#1485)

OpenVINO plugins enable different kind of optimizations by default like
KV cache compression to int8, fp16 inference precision, while in GenAI
tests we want to test pipelines and how they are compared against HF /
optimum w/o extra optimizations:


https://github.com/openvinotoolkit/openvino.genai/blob/4db67aecac78885c6d1e302f348c9489e2154388/tests/python_tests/common.py#L318-L325

Hopefully, we can merge int8 KV cache by default for CB then
#1206, because in
tests we will still compare FP16 KV cache, while official Validation
should be responsible for validation against reference via WWB metrics.
@luo-cheng2021
Copy link
Contributor Author

[CPU]PageAttn with 4bit-quantization will add group quantization for u8/u4 kvcache, after it's merged, current computation of the kvcache size will be changed, and default u8 kvcache path for continuous batching must be changed(default group size will be 32).
So, this PR should be merged with #1366 together to meet the 4bit PR changes.

@luo-cheng2021
Copy link
Contributor Author

The function has been merged into #1366.

@luo-cheng2021 luo-cheng2021 marked this pull request as ready for review January 8, 2025 01:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants