Cherry pick Habana software v1.21.0 commits #2180
Conversation
…[SW-214296] (#97) * [SW-205334] wrap calls to original module methods * add wrap for fetch_from_cache * fix PatchedKVCache test, revert the PatchedModule Base * PatchedParallelLMHead switch orig_linear_apply * fix PatchedKVCache --------- Co-authored-by: Rafal Litka <[email protected]> Signed-off-by: Xin He <[email protected]>
* [SW-218081] temporarily disable fp8_static_quant test * skip test block-wise measurements * dummy commit to re-trigger CI
* [FSW-12066] Multiple PT Bridge support: added abstraction for a per-device quantized func wrapper
* [SW-12066] Add quantized func wrapper for multi-device: removed HPU code from quant_dequant, add unit test
* TEMP commit
* Further adjustments: rebased above latest master_next, added INC device enum (instead of using device-specific enums), removed unnecessary device-specific imports, moved device-specific imports to internal scopes or wrapped them with try-except
* Add fixes after running Gaudi tests
* added __init__ files and removed unneeded __init__ function
* Call directly to get_quantized_func_wrapper_object and remove old hpu ops file
* More fixes to make XPU tests work: added import of xpu_modules device for scale calculation, added file headers
* adjustments for dynamic MoE vLLM
* Adjustments after rebase of scale refactoring
* Rebase for 1.21, call in load API, refine factory class
* rebase from master_next 8_2_25
* add init/clear calls in load API
* refine factory class impl, use class methods and members only (see the sketch after this list)
* refine func wrapper init file, use only API functions
* Fixes after CR 10_02
* More fixes after CR
* Compare INC Accelerator enum value
* fix var name
* Added Matmul OP type
* Rename gemm and matmul op types
Signed-off-by: Xin He <[email protected]>
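To make the factory pattern in this commit concrete, here is a minimal sketch of a per-device quantized-func-wrapper factory that uses only class methods and members, as the commit describes. All names here (`QuantizedFuncWrapperFactory`, `INCDevice`, `register`, `clear`) are illustrative assumptions, not the actual neural-compressor API:

```python
from enum import Enum


class INCDevice(Enum):
    # INC-level device enum, instead of device-specific enums.
    HPU = "hpu"
    XPU = "xpu"


class QuantizedFuncWrapperFactory:
    """Per-device wrapper registry using class methods and members only."""

    _wrappers = {}  # INCDevice -> wrapper instance

    @classmethod
    def register(cls, device):
        # Decorator so each device module registers its own wrapper on import,
        # keeping device-specific imports out of this common module.
        def decorator(wrapper_cls):
            cls._wrappers[device] = wrapper_cls()
            return wrapper_cls
        return decorator

    @classmethod
    def get_quantized_func_wrapper_object(cls, device):
        return cls._wrappers[device]

    @classmethod
    def clear(cls):
        # Invoked from the load API to reset state between runs.
        cls._wrappers.clear()


@QuantizedFuncWrapperFactory.register(INCDevice.XPU)
class XPUQuantizedFuncWrapper:
    def matmul(self, a, b):  # placeholder op entry
        raise NotImplementedError
```

Registering wrappers from inside each device module is what lets the common code avoid importing HPU or XPU runtime dependencies at module scope.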
Signed-off-by: Xin He <[email protected]>
) * [ALGO-808] add support for int4 weights + fp8 activations - phase 1 * Add code for quantizing only single input to PatchedMatmul * w4a8 new kernel --------- Co-authored-by: Tomer Gafni <[email protected]>
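The "single input" part of this commit is easy to picture: only one matmul operand goes through fp8 quantize/dequantize while the other passes through untouched. A minimal sketch, with all names, the signature, and dtype choices assumed rather than taken from PatchedMatmul:

```python
import torch


def matmul_single_input_quant(a, b, scale_a):
    # Only the first input is quantized to fp8; the second is used as-is.
    a_fp8 = (a / scale_a).to(torch.float8_e4m3fn)
    # Upcast for the multiply itself; a real fused kernel would stay in fp8.
    return torch.matmul(a_fp8.to(a.dtype), b) * scale_a
```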
* [SW-214378] remove creation of nc_workspace in each INC run * remove the commented line
…e measurements for scale calculation (#152)
…ear (#1) This commit modifies the PatchedRowParallelLinear collective func to a custom all_reduce function that: during measurement, measures the all_reduce output and the maximum matmul_fp8 output; during quantization, quantizes the all_gather and all_to_all ops inside the all_reduce func, since they are performed in fp8. Adds a branch in reduce_forward_quant so the fp8 optimization is applied only in the decode phase.
---------
Co-authored-by: Roi Tiefenbrunn <[email protected]>
Co-authored-by: Linoy Buchnik <[email protected]>
Co-authored-by: linoy buchnik <[email protected]>
Signed-off-by: Xin He <[email protected]>
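The branching this commit describes can be sketched as follows. Everything here is an illustrative assumption (function name, observer, the uint8-view trick for transporting fp8), not the INC code, and it presumes an initialized process group:

```python
import torch
import torch.distributed as dist


def reduce_forward(tensor, quant_mode, is_decode, observer=None, scale=1.0):
    if not quant_mode:
        # Measurement: run the collective as-is and record the output range.
        dist.all_reduce(tensor)
        if observer is not None:
            observer.record(tensor.abs().max())
        return tensor
    if is_decode:
        # Quantization, decode phase only: communicate in fp8, reduce locally.
        payload = (tensor / scale).to(torch.float8_e4m3fn).view(torch.uint8)
        gathered = [torch.empty_like(payload) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, payload)  # fp8 payload shrinks communication
        chunks = [g.view(torch.float8_e4m3fn).to(tensor.dtype) for g in gathered]
        return torch.stack(chunks).sum(dim=0) * scale
    dist.all_reduce(tensor)  # prefill keeps the high-precision path
    return tensor
```

Gathering fp8 shards and summing locally is mathematically equivalent to an all_reduce(SUM), which is why the decode branch can trade precision for communication volume.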
* Add blockwise quantization for GPTQ * sharded checkpoint additions * CR fixes * CR fixes #2 * fix error caused in CI * update the safetensors requirements file to support safetensors on HPU Signed-off-by: Xin He <[email protected]>
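As background on the block-wise part, here is a minimal per-block rounding sketch. It shows only the per-block scale bookkeeping that sharded checkpoints need to store, not GPTQ's error-compensating update; names and defaults are illustrative:

```python
import torch


def quantize_blockwise(weight, block_size=128, n_bits=4):
    qmax = 2 ** (n_bits - 1) - 1  # e.g. 7 for 4-bit signed
    q_blocks, scales = [], []
    for start in range(0, weight.shape[1], block_size):
        block = weight[:, start:start + block_size]
        # Each column block gets its own scale.
        scale = block.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
        q_blocks.append(torch.round(block / scale).clamp(-qmax - 1, qmax))
        scales.append(scale)
    # Sharded checkpoints store the int blocks and per-block scales side by side.
    return torch.cat(q_blocks, dim=1), torch.cat(scales, dim=1)
```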
* Raise error when measuring PC without shapes * Update measure.py
Change-Id: I41a07985d602936e5d6c4f25a061a009bc251253 Signed-off-by: Yi Liu <[email protected]> Co-authored-by: Yi Liu <[email protected]>
* [SW-218484] Enhance log for saving Signed-off-by: Xin He <[email protected]> * fix Signed-off-by: Xin He <[email protected]> --------- Signed-off-by: Xin He <[email protected]> Co-authored-by: Xin He <[email protected]> Signed-off-by: Xin He <[email protected]>
#167) Co-authored-by: Ivan Antonov <[email protected]>
* fp8-aware GPTQ (hybrid GPTQ) * review 1 * loading bias to mixed low precision * fixing tests for fp8-aware quantization and hybrid re-ordering * Addressed second review round comments * Addressed review 3 comments --------- Co-authored-by: Asaf Karnieli <[email protected]>
…config.json (#181) * [SW-222513] OSError: does not appear to have a file named generation_config.json * Update save_load.py
- Quantize model to W4A8 using auto-round - Loading W4A8 model --------- Signed-off-by: Yi Liu <[email protected]> Co-authored-by: Yi Liu <[email protected]> Co-authored-by: Asaf Karnieli <[email protected]> Co-authored-by: Tomer Gafni <[email protected]>
…ons (#162)
* [SW-219831] Set scale attributes in INC to reduce graph recompilation
* add scaling method ids
* fix scaling method ids check and set
* enable feature also for Load QuantMode
* move scale tensors to CPU when feature is enabled (see the sketch after this list)
* fix scaling method ids to start at 1
* fix CR comments
* remove unnecessary imports
* fix CR comments
* fix more CR comments
* fix CR comments
* move scale to float on CPU in scale handler for dynamic scaling
* fix CR comments
* Add unit test
* fix sending scale tensor to bridge and unit-test bug
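The recompilation point is easiest to see in code: a scale held as a plain Python float on the CPU becomes a compile-time constant, while a fresh device tensor per scale can trigger a new graph. A minimal sketch, with class and attribute names assumed rather than taken from the INC implementation:

```python
import torch


class ScaleHandler:
    """Keeps scales as CPU floats when runtime scale patching is enabled."""

    def __init__(self, scale: torch.Tensor, runtime_scale_patching: bool):
        if runtime_scale_patching:
            # A plain float is baked into the compiled graph as a constant,
            # so changing models/scales does not force a recompilation.
            self.scale = scale.float().cpu().item()
        else:
            # Default path keeps the device tensor.
            self.scale = scale
```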
* refine PatchVLLMKVCache
* move cache out of args
* revert option2
* add get_cache
* Revert "add get_cache". This reverts commit a89d9d23810ce594743504fea4bc5cd49e8d4192.
* Revert "revert option2". This reverts commit d2b124c1d30717baf482eb887ba5ab3cb09ac51d.
* add comments
* update comment
* Dummy commit for triggering CI
* Dummy commit for triggering CI
Signed-off-by: Xin He <[email protected]> Co-authored-by: Xin He <[email protected]>
* [SW-218081] move htcore.hpu_set_env() to conftest Signed-off-by: Xin He <[email protected]> * Update conftest.py * use htcore.hpu_set_inference_env() Signed-off-by: Xin He <[email protected]> --------- Signed-off-by: Xin He <[email protected]> Co-authored-by: Xin He <[email protected]>
Force-pushed from 968b6bd to 52d8b3c
Pull Request Overview
Cherry-picks Habana software v1.21.0 commits to integrate new XPU and HPU quantized function wrappers and refactor the quantization/dequantization flows.
- Introduces new XPU and HPU wrapper classes and updates the factory API for device-specific quantized ops.
- Refactors quantize, quant_dequant, and patching logic and updates the example weight-only quantization script accordingly.
Reviewed Changes
Copilot reviewed 72 out of 75 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
neural_compressor/torch/algorithms/fp8_quant/_core/quantized_func_wrappers/xpu/xpu_quantized_func_wrapper.py | Added XPU quantized function wrappers with TODO placeholders and minor spelling corrections. |
neural_compressor/torch/algorithms/fp8_quant/_core/quantized_func_wrappers/hpu/hpu_quantized_func_wrapper.py | Updated HPU quantized ops wrappers with refined naming and API usage. |
neural_compressor/torch/algorithms/fp8_quant/_core/quantized_func_wrapper_api.py | Extended API to support device-specific factory initialization. |
neural_compressor/torch/algorithms/fp8_quant/_core/quantize.py, quant_dequant.py | Updated quantization/dequantization flow to use new wrapper APIs. |
neural_compressor/torch/algorithms/fp8_quant/_core/patching_common.py | Refactored module patching logic and updated module type mappings. |
neural_compressor/common/base_config.py, utils, constants.py, examples/... | Minor adjustments in configuration and example scripts to accommodate new quantized op paths. |
Files not reviewed (3)
- examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/requirements.txt: Language not supported
- examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/run_benchmark.sh: Language not supported
- examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/run_quant.sh: Language not supported
Comments suppressed due to low confidence (1)
neural_compressor/torch/algorithms/fp8_quant/_core/common.py:198
- The function uses os.getenv but there is no import for 'os' in this file; please add 'import os' at the top of the file.
def is_runtime_scale_patching():
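A minimal sketch of the suggested fix, assuming the function reads an environment flag (the variable name below is an illustrative assumption, since the snippet does not show the function body):

```python
import os  # the import the Copilot comment says is missing


def is_runtime_scale_patching():
    # The exact environment variable is not shown in the review snippet;
    # the name here is a placeholder.
    return os.getenv("RUNTIME_SCALE_PATCHING", "0").lower() in ("1", "true")
```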
Force-pushed from adbcfbf to d34d436
Signed-off-by: Xin He <[email protected]>
Force-pushed from 6c0d3ca to f69f38e
Signed-off-by: Xin He <[email protected]>
Force-pushed from a626f6e to bf12522
Signed-off-by: Xin He <[email protected]>
Force-pushed from e27f3dc to 66fa4f2
for more information, see https://pre-commit.ci
Signed-off-by: Xin He <[email protected]> Co-authored-by: Xin He <[email protected]>
Pull Request Overview
This PR cherry-picks Habana software v1.21.0 commits, introduces new quantized function wrappers for XPU, and updates the HPU wrappers along with the quantization, dequantization, and patching logic. Key changes include the new XPU quantized function wrapper implementations, adjusted use of quantized function wrappers in the dequantization and quantization flows, and supporting modifications in the common utilities and example scripts.
Reviewed Changes
Copilot reviewed 73 out of 76 changed files in this pull request and generated no comments.
File | Description |
---|---|
neural_compressor/torch/algorithms/fp8_quant/_core/quantized_func_wrappers/xpu/xpu_quantized_func_wrapper.py | Adds new XPU quantized function wrappers following standard patterns. |
neural_compressor/torch/algorithms/fp8_quant/_core/quantized_func_wrappers/hpu/hpu_quantized_func_wrapper.py | Refactors HPU wrapper classes and updates the mapping for op types. |
neural_compressor/torch/algorithms/fp8_quant/_core/quantize.py | Updates the quantization flow including weight dequantization and flagging updated FP8 weights. |
neural_compressor/torch/algorithms/fp8_quant/_core/quant_dequant.py | Modifies how casting operators are obtained and called in quantized dequantization modules (see the sketch after this table). |
examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/run_clm_no_trainer.py | Introduces new arguments and adjustments to model loading/saving logic for GPTQ blockwise quantization. |
Other files | Various updates to common utilities, logging, device handling, and configuration to support the new quantization operators. |
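The quant_dequant.py change described in the table can be illustrated with a small sketch in which the module obtains its cast operators as callables instead of hard-coding HPU ops. Every name below is an assumption for illustration, not the file's actual contents:

```python
import torch


def cast_to_fp8(x, scale):
    # Hypothetical stand-in for the cast op returned by the device wrapper API.
    return (x / scale).to(torch.float8_e4m3fn)


def cast_from_fp8(x, scale, dtype):
    return x.to(dtype) * scale


class QuantDequant(torch.nn.Module):
    """Holds its cast operators as callables rather than hard-coded device ops."""

    def __init__(self, scale):
        super().__init__()
        self.scale = scale
        # In the sketch these would come from the device wrapper factory.
        self.quant_op = cast_to_fp8
        self.dequant_op = cast_from_fp8

    def forward(self, x):
        return self.dequant_op(self.quant_op(x, self.scale), self.scale, x.dtype)
```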
Files not reviewed (3)
- examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/requirements.txt: Language not supported
- examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/run_benchmark.sh: Language not supported
- examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/run_quant.sh: Language not supported
* add preprocess_quant_config to collect common code Signed-off-by: Xin He <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Xin He <[email protected]> Co-authored-by: Xin He <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
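As an illustration of what "collect common code" can look like, here is a hypothetical preprocessing helper shared by the quantization entry points; the specific steps and names are assumptions, not the committed code:

```python
def _validate(model, quant_config):
    # Hypothetical shared validation previously duplicated across entry points.
    if model is None:
        raise ValueError("model must be provided before quantization")


def preprocess_quant_config(model, quant_config):
    """Collects config preprocessing shared by prepare/convert entry points."""
    if not isinstance(quant_config, dict):
        quant_config = {"global": quant_config}  # normalize to a single shape
    quant_config.setdefault("mode", "quantize")
    _validate(model, quant_config)
    return quant_config
```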
Signed-off-by: Xin He <[email protected]>
No description provided.