Cherry pick Habana software v1.21.0 commits #2180


Open
xin3he wants to merge 47 commits into master from cherry_pick_v1.21

Conversation

@xin3he xin3he (Contributor) commented Apr 22, 2025

No description provided.

RafLit and others added 30 commits April 22, 2025 08:50
…[SW-214296] (#97)

* [SW-205334] wrap calls to original module methods

* add wrap for fetch_from_cache

* fix PatchedKVCache test, revert the PatchedModuleBase

* PatchedParallelLMHead switch orig_linear_apply

* fix PatchedKVCache

---------

Co-authored-by: Rafal Litka <[email protected]>
Signed-off-by: Xin He <[email protected]>
* [SW-218081] temporarily disable fp8_static_quant test

* skip test block wise measurements

* dummy commit to re-trigger CI
* [FSW-12066] Multiple PT Bridge support

Added abstraction for per device quantized func wrapper

* [SW-12066] Add quantized func wrapper for multi-device

Removed hpu code from quant_dequant
add unit test

* TEMP commit

* Further adjustments

Rebased above latest master_next
added INC device enum (instead of using device-specific enums)
removed unnecessary device-specific imports
Moved device-specific imports to internal scopes, or wrapped them with try-except

* Add fixes after running Gaudi tests

* added __init__ files and removed unneeded __init__ function

* Call directly to get_quantized_func_wrapper_object

and remove old hpu ops file

* More fixes to make xpu tests work

added import xpu_modules
device for scale calculation
added file headers

* adjustments for dynamic MoE vLLM

* Adjustments after rebase of scale refactoring

* Rebase for 1.21, call in load API, refine factory class

* rebase from master_next 8_2_25
* add init/clear calls in load api
* refine factory class impl, use class methods and members only
* refine func wrapper init file, use only api functions

* Fixes after CR 10_02

* More Fixes after CR

* Compare INC Accelerator enum value

* fix var name

* Added Matmul OP type

* Rename gemm and matmul op types

Signed-off-by: Xin He <[email protected]>
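Several commits above sketch a per-device dispatch: a factory holding quantized func wrapper classes per INC accelerator type, accessed via get_quantized_func_wrapper_object and implemented with class methods and members only. Below is a minimal illustration of that shape; every class, enum, and op-type name is an assumption for illustration, not the actual INC code.

```python
from enum import Enum


class INCAcceleratorType(Enum):
    # Illustrative device enum; the real INC enum may differ.
    HPU = "hpu"
    XPU = "xpu"


class QuantizedFuncWrapperFactory:
    """Per-device wrapper registry using class methods and members only."""

    _registry = {}    # {(device, op_type): wrapper class}
    _instances = {}   # lazily created singleton wrapper objects

    @classmethod
    def register(cls, device, op_type):
        def decorator(wrapper_cls):
            cls._registry[(device, op_type)] = wrapper_cls
            return wrapper_cls
        return decorator

    @classmethod
    def get_quantized_func_wrapper_object(cls, device, op_type):
        key = (device, op_type)
        if key not in cls._instances:
            cls._instances[key] = cls._registry[key]()
        return cls._instances[key]

    @classmethod
    def clear(cls):
        # Paired init/clear calls from the load API reset the cache.
        cls._instances.clear()


@QuantizedFuncWrapperFactory.register(INCAcceleratorType.HPU, "matmul")
class HPUQuantizedMatmulWrapper:
    def __call__(self, *args, **kwargs):
        raise NotImplementedError("device-specific fp8 matmul kernel call")
```

Keeping device-specific imports inside the wrapper classes matches the commit notes about moving such imports into internal scopes or try-except blocks.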

* [ALGO-808] add support for int4 weights + fp8 activations - phase 1

* Add code for quantizing only single input to PatchedMatmul

* w4a8 new kernel

---------

Co-authored-by: Tomer Gafni <[email protected]>
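For the w4a8 phase-1 change above ("quantizing only single input to PatchedMatmul"), here is a hedged sketch of the idea; all helper functions (quant_fp8, dequant_int4, matmul_fp8) are hypothetical stand-ins, not the actual kernels.

```python
import torch


def matmul_w4a8(x, q_weight, w_scales, dequant_int4, quant_fp8, matmul_fp8):
    # Only the activation input is quantized to fp8 here; the int4 weight is
    # dequantized up front (a fused w4a8 kernel would consume it directly).
    x_fp8, x_scale = quant_fp8(x)
    w = dequant_int4(q_weight, w_scales)
    return matmul_fp8(x_fp8, w, x_scale)
```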
* [SW-214378] remove creation of nc_workspace in each INC run

* remove the commented line
…ear (#1)

This commit changes the PatchedRowParallelLinear collective func to a custom
all_reduce function that:
- During measurement: measures the all_reduce output and the maximum
matmul_fp8 output.
- During quantization: quantizes the all_gather and all_to_all ops inside the
all_reduce func, as they are performed in fp8.
Add a branch in reduce_forward_quant so the fp8 optimization is applied only in
the decode phase.

---------

Co-authored-by: Roi Tiefenbrunn <[email protected]>
Co-authored-by: Linoy Buchnik <[email protected]>
Co-authored-by: linoy buchnik <[email protected]>
Signed-off-by: Xin He <[email protected]>
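A minimal sketch of the decode-only branching the commit above describes, assuming hypothetical quant_fn/dequant_fn helpers and using a plain all_gather to stand in for the real fp8 collective stages:

```python
import torch
import torch.distributed as dist


def reduce_forward_quant(x, quant_fn, dequant_fn, is_decode):
    """Sketch: run the collective in fp8 only on the decode path."""
    if not is_decode:
        dist.all_reduce(x)  # prefill keeps the ordinary collective
        return x
    # Decode: quantize, run the collective stages in fp8 (the real
    # implementation quantizes its all_gather/all_to_all ops), then
    # dequantize and reduce locally.
    x_fp8 = quant_fn(x)
    gathered = [torch.empty_like(x_fp8) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, x_fp8)
    return sum(dequant_fn(t) for t in gathered)
```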
* Add blockwise quantization for GPTQ

* sharded checkpoint additions

* CR fixes

* CR fixes #2

* fix error caused in CI

* update the safetensors requirements file to enable hpu support in safetensors

Signed-off-by: Xin He <[email protected]>
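For context on "blockwise quantization for GPTQ": each block (group) of input channels gets its own scale. A minimal round-to-nearest sketch of the grouping, under the assumption of symmetric int4; real GPTQ additionally applies Hessian-based error compensation per column.

```python
import torch


def quantize_blockwise(weight: torch.Tensor, block_size: int = 128, bits: int = 4):
    """Round-to-nearest blockwise (group-wise) weight quantization sketch."""
    out_features, in_features = weight.shape
    assert in_features % block_size == 0, "in_features must divide into blocks"
    qmax = 2 ** (bits - 1) - 1  # 7 for int4
    # One scale per (row, block) pair.
    w = weight.reshape(out_features, in_features // block_size, block_size)
    scales = w.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / qmax
    q = torch.clamp(torch.round(w / scales), -qmax - 1, qmax)
    return q.reshape(weight.shape), scales.squeeze(-1)
```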
* Raise error when measuring PC without shapes

* Update measure.py
Change-Id: I41a07985d602936e5d6c4f25a061a009bc251253

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
* [SW-218484] Enhance log for saving

Signed-off-by: Xin He <[email protected]>

* fix

Signed-off-by: Xin He <[email protected]>

---------

Signed-off-by: Xin He <[email protected]>
Co-authored-by: Xin He <[email protected]>
Signed-off-by: Xin He <[email protected]>
* fp8 aware gptq (hybrid gptq)

* review1

* loading bias to mixed low precision

* fixing tests for fp8 aware quantization and hybrid re-ordering

* Addressed second review round comments

* Adressed review 3 comments

---------

Co-authored-by: Asaf Karnieli <[email protected]>
…config.json (#181)

* [SW-222513] OSError: does not appear to have a file named generation_config.json

* Update save_load.py
- Quantize model to W4A8 using auto-round
- Load W4A8 model

---------

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
Co-authored-by: Asaf Karnieli <[email protected]>
Co-authored-by: Tomer Gafni <[email protected]>
…ons (#162)

* [SW-219831] - Set scale attributes in INC to reduce graph recompilation

* add scaling methods ids

* fix scaling method ids check and set

* enable feature also for Load QuantMode

* move scale tensors to cpu when feature is enabled

* fix scaling methods ids to start at 1

* fix cr comments

* remove unnecessary imports

* fix cr comments

* fix more cr comments

* fix cr comments

* move scale to float on cpu in scale handler for dynamic scaling

* fix cr comments

* Add unit test

* fix sending scale tensor to bridge and unit-test bug
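The recompilation fix above hinges on keeping scales off-device. A hedged illustration of the idea (attribute names are assumptions): scales become host-side constants, so the compiled graph no longer sees device tensors whose identity changes between runs.

```python
import torch


def set_scale_attributes(module: torch.nn.Module, scale: torch.Tensor,
                         scaling_method_id: int):
    """Sketch: store scales as host-side constants to avoid graph recompiles."""
    # Ids start at 1 (matching the "fix scaling methods ids to start at 1"
    # commit), so 0 can safely mean "not set".
    assert scaling_method_id >= 1
    module.scaling_method_id = scaling_method_id
    # Move the scale to CPU (as the feature describes) and keep a plain
    # float, so the graph sees a scalar constant rather than a device tensor.
    module.scale = float(scale.detach().to("cpu"))
```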
* refine PatchVLLMKVCache

* move cache out of args

* revert option2

* add get_cache

* Revert "add get_cache"

This reverts commit a89d9d23810ce594743504fea4bc5cd49e8d4192.

* Revert "revert option2"

This reverts commit d2b124c1d30717baf482eb887ba5ab3cb09ac51d.

* add comments

* update comment

* Dummy commit for triggering CI

* Dummy commit for triggering CI
* [SW-218081] move htcore.hpu_set_env() to conftest

Signed-off-by: Xin He <[email protected]>

* Update conftest.py

* use htcore.hpu_set_inference_env()

Signed-off-by: Xin He <[email protected]>

---------

Signed-off-by: Xin He <[email protected]>
Co-authored-by: Xin He <[email protected]>
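The conftest change above centralizes HPU env setup. A sketch of what that might look like, assuming the conventional `htcore` alias for `habana_frameworks.torch.core` and a session-scoped autouse fixture (the exact scope is an assumption):

```python
# conftest.py
import pytest


@pytest.fixture(scope="session", autouse=True)
def hpu_inference_env():
    # Set the HPU inference environment once for the whole test session,
    # instead of calling htcore.hpu_set_env() in individual tests.
    import habana_frameworks.torch.core as htcore
    htcore.hpu_set_inference_env()
    yield
```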
@xin3he xin3he force-pushed the cherry_pick_v1.21 branch 5 times, most recently from 968b6bd to 52d8b3c Compare April 25, 2025 02:41
@xin3he xin3he assigned thuang6 and unassigned thuang6 Apr 25, 2025
@xin3he xin3he requested review from thuang6 and XuehaoSun April 25, 2025 02:59
@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

Cherry picking Habana software v1.21.0 commits to integrate new XPU and HPU quantized function wrappers and refactor quantization/dequantization flows.

  • Introduces new XPU and HPU wrapper classes and updates the factory API for device-specific quantized ops.
  • Refactors quantize, quant_dequant, and patching logic and updates the example weight-only quantization script accordingly.

Reviewed Changes

Copilot reviewed 72 out of 75 changed files in this pull request and generated 1 comment.

Summary per file:
  • neural_compressor/torch/algorithms/fp8_quant/_core/quantized_func_wrappers/xpu/xpu_quantized_func_wrapper.py — Added XPU quantized function wrappers with TODO placeholders and minor spelling corrections.
  • neural_compressor/torch/algorithms/fp8_quant/_core/quantized_func_wrappers/hpu/hpu_quantized_func_wrapper.py — Updated HPU quantized ops wrappers with refined naming and API usage.
  • neural_compressor/torch/algorithms/fp8_quant/_core/quantized_func_wrapper_api.py — Extended API to support device-specific factory initialization.
  • neural_compressor/torch/algorithms/fp8_quant/_core/quantize.py, quant_dequant.py — Updated quantization/dequantization flow to use new wrapper APIs.
  • neural_compressor/torch/algorithms/fp8_quant/_core/patching_common.py — Refactored module patching logic and updated module type mappings.
  • neural_compressor/common/base_config.py, utils, constants.py, examples/... — Minor adjustments in configuration and example scripts to accommodate new quantized op paths.
Files not reviewed (3)
  • examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/requirements.txt: Language not supported
  • examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/run_benchmark.sh: Language not supported
  • examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/run_quant.sh: Language not supported
Comments suppressed due to low confidence (1)

neural_compressor/torch/algorithms/fp8_quant/_core/common.py:198

  • The function uses os.getenv but there is no import for 'os' in this file; please add 'import os' at the top of the file.
def is_runtime_scale_patching():
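The fix is simply the missing import. The body below is a hypothetical reconstruction shown only to make the suggestion concrete; the env-var name is an assumption, and only the `import os` and the function name come from the review.

```python
import os  # the import the review asks for


def is_runtime_scale_patching():
    # Hypothetical body: gate the feature on an environment variable.
    return os.getenv("RUNTIME_SCALE_PATCHING", "0").lower() in ("1", "true")
```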

@xin3he xin3he force-pushed the cherry_pick_v1.21 branch from adbcfbf to d34d436 Compare April 25, 2025 03:03
@xin3he xin3he force-pushed the cherry_pick_v1.21 branch from 6c0d3ca to f69f38e Compare April 25, 2025 03:36
@xin3he xin3he force-pushed the cherry_pick_v1.21 branch from a626f6e to bf12522 Compare April 25, 2025 03:46
@xin3he xin3he force-pushed the cherry_pick_v1.21 branch from e27f3dc to 66fa4f2 Compare April 25, 2025 03:49
@yiliu30 yiliu30 requested a review from Copilot April 25, 2025 05:23
@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

This PR cherry-picks Habana software v1.21.0 commits: it introduces new XPU quantized function wrappers, updates the HPU wrappers, and modifies the quantization, dequantization, and patching logic. Key changes include the new XPU quantized function wrapper implementations, adjustments to how quantized function wrappers are used in the quantization and dequantization flows, and further changes in the common utilities and example scripts to support these updates.

Reviewed Changes

Copilot reviewed 73 out of 76 changed files in this pull request and generated no comments.

Summary per file:
  • neural_compressor/torch/algorithms/fp8_quant/_core/quantized_func_wrappers/xpu/xpu_quantized_func_wrapper.py — Adds new XPU quantized function wrappers following standard patterns.
  • neural_compressor/torch/algorithms/fp8_quant/_core/quantized_func_wrappers/hpu/hpu_quantized_func_wrapper.py — Refactors HPU wrapper classes and updates the mapping for op types.
  • neural_compressor/torch/algorithms/fp8_quant/_core/quantize.py — Updates the quantization flow, including weight dequantization and flagging updated FP8 weights.
  • neural_compressor/torch/algorithms/fp8_quant/_core/quant_dequant.py — Modifies how casting operators are obtained and called in quantized dequantization modules.
  • examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/run_clm_no_trainer.py — Introduces new arguments and adjusts model loading/saving logic for GPTQ blockwise quantization.
  • Other files — Various updates to common utilities, logging, device handling, and configuration to support the new quantization operators.
Files not reviewed (3)
  • examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/requirements.txt: Language not supported
  • examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/run_benchmark.sh: Language not supported
  • examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/run_quant.sh: Language not supported

xin3he and others added 6 commits April 25, 2025 14:35
* add preprocess_quant_config to collect common code

Signed-off-by: Xin He <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Xin He <[email protected]>
Co-authored-by: Xin He <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Xin He <[email protected]>