Cherry pick Habana software v1.21.0 commits #2180


Open
xin3he wants to merge 47 commits into master from cherry_pick_v1.21

Conversation

@xin3he xin3he (Contributor) commented Apr 22, 2025

No description provided.

RafLit and others added 30 commits April 22, 2025 08:50
…[SW-214296] (#97)

* [SW-205334] wrap calls to original module methods

* add wrap for fetch_from_cache

* fix PatchedKVCache test, revert the PatchedModuleBase

* PatchedParallelLMHead switch orig_linear_apply

* fix PatchedKVCache

---------

Co-authored-by: Rafal Litka <[email protected]>
Signed-off-by: Xin He <[email protected]>
* [SW-218081] temporarily disable fp8_static_quant test

* skip test block wise measurements

* dummy commit to re-trigger CI
* [FSW-12066] Multiple PT Bridge support

Added abstraction for per device quantized func wrapper

* [SW-12066] Add quantized func wrapper for multi-device

Removed hpu code from quant_dequant
add unit test

* TEMP commit

* Further adjustments

Rebased above latest master_next
added INC device enum (instead of using device-specific enums)
removed unnecessary device-specific imports
Moved device-specific imports to internal scopes, or wrapped them with try-except

* Add fixes after running Gaudi tests

* added __init__ files and removed unneeded __init__ function

* Call directly to get_quantized_func_wrapper_object

and remove old hpu ops file

* More fixes to make xpu tests work

added import xpu_modules
device for scale calculation
added file headers

* adjustments for dynamic MoE vLLM

* Adjustments after rebase of scale refactoring

* Rebase for 1.21, call in load API, refine factory class

* rebase from master_next 8_2_25
* add init/clear calls in load api
* refine factory class impl, use class methods and members only
* refine func wrapper init file, use only api functions

* Fixes after CR 10_02

* More Fixes after CR

* Compare INC Accelerator enum value

* fix var name

* Added Matmul OP type

* Rename gemm and matmul op types

Signed-off-by: Xin He <[email protected]>
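Several commits above sketch a per-device dispatch: a factory holding quantized func wrapper classes per INC accelerator type, accessed via get_quantized_func_wrapper_object and implemented with class methods and members only. Below is a minimal illustration of that shape; every class, enum, and op-type name is an assumption for illustration, not the actual INC code.

```python
from enum import Enum


class INCAcceleratorType(Enum):
    # Illustrative device enum; the real INC enum may differ.
    HPU = "hpu"
    XPU = "xpu"


class QuantizedFuncWrapperFactory:
    """Per-device wrapper registry using class methods and members only."""

    _registry = {}    # {(device, op_type): wrapper class}
    _instances = {}   # lazily created singleton wrapper objects

    @classmethod
    def register(cls, device, op_type):
        def decorator(wrapper_cls):
            cls._registry[(device, op_type)] = wrapper_cls
            return wrapper_cls
        return decorator

    @classmethod
    def get_quantized_func_wrapper_object(cls, device, op_type):
        key = (device, op_type)
        if key not in cls._instances:
            cls._instances[key] = cls._registry[key]()
        return cls._instances[key]

    @classmethod
    def clear(cls):
        # Paired init/clear calls from the load API reset the cache.
        cls._instances.clear()


@QuantizedFuncWrapperFactory.register(INCAcceleratorType.HPU, "matmul")
class HPUQuantizedMatmulWrapper:
    def __call__(self, *args, **kwargs):
        raise NotImplementedError("device-specific fp8 matmul kernel call")
```

Keeping device-specific imports inside the wrapper classes matches the commit notes about moving such imports into internal scopes or try-except blocks.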

* [ALGO-808] add support for int4 weights + fp8 activations - phase 1

* Add code for quantizing only single input to PatchedMatmul

* w4a8 new kernel

---------

Co-authored-by: Tomer Gafni <[email protected]>
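For the w4a8 phase-1 change above ("quantizing only single input to PatchedMatmul"), here is a hedged sketch of the idea; all helper functions (quant_fp8, dequant_int4, matmul_fp8) are hypothetical stand-ins, not the actual kernels.

```python
import torch


def matmul_w4a8(x, q_weight, w_scales, dequant_int4, quant_fp8, matmul_fp8):
    # Only the activation input is quantized to fp8 here; the int4 weight is
    # dequantized up front (a fused w4a8 kernel would consume it directly).
    x_fp8, x_scale = quant_fp8(x)
    w = dequant_int4(q_weight, w_scales)
    return matmul_fp8(x_fp8, w, x_scale)
```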
* [SW-214378] remove creation of nc_workspace in each INC run

* remove the commented line
…ear (#1)

This commit changes the PatchedRowParallelLinear collective func to a custom
all_reduce function that:
- During measurement: measures the all_reduce output and the maximum
matmul_fp8 output.
- During quantization: quantizes the all_gather and all_to_all ops inside the
all_reduce func, as they are performed in fp8.
Add a branch in reduce_forward_quant so the fp8 optimization is applied only in
the decode phase.

---------

Co-authored-by: Roi Tiefenbrunn <[email protected]>
Co-authored-by: Linoy Buchnik <[email protected]>
Co-authored-by: linoy buchnik <[email protected]>
Signed-off-by: Xin He <[email protected]>
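A minimal sketch of the decode-only branching the commit above describes, assuming hypothetical quant_fn/dequant_fn helpers and using a plain all_gather to stand in for the real fp8 collective stages:

```python
import torch
import torch.distributed as dist


def reduce_forward_quant(x, quant_fn, dequant_fn, is_decode):
    """Sketch: run the collective in fp8 only on the decode path."""
    if not is_decode:
        dist.all_reduce(x)  # prefill keeps the ordinary collective
        return x
    # Decode: quantize, run the collective stages in fp8 (the real
    # implementation quantizes its all_gather/all_to_all ops), then
    # dequantize and reduce locally.
    x_fp8 = quant_fn(x)
    gathered = [torch.empty_like(x_fp8) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, x_fp8)
    return sum(dequant_fn(t) for t in gathered)
```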
* Add blockwise quantization for GPTQ

* sharded checkpoint additions

* CR fixes

* CR fixes #2

* fix error caused in CI

* update the safetensors requirements file to enable hpu support in safetensors

Signed-off-by: Xin He <[email protected]>
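For context on "blockwise quantization for GPTQ": each block (group) of input channels gets its own scale. A minimal round-to-nearest sketch of the grouping, under the assumption of symmetric int4; real GPTQ additionally applies Hessian-based error compensation per column.

```python
import torch


def quantize_blockwise(weight: torch.Tensor, block_size: int = 128, bits: int = 4):
    """Round-to-nearest blockwise (group-wise) weight quantization sketch."""
    out_features, in_features = weight.shape
    assert in_features % block_size == 0, "in_features must divide into blocks"
    qmax = 2 ** (bits - 1) - 1  # 7 for int4
    # One scale per (row, block) pair.
    w = weight.reshape(out_features, in_features // block_size, block_size)
    scales = w.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / qmax
    q = torch.clamp(torch.round(w / scales), -qmax - 1, qmax)
    return q.reshape(weight.shape), scales.squeeze(-1)
```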
* Raise error when measuring PC without shapes

* Update measure.py
Change-Id: I41a07985d602936e5d6c4f25a061a009bc251253

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
* [SW-218484] Enhance log for saving

Signed-off-by: Xin He <[email protected]>

* fix

Signed-off-by: Xin He <[email protected]>

---------

Signed-off-by: Xin He <[email protected]>
Co-authored-by: Xin He <[email protected]>
Signed-off-by: Xin He <[email protected]>
* fp8 aware gptq (hybrid gptq)

* review1

* loading bias to mixed low precision

* fixing tests for fp8 aware quantization and hybrid re-ordering

* Addressed second review round comments

* Adressed review 3 comments

---------

Co-authored-by: Asaf Karnieli <[email protected]>
…config.json (#181)

* [SW-222513] OSError: does not appear to have a file named generation_config.json

* Update save_load.py
- Quantize model to W4A8 using auto-round
- Load W4A8 model

---------

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
Co-authored-by: Asaf Karnieli <[email protected]>
Co-authored-by: Tomer Gafni <[email protected]>
…ons (#162)

* [SW-219831] - Set scale attributes in INC to reduce graph recompilation

* add scaling methods ids

* fix scaling method ids check and set

* enable feature also for Load QuantMode

* move scale tensors to cpu when feature is enabled

* fix scaling methods ids to start at 1

* fix cr comments

* remove unnecessary imports

* fix cr comments

* fix more cr comments

* fix cr comments

* move scale to float on cpu in scale handler for dynamic scaling

* fix cr comments

* Add unit test

* fix sending scale tensor to bridge and unit-test bug
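The recompilation fix above hinges on keeping scales off-device. A hedged illustration of the idea (attribute names are assumptions): scales become host-side constants, so the compiled graph no longer sees device tensors whose identity changes between runs.

```python
import torch


def set_scale_attributes(module: torch.nn.Module, scale: torch.Tensor,
                         scaling_method_id: int):
    """Sketch: store scales as host-side constants to avoid graph recompiles."""
    # Ids start at 1 (matching the "fix scaling methods ids to start at 1"
    # commit), so 0 can safely mean "not set".
    assert scaling_method_id >= 1
    module.scaling_method_id = scaling_method_id
    # Move the scale to CPU (as the feature describes) and keep a plain
    # float, so the graph sees a scalar constant rather than a device tensor.
    module.scale = float(scale.detach().to("cpu"))
```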
* refine PatchVLLMKVCache

* move cache out of args

* revert option2

* add get_cache

* Revert "add get_cache"

This reverts commit a89d9d23810ce594743504fea4bc5cd49e8d4192.

* Revert "revert option2"

This reverts commit d2b124c1d30717baf482eb887ba5ab3cb09ac51d.

* add comments

* update comment

* Dummy commit for triggering CI

* Dummy commit for triggering CI
* [SW-218081] move htcore.hpu_set_env() to conftest

Signed-off-by: Xin He <[email protected]>

* Update conftest.py

* use htcore.hpu_set_inference_env()

Signed-off-by: Xin He <[email protected]>

---------

Signed-off-by: Xin He <[email protected]>
Co-authored-by: Xin He <[email protected]>
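The conftest change above centralizes HPU env setup. A sketch of what that might look like, assuming the conventional `htcore` alias for `habana_frameworks.torch.core` and a session-scoped autouse fixture (the exact scope is an assumption):

```python
# conftest.py
import pytest


@pytest.fixture(scope="session", autouse=True)
def hpu_inference_env():
    # Set the HPU inference environment once for the whole test session,
    # instead of calling htcore.hpu_set_env() in individual tests.
    import habana_frameworks.torch.core as htcore
    htcore.hpu_set_inference_env()
    yield
```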
@xin3he xin3he force-pushed the cherry_pick_v1.21 branch 5 times, most recently from 968b6bd to 52d8b3c Compare April 25, 2025 02:41
@xin3he xin3he assigned thuang6 and unassigned thuang6 Apr 25, 2025
@xin3he xin3he requested review from thuang6 and XuehaoSun April 25, 2025 02:59
@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

Cherry picking Habana software v1.21.0 commits to integrate new XPU and HPU quantized function wrappers and refactor quantization/dequantization flows.

  • Introduces new XPU and HPU wrapper classes and updates the factory API for device-specific quantized ops.
  • Refactors quantize, quant_dequant, and patching logic and updates the example weight-only quantization script accordingly.

Reviewed Changes

Copilot reviewed 72 out of 75 changed files in this pull request and generated 1 comment.

Summary per file:
  • neural_compressor/torch/algorithms/fp8_quant/_core/quantized_func_wrappers/xpu/xpu_quantized_func_wrapper.py — Added XPU quantized function wrappers with TODO placeholders and minor spelling corrections.
  • neural_compressor/torch/algorithms/fp8_quant/_core/quantized_func_wrappers/hpu/hpu_quantized_func_wrapper.py — Updated HPU quantized ops wrappers with refined naming and API usage.
  • neural_compressor/torch/algorithms/fp8_quant/_core/quantized_func_wrapper_api.py — Extended API to support device-specific factory initialization.
  • neural_compressor/torch/algorithms/fp8_quant/_core/quantize.py, quant_dequant.py — Updated quantization/dequantization flow to use new wrapper APIs.
  • neural_compressor/torch/algorithms/fp8_quant/_core/patching_common.py — Refactored module patching logic and updated module type mappings.
  • neural_compressor/common/base_config.py, utils, constants.py, examples/... — Minor adjustments in configuration and example scripts to accommodate new quantized op paths.
Files not reviewed (3)
  • examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/requirements.txt: Language not supported
  • examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/run_benchmark.sh: Language not supported
  • examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/run_quant.sh: Language not supported
Comments suppressed due to low confidence (1)

neural_compressor/torch/algorithms/fp8_quant/_core/common.py:198

  • The function uses os.getenv but there is no import for 'os' in this file; please add 'import os' at the top of the file.
def is_runtime_scale_patching():
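The fix is simply the missing import. The body below is a hypothetical reconstruction shown only to make the suggestion concrete; the env-var name is an assumption, and only the `import os` and the function name come from the review.

```python
import os  # the import the review asks for


def is_runtime_scale_patching():
    # Hypothetical body: gate the feature on an environment variable.
    return os.getenv("RUNTIME_SCALE_PATCHING", "0").lower() in ("1", "true")
```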

@xin3he xin3he force-pushed the cherry_pick_v1.21 branch from adbcfbf to d34d436 Compare April 25, 2025 03:03
@xin3he xin3he force-pushed the cherry_pick_v1.21 branch from 6c0d3ca to f69f38e Compare April 25, 2025 03:36
@xin3he xin3he force-pushed the cherry_pick_v1.21 branch from a626f6e to bf12522 Compare April 25, 2025 03:46
@xin3he xin3he force-pushed the cherry_pick_v1.21 branch from e27f3dc to 66fa4f2 Compare April 25, 2025 03:49
@yiliu30 yiliu30 requested a review from Copilot April 25, 2025 05:23
@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

This PR cherry-picks Habana software v1.21.0 commits: it introduces new XPU quantized function wrappers, updates the HPU wrappers, and modifies the quantization, dequantization, and patching logic. Key changes include the new XPU quantized function wrapper implementations, adjustments to how quantized function wrappers are used in the quantization and dequantization flows, and further changes in the common utilities and example scripts to support these updates.

Reviewed Changes

Copilot reviewed 73 out of 76 changed files in this pull request and generated no comments.

Summary per file:
  • neural_compressor/torch/algorithms/fp8_quant/_core/quantized_func_wrappers/xpu/xpu_quantized_func_wrapper.py — Adds new XPU quantized function wrappers following standard patterns.
  • neural_compressor/torch/algorithms/fp8_quant/_core/quantized_func_wrappers/hpu/hpu_quantized_func_wrapper.py — Refactors HPU wrapper classes and updates the mapping for op types.
  • neural_compressor/torch/algorithms/fp8_quant/_core/quantize.py — Updates the quantization flow, including weight dequantization and flagging updated FP8 weights.
  • neural_compressor/torch/algorithms/fp8_quant/_core/quant_dequant.py — Modifies how casting operators are obtained and called in quantized dequantization modules.
  • examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/run_clm_no_trainer.py — Introduces new arguments and adjusts model loading/saving logic for GPTQ blockwise quantization.
  • Other files — Various updates to common utilities, logging, device handling, and configuration to support the new quantization operators.
Files not reviewed (3)
  • examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/requirements.txt: Language not supported
  • examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/run_benchmark.sh: Language not supported
  • examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/run_quant.sh: Language not supported

xin3he and others added 6 commits April 25, 2025 14:35
* add preprocess_quant_config to collect common code

Signed-off-by: Xin He <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Xin He <[email protected]>
Co-authored-by: Xin He <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Xin He <[email protected]>