[Tracing] Code AutoWrapper #1411
SUMMARY: Consolidate all build configuration into `setup.py`. The current split between `pyproject.toml` and `setup.py` seems to cause some kind of race condition/unpredictable behavior with the tooling regarding whether it will honor the version functions defined in `setup.py`. TEST PLAN: Relevant changes are identical to those in neuralmagic/compressed-tensors#304; and a build is produced internally here: https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14732959015 Signed-off-by: Domenic Barbuzzi <[email protected]>
## Background ##

The current KV cache tests are silently failing because Qmod(kv) + GPTQ(weights) is not a supported recipe: GPTQ was being skipped entirely because it was preceded by a quantization modifier with no weight schemes. If you attempt to fix this by disallowing multi-qconfig recipes, you run into the issue that model compression with KV+weights is not supported.

## Multi-round quantization ##

Previously, the model had no weight quantization and therefore no compressor, so it was saved in the frozen status rather than the compressed status. When the model was loaded, the inferred status was None, so at load time the status moved from None to frozen and therefore passed through initialization. After fixing the recipe to run GPTQ with kv+weight quantization, you run into another issue.

## KV + Weight Compression ##

Now the model has weight quantization, meaning there is a compressor, so the model is saved in the compressed status. It is then loaded with CompressedLinear modules (which are in the frozen status), causing the whole model to be inferred as compressed. Because the model is already supposedly in the compressed status, initialization does not happen, the kv_cache attention parameters are never initialized, and the model fails to load kv_cache qparams from the state dict.

## Ideal Solution ##

Ideally, we should replace modules with CompressedLinear as part of apply_quantization_status, not before applying the quantization status; doing it the other way unintentionally skips all the lifecycle steps. As a side note, we should ideally always save final models with the compressed status, even if the compressor is dense [structure is initialized, checkpoints are calibration or frozen, and final models are compressed (even if nothing was applied, so long as save_compressed)], but this is out of scope of this fix.

---------

Signed-off-by: Kyle Sayers <[email protected]>
SUMMARY: Drop the skip related to requiring `flash_attn` be installed in the tests for the `quantizing_moe` examples. Recent CI failures related to this package and its CUDA compatibility with the recently released PyTorch 2.7.0 led to the finding that it is not required for these tests. TEST PLAN: An [internal test run][1] that drops the installation of `flash-attn` and runs the changes on this branch indicates that the tests will pass (one successful so far; will mark the PR as ready once the run completes and the remaining tests show expected results). Specific relevant output (will update with other tests' results): ``` tests/examples/test_quantizing_moe.py::TestQuantizingMOE::test_deepseek_example_script[deepseek_moe_w8a8_int8.py] PASSED tests/examples/test_quantizing_moe.py::TestQuantizingMOE::test_deepseek_example_script[deepseek_moe_w8a8_fp8.py] PASSED ``` [1]: https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14712618904 Signed-off-by: Domenic Barbuzzi <[email protected]>
…t for gpu from gha (#1264) ## Purpose ## * Update all tests to use `requires_gpu` decorator * Add GPU mark skip for `test_compressor_stacking`, which requires a GPU * Add an explicit GPU test for GHA, so as to unambiguously catch situations where CUDA is not properly installed on a runner --------- Signed-off-by: Kyle Sayers <[email protected]>
## Purpose ##

* Abstract the functionality which allows modifiers to act as quantization configs into a mixin called `QuantizationMixin`
  * This gives #1279 an interface to properly infer which pipeline to use based on the recipe (if a recipe contains modifiers that require calibration, use the "basic" or "sequential" pipelines)
  * This enables future modifiers to act as quantization modifiers (in the same way that GPTQ does now)
* Related to #1354, where previous logic would attempt to add a QuantizedKVCache for dynamic kv_quant

## Changes ##

* Implement `QuantizationMixin`, which implements five public methods
  * Lifecycle methods
    * `initialize_quantization` is used to apply a config and attach observers to a model
      * Quantization is disabled so that modules aren't quantized before they're calibrated
    * `start_calibration` is used to initialize calibration hooks and status
      * Quantization is enabled, since we currently quantize as we calibrate, although this decision is somewhat arbitrary
    * `end_calibration` is used to remove calibration hooks and apply the frozen status
      * Quantization remains enabled, since we want future forward passes to simulate quantization
  * Recipe-related methods
    * `has_config` returns true if a config was specified; used for checking against duplicate configs in the recipe
    * `resolve_quantization_config` returns the quantization config specified by the modifier fields
* `QuantizationModifier` inherits from `QuantizationMixin`
* `GPTQModifier` inherits from `QuantizationMixin`
  * Unlike QMod, GPTQ disables quantization during calibration. As noted before, this is a somewhat arbitrary choice, but one which matches the current implementation
* Calibration utils
  * Replace `set_unset_kv_cache` with `initialize_quantized_kv_cache` and `freeze_module_quantization`
    * Treat the `QuantizedKVCache` as analogous to another observer
  * Pull setting the calibration status out of `update_weight_zp_scale`
    * This better matches the lifecycle detailed in the `QuantizationMixin` description
  * Implement `reset_quantization_status`, which is used to remove any existing quantization configs before the current config is applied by `initialize_quantization`

## Remove Support ##

* Remove support for recipes with multiple quantization modifiers active at the same time (a check for this will be added by #1279)
* Remove `num_calibration_steps`, `quantize`, `disable_quantization_observer_epoch`, and `min_tokens_per_module`
  * `num_calibration_steps` is already controlled by https://github.com/vllm-project/llm-compressor/blob/42b62f5283d0234b26623fe1f1bf02a77c6e4019/src/llmcompressor/datasets/utils.py#L106
  * `quantize` was implemented as a workaround for GPTQ's modifier builder. Similar functionality may be required to support SpinQuant + GPTQ, but such functionality should exist at a higher level
  * `disable_quantization_observer_epoch` seems to implement functionality where a model's observers are removed but quantization remains active. This functionality is maintained by setting an "end" epoch for qmod
  * `min_tokens_per_module` requires that the modifier have references to the calibration dataset, which is disallowed by #1279. This information is already printed in GPTQ's logs. If research still wants this tool specifically for `QuantizationModifier`, it can be reimplemented to avoid using references to the calibration dataset

## Testing ##

* Updated tests to reflect the new mixin
* Ran a set of GPTQ and QuantizationModifier examples to completion
* CI tests pass

---------

Signed-off-by: Kyle Sayers <[email protected]>
This PR updates the main README.md to introduce a "New Features" section, improving visibility for recent major additions to LLM Compressor. This section highlights: - Axolotl Sparse Finetuning Integration (https://docs.axolotl.ai/docs/custom_integrations.html#llmcompressor) - AutoAWQ Integration for low-bit weight quantization (#1177) - Day 0 Llama 4 support and its use by Meta This helps users quickly understand the latest capabilities of the library. --------- Signed-off-by: Rahul Tuli <[email protected]>
SUMMARY: Add support for tracing of Gemma3: [issue #1248](#1248). Steps taken: 1. Create gemma3.py from HF and update __init__.py. 2. Modify the following classes and functions: 2.1 Gemma3ForConditionalGeneration: _update_causal_mask and forward; 2.2 Gemma3TextModel: _update_causal_mask, forward, and _prepare_4d_causal_attention_mask_with_cache_position. TEST PLAN: Ran `llmcompressor.trace --model_id google/gemma-3-4b-it --model_class TraceableGemma3ForConditionalGeneration --ignore "lm_head" "re:vision_tower.*" --modality vision` Output screenshot: https://github.com/user-attachments/assets/8f5c9c7d-32a9-4b12-b4b2-10b6a4352846 This is my first attempt at solving this issue. It was a fun learning experience, so please review it carefully. Gemma3 can go through tracing now, but we might need further tests for quantization as well. --------- Signed-off-by: Kelvin Cheng <[email protected]> Signed-off-by: Kyle Sayers <[email protected]> Signed-off-by: Rahul Tuli <[email protected]> Signed-off-by: Brian Dellabetta <[email protected]> Signed-off-by: Domenic Barbuzzi <[email protected]> Co-authored-by: Kyle Sayers <[email protected]> Co-authored-by: Dipika Sikka <[email protected]> Co-authored-by: Vedant <[email protected]> Co-authored-by: Rahul Tuli <[email protected]> Co-authored-by: Brian Dellabetta <[email protected]> Co-authored-by: Domenic Barbuzzi <[email protected]>
@brian-dellabetta The types of tracing failures are as follows:
(1) and (2) are covered by the tracing tests. (3) does not currently have a test (I'll add this as a follow up), but from my testing of the examples, it seems that this is not an issue
discussed further over a call, impressive work!
the tracing guide was so well written
im sad to see it go :(
@dsikka I have no such attachment :)
This LGTM, but I was unfamiliar with the ast library, so I used this source to try to understand the manipulation provided by the classes: https://github.com/xbeat/Machine-Learning/blob/main/Exploring%20Python's%20Abstract%20Syntax%20Tree%20Manipulation.md
We should potentially provide some sort of docs/resource in the source code as well before landing
src/llmcompressor/pipelines/sequential/ast_utils/control_flow_analyzer.py
awesome stuff!
Seems like there's a problem specifically with loading llama4 processors https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct/discussions/45 🫠
## Purpose ## * Support the latest transformers release ## Prerequisites ## * #1481 * #1411 ## Fixes ## * #1457 ## Changes ## * Unpin transformers version * Add `torchvision`, `librosa`, and `soundfile` to dev dependencies (needed to test models) * Fix default ignore list for tracing debugger * Add back llama4 model tests --------- Signed-off-by: Kyle Sayers <[email protected]>
Purpose

Support the following model classes without requiring traceable definitions (all of them):

* `Idefics3ForConditionalGeneration`
* `LlavaForConditionalGeneration`
* `MllamaForConditionalGeneration`
* `Qwen2_5_VLForConditionalGeneration`
* `Qwen2VLForConditionalGeneration`
* `Gemma3ForConditionalGeneration`

Fixes
Autowrap Patterns
These patterns match syntax which is untraceable and unlikely to call sequential targets (either directly or indirectly)
If statements whose conditions cannot be statically evaluated
An if statement can be statically evaluated when its condition's value can be evaluated in the context of `{"self": LlamaModel(...)}`. If the statement cannot be statically evaluated, then it is wrapped (see the sketch below).
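For illustration, here is a minimal sketch of what "statically evaluable in a fixed context" means. The `can_statically_evaluate` helper and the dummy classes are hypothetical, not the actual AutoWrapper code.

```python
import ast

# Minimal sketch (not the actual AutoWrapper code) of deciding whether an `if`
# condition can be evaluated ahead of time in a fixed namespace such as
# {"self": model}. DummyModel/DummyConfig are hypothetical stand-ins.
def can_statically_evaluate(condition_src: str, namespace: dict) -> bool:
    try:
        compiled = compile(ast.parse(condition_src, mode="eval"), "<cond>", "eval")
        eval(compiled, dict(namespace))  # only names known at wrap time may appear
        return True
    except Exception:
        # the condition references runtime inputs (e.g. `attention_mask`),
        # so it cannot be resolved ahead of time and the statement is wrapped
        return False

class DummyConfig:
    _attn_implementation = "sdpa"

class DummyModel:
    config = DummyConfig()

namespace = {"self": DummyModel()}
print(can_statically_evaluate('self.config._attn_implementation == "eager"', namespace))  # True
print(can_statically_evaluate("attention_mask is not None", namespace))                   # False -> wrapped
```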
Ignored functions (`_update_causal_mask`)
Any function or method names listed in the ignore list will be automatically wrapped
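A rough sketch (assumed names, not the project's implementation) of how calls to ignored names can be detected in the AST:

```python
import ast

# Rough sketch (not the actual implementation): flag statements that call a
# function or method whose name appears in the ignore list, so they can be
# auto-wrapped instead of traced.
IGNORE = {"_update_causal_mask"}

source = """
causal_mask = self._update_causal_mask(attention_mask, inputs_embeds)
hidden_states = self.layers[0](hidden_states)
"""

for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.Call):
        func = node.func
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name in IGNORE:
            print(f"line {node.lineno}: call to {name!r} will be auto-wrapped")
```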
Starred tuple unpacking
Any use of iterated unpacking will be automatically wrapped
Starred argument unpacking
Any use of iterated unpacking into variadic args is automatically wrapped
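A small, hypothetical module (illustrative names, not taken from any model) showing why both forms of iterated unpacking break under `torch.fx` symbolic tracing:

```python
import torch
from torch.fx import symbolic_trace

# Hypothetical module: during symbolic tracing the input is a Proxy whose
# length is unknown, so iterated unpacking cannot be traced.
class Unpacks(torch.nn.Module):
    def forward(self, xs):
        first, *rest = xs                  # starred tuple unpacking
        return torch.cat([first, *rest])   # starred argument unpacking

try:
    symbolic_trace(Unpacks())
except Exception as err:
    # torch.fx raises a TraceError because a Proxy cannot be iterated;
    # the AutoWrapper wraps such statements instead of tracing them.
    print(f"untraceable: {err}")
```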
Autowrap Implementation Details
Wrapper arguments
Autowrapping a piece of code requires determining which variable names are used by that code and which variable names are produced by it. This is done using the `NameAnalyzer`, which determines the unbound, assigned, and conditionally assigned names for a given piece of code. This information is then used to determine what the args, kwargs, and return names of the wrapping function should be.
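A simplified sketch of the idea, not the actual `NameAnalyzer` (which also handles conditional assignment and read-before-write ordering):

```python
import ast

# Simplified sketch: names read but never assigned become wrapper arguments,
# and names assigned become wrapper return values.
def analyze_names(source: str):
    loaded, stored = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            (stored if isinstance(node.ctx, ast.Store) else loaded).add(node.id)
    unbound = loaded - stored   # -> args/kwargs of the wrapping function
    assigned = stored           # -> return names of the wrapping function
    return unbound, assigned

code = """
if attention_mask is not None:
    causal_mask = attention_mask * min_dtype
else:
    causal_mask = None
"""
print(analyze_names(code))
# roughly ({'attention_mask', 'min_dtype'}, {'causal_mask'})
```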
Wrapping methods
Some untraceable code references `self` during execution. While `self` would normally be an argument to the wrapped function, it is usually a `torch.nn.Module`, which is not a handled type that can be passed around the graph. Instead, we treat `self` as a variable in the compiled python module namespace, and this namespace is automatically captured and executed by `torch.fx._symbolic_trace`.
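A minimal sketch of the namespace idea only (the wrapper name and `DummyModel` are hypothetical, and the real implementation goes through `torch.fx`):

```python
# Minimal sketch: the wrapper source is compiled into a namespace that already
# binds `self`, so the wrapped function can call methods on the module without
# `self` being passed through the traced graph.
wrapper_src = """
def wrapped__update_causal_mask(attention_mask, input_tensor):
    # `self` resolves through the compiled module namespace, not an argument
    return self._update_causal_mask(attention_mask, input_tensor)
"""

class DummyModel:
    def _update_causal_mask(self, attention_mask, input_tensor):
        return attention_mask

namespace = {"self": DummyModel()}          # captured at autowrap time
exec(compile(wrapper_src, "<autowrapped>", "exec"), namespace)
wrapped = namespace["wrapped__update_causal_mask"]
print(wrapped("mask", "inputs"))            # -> "mask"
```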
Unwrappable code

Some code cannot be wrapped because it contains control flow statements which must exist in a certain context. For example, we cannot wrap code that contains a `continue` without also wrapping the for loop that surrounds it.
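An illustrative sketch of why: extracting just the `if ... continue` into its own function body is not even valid Python, so such code must be wrapped together with its enclosing loop.

```python
import textwrap

# Illustrative: `continue` is only legal inside a loop, so this candidate
# wrapper body fails to compile on its own.
candidate = textwrap.dedent("""
    def _wrapped(decoder_layer):
        if decoder_layer is None:
            continue
""")

try:
    compile(candidate, "<autowrapped>", "exec")
except SyntaxError as err:
    print(f"cannot wrap in isolation: {err.msg}")  # "'continue' not properly in loop"
```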
Future Extensions / Improvements

Sequentially executing vision towers
Sequentially tracing vision towers is a lower priority, as the vision towers typically have fewer parameters and aren't quantization targets. However, in the event that they do become quantization targets, or memory(vision_tower + one target) > memory(one gpu), then the vision tower layers will need to be split up.
Some changes may be required to support this. Conditionally executing the vision tower is a very common pattern:
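Roughly the following shape (a hypothetical toy module, not a specific model's source), where the vision tower only executes when image inputs are present, so whether it runs depends on the sample inputs rather than the code:

```python
import torch

# Illustrative sketch of the conditional vision-tower pattern.
class TinyMultimodal(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_tower = torch.nn.Linear(4, 8)
        self.language_model = torch.nn.Linear(8, 8)

    def forward(self, inputs_embeds, pixel_values=None):
        if pixel_values is not None:
            image_embeds = self.vision_tower(pixel_values)
            inputs_embeds = inputs_embeds + image_embeds
        return self.language_model(inputs_embeds)

model = TinyMultimodal()
print(model(torch.randn(2, 8)).shape)                     # text-only: vision tower skipped
print(model(torch.randn(2, 8), torch.randn(2, 4)).shape)  # multimodal: vision tower runs
```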
Some approaches might be:

* Allowing names like `image_embeds` to be evaluated based on the sample input being passed
* Detecting calls of the form `self.{module_name}()`, where `module_name` is determined to be a module through evaluation
* Using `jedi` to track the types of all names, and to check if any names whose type is a module are called
Towards perfect autowrapping

As mentioned in "Sequentially executing vision towers", it may be possible to use type-hinting analysis tools like `jedi` or `pytype` to infer whether any given code chunk calls a sequential target or target ancestor. If this can be done reliably (which will require extensions such as analysis of called functions), then all code that does not call sequential targets can be wrapped.
Towards removing tracing

If we can reliably determine whether a given code chunk calls any sequential targets, it may be possible to define each autowrapped function as its own subgraph directly, without the need for tracing. However, this will require inferring model execution from the static code, which means unrolling loops, expanding any function/method calls, and resolving dynamic model structure (in the case of llava and idefics), which may be tricky.

For example, if you want to determine the execution of the LlamaDecoder layers of a llava model like pixtral, you'd need to evaluate `self.language_model`, then analyze the source of that module's forward function, which is stored in a separate file. Another point of evaluation would be evaluating any iteration over ModuleLists.

The tracing system could be replaced with static code inference; both are different systems for solving the problem of determining model execution.
Testing

* Models in `tests/llmcompressor/transformers/tracing/test_models.py` pass without requiring traceable definitions
* Ran the `LlamaForCausalLM` example, `examples/quantization_w4a16/llama3_example.py`, to completion