📝 Walkthrough

Adds a new LayerActivationCollector module for sequential, per-decoder-layer activation capture and calibration; integrates it into the `sequential_calibrate` flow, adds HuggingFace decoder discovery helpers, removes the prior in-file collector utilities, and expands tests covering discovery and sequential calibration behaviors.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller as Caller / Test
    participant LAC as LayerActivationCollector
    participant Model as Model
    participant Layer as DecoderLayer
    Caller->>LAC: _patch_all_layers()
    LAC->>Model: attach _seq_calib to decoder layers
    LAC->>Layer: replace forward with _patched_forward
    loop per target layer
        Caller->>LAC: _set_layer_states(target=Layer)
        LAC->>Layer: set mode=capture (others=skip/run)
        Caller->>LAC: get_input_activations(Layer, forward_loop)
        LAC->>Model: run forward_loop
        Model->>Layer: forward() -> _patched_forward(mode=capture)
        Layer-->>LAC: record inputs + raise _EarlyStopForwardError
        LAC->>LAC: collect inputs, restore layer mode
        LAC-->>Caller: return captured inputs
    end
    Caller->>LAC: _unpatch_all_layers()
    LAC->>Layer: restore original forward
```
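The per-layer mode transitions driving this flow can be sketched in plain Python. This is a simplified, framework-free illustration of the skip → run → capture progression described in the PR; the class and function names below are invented for the sketch, and the real collector operates on torch modules via a `_seq_calib` state attached to each decoder layer:

```python
from collections import deque

class LayerState:
    """Simplified per-layer calibration state (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.mode = "capture"         # capture -> original -> run -> skip
        self.collected_inputs = []    # filled during the capture pass
        self.cached_inputs = deque()  # replayed during the run pass
        self.output_meta = None       # shape info used to fake outputs in skip mode

def set_layer_states(states, layer_idx):
    """Transition modes before calibrating layer `layer_idx`."""
    if layer_idx > 1:
        done = states[layer_idx - 2]
        done.mode = "skip"
        done.cached_inputs = deque()  # fully done: free memory
    if layer_idx > 0:
        prev = states[layer_idx - 1]
        prev.mode = "run"
        prev.cached_inputs = deque(prev.collected_inputs)
        prev.collected_inputs = []
    cur = states[layer_idx]
    cur.mode = "capture"
    cur.collected_inputs = []

states = [LayerState(f"layer{i}") for i in range(4)]
set_layer_states(states, 2)
print([s.mode for s in states])  # ['skip', 'run', 'capture', 'capture']
```

Each call advances a sliding window over the layers, which is what keeps the overall cost O(N) instead of O(N²): at most one layer is ever re-run per calibration step.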
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Force-pushed from 13b9033 to 7f72422.
modelopt/torch/quantization/utils.py (Outdated)

```python
class LayerActivationCollector:
    """Helper class for collecting layer activations during forward passes."""
```
Can we move the patching and capture related logic to a new file?
Pull request overview
Refactors sequential calibration to avoid O(N²) activation collection by introducing a stateful, model-forward-based activation collection approach (skip/run/capture) that preserves parent-module forward logic.
Changes:
- Replaces the previous decoder-layer detection + per-layer patching with a state-aware `LayerActivationCollector` that patches all decoder layers and uses early-stop to capture inputs efficiently.
- Updates `sequential_calibrate()` to use the new collector and removes the legacy `get_decoder_layers()` heuristic from `modelopt/torch/utils/network.py`.
- Adds/extends unit tests to validate skip/run/capture behavior, decoder-layer discoverer registration, and HuggingFace integration.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/torch/quantization/test_utils.py | Adds unit tests for the new LayerActivationCollector registration/discovery API and skip/run/capture semantics. |
| tests/unit/torch/quantization/test_sequential_calibrate.py | Adds deeper behavioral tests for tuple outputs, inter-layer ops, mode transitions, and cleanup for sequential calibration patching. |
| tests/unit/torch/quantization/test_calib.py | Adds tests for the new sequential calibration support gate and activation propagation behavior. |
| tests/unit/torch/quantization/plugins/test_huggingface.py | Adds tests validating HuggingFace decoder-layer discoverer registration and homogeneous model detection. |
| modelopt/torch/utils/network.py | Removes the old get_decoder_layers() heuristic helper. |
| modelopt/torch/quantization/utils.py | Introduces the new stateful LayerActivationCollector with patched forward and output-metadata-based skipping. |
| modelopt/torch/quantization/plugins/huggingface.py | Registers HuggingFace decoder-layer discoverers with LayerActivationCollector. |
| modelopt/torch/quantization/model_calib.py | Updates sequential_calibrate() to use the new collector lifecycle (patch once, calibrate layer-by-layer, unpatch). |
```python
def is_homogenous_hf_model(model: nn.Module) -> bool:
    if is_nemotron_h_model(model):
        return False
    decoder_layers = get_homogeneous_hf_decoder_layers(model)
    if decoder_layers is None or len(decoder_layers) == 0:
        return False
    layer_classes = {type(layer) for layer in decoder_layers}
    return len(layer_classes) == 1
```
The helper is named is_homogenous_hf_model, but the correct term is “homogeneous”. Since this is a newly introduced API surface (and is referenced by tests/registration), consider renaming it (and its uses) to is_homogeneous_hf_model for clarity and to avoid baking in a misspelling.
```python
def _register_test_discoverer(monkeypatch):
    """Register a simple discoverer that finds model.layers on any model."""
    monkeypatch.setattr(
        LayerActivationCollector,
        "_decoder_layer_support",
        [(lambda m: hasattr(m, "layers"), lambda m: m.layers)],
    )
```
This file now introduces _register_test_discoverer() for decoder-layer discovery, but earlier tests in the same module still call LayerActivationCollector.get_input_activations() / sequential_calibrate() without registering any decoder-layer support (and without patching collector layers). With the new LayerActivationCollector implementation, those tests will fail unless they register a discoverer (like this helper) and patch/unpatch appropriately.
modelopt/torch/quantization/utils.py (Outdated)

```python
_decoder_layer_support: list[tuple[Any, Any]] = []
_LAYER_ATTR = "_seq_calib"

def __init__(self, model: nn.Module):
    self.model = model
    self._decoder_layers: nn.ModuleList | None = None
    self._layer_to_idx: dict[nn.Module, int] = {}
    self._patched = False

# ------------------------------------------------------------------
# Decoder-layer discovery
# ------------------------------------------------------------------

def get_decoder_layers(model: nn.Module) -> nn.ModuleList | None:
    """Return decoder layers supported by sequential calibration."""
    for is_supported, discoverer in LayerActivationCollector._decoder_layer_support:
        if not is_supported(model):
            continue
        decoder_layers = discoverer(model)
        if decoder_layers is not None:
            return decoder_layers
    return None
```

(The diff also removes the old `_patch_and_initialize_layer` staticmethod, whose docstring read "Patch a layer to collect inputs during forward passes.")
LayerActivationCollector._decoder_layer_support defaults to an empty list, so LayerActivationCollector.is_supported() (and thus sequential_calibrate()) will always fail unless some plugin/module has been imported that registers discoverers. If sequential_calibrate/collector are intended to be usable when imported directly (without modelopt.torch.quantization which imports plugins), consider registering baseline discoverers for common patterns (e.g. model.layers, model.model.layers, model.decoder.layers, model.transformer.h) in utils.py, similar to the removed get_decoder_layers heuristic.
modelopt/torch/quantization/utils.py (Outdated)

```python
    Layers before the target are skipped or re-run (if just calibrated), the
    target layer captures its inputs, and an early-stop prevents unnecessary
    computation beyond the target.
    """
```
LayerActivationCollector.get_input_activations() assumes _patch_all_layers() has already been called (it indexes into _layer_to_idx and relies on patched forwards). As written, calling this public method without patching will raise KeyError / behave incorrectly. Either make patching lazy inside get_input_activations() (patch on first use) or add a clear assertion/error telling callers to call _patch_all_layers() first.
Suggested change:

```python
    # Ensure layers have been patched before attempting to collect activations.
    if not hasattr(self, "_layer_to_idx") or layer not in self._layer_to_idx:
        raise RuntimeError(
            "LayerActivationCollector.get_input_activations() was called before layers were patched. "
            "Make sure to call LayerActivationCollector._patch_all_layers() (or the appropriate "
            "patching method) before requesting input activations."
        )
```
modelopt/torch/quantization/utils.py (Outdated)

```python
if info.output_meta is not None:
    return LayerActivationCollector._zeros_from_meta(info.output_meta)
print_rank_0(f"Layer {info.name} is in 'skip' mode but has no output meta to return")
return args[0] if args else next(iter(kwargs.values()))
```
In skip mode, when output_meta is missing the code falls back to returning the first input (args[0] / first kwarg). This can break parent forwards that expect the layer’s real output structure (e.g., tuple unpacking) and can silently mask state bugs. Prefer raising an explicit error (or computing/storing output_meta earlier) instead of returning an arbitrary fallback.
Suggested change:

```python
raise RuntimeError(
    f"Sequential calibration state error: layer {info.name} is in 'skip' mode "
    "but has no recorded output_meta. Ensure a prior 'run' or 'capture' phase "
    "has executed and populated output_meta before switching to 'skip' mode."
)
```
modelopt/torch/quantization/utils.py (Outdated)

```python
if info.mode == "run":
    assert info.cached_inputs, (
        f"Layer {info.name} is in 'run' mode but has no cached inputs to replay."
    )
    real_args, real_kwargs = info.cached_inputs.pop(0)
    output = self._original_forward(*real_args, **real_kwargs)
    info.output_meta = LayerActivationCollector._extract_output_meta(output)
```
run mode consumes cached inputs via pop(0), which is O(n) per batch (shifts the list) and can become costly for large calibration sets. Use a collections.deque for cached_inputs (and popleft()), or track an index pointer instead of mutating the front of a list.
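The data-structure difference is easy to demonstrate with plain Python. The sketch below is only an illustration of the `list.pop(0)` vs `collections.deque.popleft()` trade-off, not the collector's actual code:

```python
from collections import deque

# list.pop(0) shifts every remaining element: O(n) per call, O(n^2) overall.
batches_list = [("args", i) for i in range(5)]
replayed_from_list = [batches_list.pop(0) for _ in range(len(batches_list))]

# deque.popleft() is O(1) per call and preserves the same FIFO order.
batches_deque = deque(("args", i) for i in range(5))
replayed_from_deque = []
while batches_deque:
    replayed_from_deque.append(batches_deque.popleft())

print(replayed_from_list == replayed_from_deque)  # True: same order, cheaper pops
```

For typical calibration set sizes the constant factors are small either way, but the deque makes the FIFO intent explicit and removes the quadratic behavior for large sets.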
Logic Correctness Review

Overall, this is a well-designed refactor that correctly implements the state machine for sequential calibration. However, I found several logic issues that should be addressed:

🔴 Critical Issues

1. Race condition in state transitions for layers 0 and 1

In `_set_layer_states`:

```python
if layer_idx > 1:
    done = self._decoder_layers[layer_idx - 2]._seq_calib
    done.mode = "skip"
    done.cached_inputs = []  # Free memory
if layer_idx > 0:
    prev = self._decoder_layers[layer_idx - 1]._seq_calib
    prev.mode = "run"
    prev.cached_inputs = prev.collected_inputs  # Move collected → cached
    prev.collected_inputs = []
```

Issue: when … However, when … But there's a subtle bug with layer 0's state across iterations:

- Iteration 0 (calibrating layer 0): …
- Iteration 1 (calibrating layer 1): …

Wait, let me re-check. Actually the logic looks correct because … The real bug: when layer 0 transitions to "original" after its calibration, and then we start calibrating layer 1, we call … Actually, looking more carefully at the test … Let me reconsider: the logic seems correct for the intended use case. However, there's a different issue:

2. Missing validation: what if …
modelopt/torch/quantization/utils.py (Outdated)

```python
    * Layer ``i - 1`` → **run** (replay captured inputs with calibrated weights).
    * Layer ``i`` → **capture** (record inputs, then early-stop).
    """
    assert self._decoder_layers is not None
```
Consider adding validation here to check that `collected_inputs` is non-empty before transitioning to 'run' mode:

```python
if layer_idx > 0:
    prev = self._decoder_layers[layer_idx - 1]._seq_calib
    if not prev.collected_inputs:
        raise RuntimeError(
            f"Layer {layer_idx - 1} has no collected inputs to replay. "
            "This may indicate the previous layer was never called during forward."
        )
    prev.mode = "run"
    prev.cached_inputs = prev.collected_inputs
    prev.collected_inputs = []
```

This would provide a clearer error message if something goes wrong in the capture phase.
modelopt/torch/quantization/utils.py (Outdated)

```python
# ------------------------------------------------------------------

def _set_layer_states(self, layer_idx: int):
    """Transition layer modes for the next calibration step.
```
Minor memory optimization: consider clearing `output_meta` when transitioning to 'skip' mode:

```python
if layer_idx > 1:
    done = self._decoder_layers[layer_idx - 2]._seq_calib
    done.mode = "skip"
    done.cached_inputs = []
    done.output_meta = None  # Free memory for large models with many layers
```

While `output_meta` is small per layer, it could add up for models with hundreds or thousands of layers.
modelopt/torch/quantization/utils.py (Outdated)

```python
layer._seq_calib = _LayerCalibState(
    name=module_to_name.get(layer, type(layer).__name__),
)
bind_forward_method(layer, self._patched_forward, "_original_forward")
```
Exception safety concern: if an exception occurs during patching (e.g., `bind_forward_method` fails on layer N), the `_patched` flag remains `False` but some layers may be partially patched. `_unpatch_all_layers()` won't clean up because it checks `if not self._patched: return`.

Consider either:
1. Setting `_patched = True` only after all patches succeed, OR
2. Cleaning up on exception in `_patch_all_layers()`, OR
3. Making `_unpatch_all_layers()` always attempt cleanup regardless of the `_patched` flag.

Example fix:

```python
def _patch_all_layers(self):
    try:
        self._decoder_layers = self.get_decoder_layers(self.model)
        assert self._decoder_layers is not None
        self._layer_to_idx = {layer: i for i, layer in enumerate(self._decoder_layers)}
        module_to_name = {m: name for name, m in self.model.named_modules()}

        for layer in self._decoder_layers:
            layer._seq_calib = _LayerCalibState(
                name=module_to_name.get(layer, type(layer).__name__),
            )
            bind_forward_method(layer, self._patched_forward, "_original_forward")

        self._patched = True  # Only set after all patches succeed
    except Exception:
        # Cleanup on failure
        self._unpatch_all_layers()
        raise
```
modelopt/torch/quantization/utils.py (Outdated)

```python
self._patched = False

# ------------------------------------------------------------------
# Decoder-layer discovery
```
Potential issue with non-tensor outputs: when a layer returns a non-tensor value (e.g., a dict or mutable object), `_extract_output_meta` stores the actual value in `meta[1]`, and `_zeros_from_meta` returns the same instance:

```python
# In _extract_output_meta:
return ("other", output)  # Stores reference to original object

# In _zeros_from_meta:
return meta[1]  # Returns same reference!
```

This could lead to unexpected mutations if the skip layer's output is modified by downstream code. Consider:
1. Deep-copying non-tensor values, OR
2. Documenting that mutable non-tensor outputs are not supported, OR
3. Returning a safe default (e.g., `None` or `copy.deepcopy(meta[1])`).

For most cases (`None`, immutable values) this is fine, but it's a potential footgun for heterogeneous architectures.
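The aliasing footgun described above can be demonstrated with plain Python, independent of the collector (the `meta` tuple shape mirrors the `("other", output)` convention from the comment; everything else is illustrative):

```python
import copy

# Storing a reference in metadata means skip-mode "replays" alias the original.
original_output = {"hidden": [1, 2, 3]}
meta = ("other", original_output)   # what the metadata extraction would store

replayed = meta[1]                  # returned as-is by the replay path
replayed["hidden"].append(99)       # downstream mutation...
print(original_output["hidden"])    # [1, 2, 3, 99] -- the cached value was corrupted

# Deep-copying at replay time isolates the caller from the cached value.
safe = copy.deepcopy(meta[1])
safe["hidden"].append(-1)
print(original_output["hidden"])    # [1, 2, 3, 99] -- unchanged by the copy's mutation
```

Deep-copying only on the `"other"` branch keeps the common tensor/`None` paths cheap while closing the mutation hole.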
modelopt/torch/quantization/utils.py (Outdated)

```python
if layer_idx > 0:
    prev = self._decoder_layers[layer_idx - 1]._seq_calib
    prev.mode = "run"
    prev.cached_inputs = prev.collected_inputs
```
I'm a bit confused about why we need both `cached_inputs` and `collected_inputs` in `_LayerCalibState`; is it mainly to indicate which state this data (activation) is in?
We don't really need it to be separate. Just have it that way to make it clear and avoid using stale values.
Actionable comments posted: 2
🧹 Nitpick comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
1199-1206: Consider making the generic HF discoverer structural instead of homogeneous-only.

`LayerActivationCollector` now tracks `output_meta` per layer, and the new regression cases cover heterogeneous layer stacks. Requiring a single layer class here seems stricter than the runtime needs and will exclude mixed-block decoders that still expose a clean `model.model.layers` list.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 1199 - 1206, The current is_homogenous_hf_model function rejects decoders with mixed layer classes; instead relax it to a structural discoverer: in is_homogenous_hf_model (and still respect is_nemotron_h_model) return True whenever get_homogeneous_hf_decoder_layers(model) yields a non-empty decoder layer list and the model exposes a layers sequence (e.g. model.model.layers) whose elements provide the runtime metadata LayerActivationCollector expects (check for the presence of output_meta or another per-layer attribute/ability that LayerActivationCollector uses), and remove the strict len(layer_classes) == 1 check; keep using get_homogeneous_hf_decoder_layers and is_nemotron_h_model as anchors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/quantization/activation_collector.py`:
- Around line 50-52: The current implementation stores a single output_meta that
gets overwritten in ActivationCollector.run and reused in skip, causing replayed
batches to get incorrect shapes; change output_meta from a single tuple to a
per-batch sequence (e.g., a deque or list) named output_meta_list or
output_meta: deque, append the metadata for each captured batch inside
ActivationCollector.run (use the same indexing order as
cached_inputs/collected_inputs), update ActivationCollector.skip and any replay
logic to consume/peek the corresponding per-batch metadata in FIFO order
(maintaining alignment with cached_inputs entries) and remove or rotate entries
consistently after replays so each replayed batch uses its original per-batch
output_meta rather than a shared slot (also apply the same change to the other
occurrence referenced around lines 163-179).
- Around line 241-265: The state-transition code in _set_layer_states assumes
prior stages captured/replayed successfully; add fail-fast checks before
changing modes: verify done.output_meta is present (non-empty) before setting
done.mode="skip" and verify prev.collected_inputs is non-empty before setting
prev.mode="run" and copying to prev.cached_inputs; if either check fails, raise
a clear RuntimeError indicating the specific layer index and missing data (use
layer_idx and references to self._decoder_layers[...] ._seq_calib). After
forward_loop() completes, also validate that the current layer's
collected_inputs is non-empty and raise a descriptive error immediately if the
capture is empty. Apply the same guardrails to the analogous transition block
later in the file (the other spot that manipulates ._seq_calib,
.collected_inputs and .cached_inputs).
---
Nitpick comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 1199-1206: The current is_homogenous_hf_model function rejects
decoders with mixed layer classes; instead relax it to a structural discoverer:
in is_homogenous_hf_model (and still respect is_nemotron_h_model) return True
whenever get_homogeneous_hf_decoder_layers(model) yields a non-empty decoder
layer list and the model exposes a layers sequence (e.g. model.model.layers)
whose elements provide the runtime metadata LayerActivationCollector expects
(check for the presence of output_meta or another per-layer attribute/ability
that LayerActivationCollector uses), and remove the strict len(layer_classes) ==
1 check; keep using get_homogeneous_hf_decoder_layers and is_nemotron_h_model as
anchors.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 913af19a-8050-4673-b980-42453284b8d7
📒 Files selected for processing (9)
- modelopt/torch/quantization/activation_collector.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/plugins/huggingface.py
- modelopt/torch/quantization/utils.py
- modelopt/torch/utils/network.py
- tests/unit/torch/quantization/plugins/test_huggingface.py
- tests/unit/torch/quantization/test_calib.py
- tests/unit/torch/quantization/test_sequential_calibrate.py
- tests/unit/torch/quantization/test_utils.py
💤 Files with no reviewable changes (1)
- modelopt/torch/utils/network.py
```python
cached_inputs: deque = field(default_factory=deque)
collected_inputs: list = field(default_factory=list)
output_meta: tuple | None = None
```
output_meta only tracks the last batch.
`run` overwrites a single `output_meta` on every replayed batch, and `skip` reuses that one shape for every later batch. With a normal `drop_last=False` loader or variable sequence lengths, a later pass can synthesize dummy outputs with the wrong dimensions for earlier batches. This needs per-batch metadata that can be replayed in order on each future pass, not one shared slot per layer.
Also applies to: 163-179
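One way to keep per-batch metadata aligned with the replay order is a deque that is rotated, not drained, on each skip pass. This is an illustrative sketch with invented batch data, not the collector's code; only the idea of pairing each cached input with its own shape entry comes from the review comment:

```python
from collections import deque

# One metadata entry per batch, kept in the same FIFO order as cached inputs.
cached_inputs = deque([("batch", 0), ("batch", 1)])
output_meta = deque()

# "run" pass: replay each batch and record its own output shape.
for args in list(cached_inputs):
    seq_len = 8 if args[1] == 0 else 5  # variable sequence lengths per batch
    output_meta.append(("tensor", (1, seq_len, 16)))

# Later "skip" pass: consume metadata batch-by-batch, then rotate each entry
# back so every future pass sees the same per-batch shapes in order.
shapes = []
for _ in range(len(output_meta)):
    meta = output_meta.popleft()
    shapes.append(meta[1])
    output_meta.append(meta)

print(shapes)  # [(1, 8, 16), (1, 5, 16)]
```

The rotation keeps the deque intact across passes, so a layer that stays in skip mode for many iterations always fabricates outputs with each batch's original shape.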
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/activation_collector.py` around lines 50 - 52,
The current implementation stores a single output_meta that gets overwritten in
ActivationCollector.run and reused in skip, causing replayed batches to get
incorrect shapes; change output_meta from a single tuple to a per-batch sequence
(e.g., a deque or list) named output_meta_list or output_meta: deque, append the
metadata for each captured batch inside ActivationCollector.run (use the same
indexing order as cached_inputs/collected_inputs), update
ActivationCollector.skip and any replay logic to consume/peek the corresponding
per-batch metadata in FIFO order (maintaining alignment with cached_inputs
entries) and remove or rotate entries consistently after replays so each
replayed batch uses its original per-batch output_meta rather than a shared slot
(also apply the same change to the other occurrence referenced around lines
163-179).
```python
def _set_layer_states(self, layer_idx: int):
    """Transition layer modes for the next calibration step.

    When calibrating layer *i*, three transitions happen:

    * Layer ``i - 2`` → **skip** (fully done, free its cached inputs).
    * Layer ``i - 1`` → **run** (replay captured inputs with calibrated weights).
    * Layer ``i`` → **capture** (record inputs, then early-stop).
    """
    assert self._decoder_layers is not None

    if layer_idx > 1:
        done = self._decoder_layers[layer_idx - 2]._seq_calib
        done.mode = "skip"
        done.cached_inputs = deque()

    if layer_idx > 0:
        prev = self._decoder_layers[layer_idx - 1]._seq_calib
        prev.mode = "run"
        prev.cached_inputs = deque(prev.collected_inputs)
        prev.collected_inputs = []

    cur = self._decoder_layers[layer_idx]._seq_calib
    cur.mode = "capture"
    cur.collected_inputs = []
```
Fail fast when the sequential state machine falls out of sync.
These transitions assume the previous pass both captured inputs and replayed once. If a caller skips a decoder layer, or forward_loop() never reaches the target, you only discover it later when a skipped layer hits the RuntimeError on Line 165. Please validate done.output_meta / prev.collected_inputs before switching modes, and reject an empty capture right after forward_loop() so the failure is reported at the source.
Suggested guardrails

```diff
 def _set_layer_states(self, layer_idx: int):
     assert self._decoder_layers is not None
     if layer_idx > 1:
         done = self._decoder_layers[layer_idx - 2]._seq_calib
+        if done.output_meta is None:
+            raise RuntimeError(
+                f"Layer {done.name} cannot enter 'skip' before a successful replay."
+            )
         done.mode = "skip"
         done.cached_inputs = deque()
     if layer_idx > 0:
         prev = self._decoder_layers[layer_idx - 1]._seq_calib
+        if not prev.collected_inputs:
+            raise RuntimeError(
+                f"Layer {prev.name} has no captured inputs to replay."
+            )
         prev.mode = "run"
         prev.cached_inputs = deque(prev.collected_inputs)
         prev.collected_inputs = []
```

```diff
 forward_loop(self.model)
 info = layer._seq_calib
 inputs = list(info.collected_inputs)
 # After capture, set to original so calib_func can call the layer's
 # real forward directly. The layer will transition to run → skip
 # in subsequent iterations via _set_layer_states.
 info.mode = "original"
+if not inputs:
+    raise RuntimeError(
+        f"Layer {info.name} did not capture any inputs during forward_loop()."
+    )
 return inputs
```

Also applies to: 293-307
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/activation_collector.py` around lines 241 - 265,
The state-transition code in _set_layer_states assumes prior stages
captured/replayed successfully; add fail-fast checks before changing modes:
verify done.output_meta is present (non-empty) before setting done.mode="skip"
and verify prev.collected_inputs is non-empty before setting prev.mode="run" and
copying to prev.cached_inputs; if either check fails, raise a clear RuntimeError
indicating the specific layer index and missing data (use layer_idx and
references to self._decoder_layers[...] ._seq_calib). After forward_loop()
completes, also validate that the current layer's collected_inputs is non-empty
and raise a descriptive error immediately if the capture is empty. Apply the
same guardrails to the analogous transition block later in the file (the other
spot that manipulates ._seq_calib, .collected_inputs and .cached_inputs).
cjluo-nv
left a comment
Review Summary

The core refactoring from O(N²) to O(N) sequential calibration via a skip/run/capture state machine is well-designed. The architecture, cleanup handling, and test coverage are all strong. However, there are a few issues that should be addressed before merge.

Critical: Missing model pattern registrations (regression)

The old `get_decoder_layers()` in network.py supported 5 model patterns. The PR only registers 2 (Nemotron-H and homogeneous HF). Missing support for:

- Megatron/MCore: `model.decoder.layers` (no registration in `plugins/megatron.py`)
- GPT-style: `model.transformer.h`
- Direct layers: `model.layers` (when it's an `nn.ModuleList`)
- Nemotron Super/Nano without `block_type`: old code matched `model.backbone.layers` unconditionally; the new `get_nemotron_h_decoder_layers` requires `block_type` on `layers[0]`

Models that previously worked with sequential calibration will now raise "Could not find transformer layers".

Medium Issues

- Typo in public API: `is_homogenous_hf_model` should be `is_homogeneous_hf_model` (missing 'e'). Appears in the function name, test name, and registration. Worth fixing now before it becomes a stable API.
- `forward_loop is None` guard removed: the old code explicitly raised `ValueError("forward_loop must not be None ...")`. The PR removes this check (and the corresponding test `test_seq_calib_raises_on_none_forward_loop`) without adding an equivalent guard. A `None` forward_loop will now produce an unhelpful `TypeError` deep in the stack.
- `_decoder_layer_support` as a mutable class variable (activation_collector.py:83): this list is shared across all instances and grows monotonically. Tests work around this with `monkeypatch`, but in production, if plugins are re-imported or the module is reloaded, entries could accumulate. Consider documenting this as intentional or adding a guard.

Minor

- The license header on the new file says 2024; it should likely be 2025 for a new file.
- `copy.deepcopy(meta[1])` in `_zeros_from_meta` for the `"other"` case could be expensive for complex non-tensor outputs. A comment noting this is expected to be lightweight values (e.g. `None`) would help.

CI

linux and unit-pr-required-check are failing and should be investigated before merge.

What's Good

- Clean state machine design with skip/run/capture/original modes
- Proper `try/finally` cleanup in both `sequential_calibrate` and `_patch_all_layers`
- The output metadata approach (`_extract_output_meta` / `_zeros_from_meta`) correctly handles tuple/list/tensor outputs for skip layers
- Excellent test coverage (~470 new test lines) covering mode transitions, heterogeneous layers, tuple unpacking, inter-layer norm, error paths, and cleanup
- Extensible plugin-based decoder discovery via `register_decoder_layer_support`
- Backward-compatible re-export of `LayerActivationCollector` from `utils.py`
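A minimal restoration of the removed `forward_loop is None` guard the review mentions might look like this (a sketch; the helper name and the exact error message are illustrative, not the library's actual API, though the `ValueError` wording echoes the old check):

```python
def sequential_calibrate_guard(forward_loop):
    """Fail early with a clear message instead of a TypeError deep in the stack."""
    if forward_loop is None:
        raise ValueError(
            "forward_loop must not be None for sequential calibration; "
            "pass a callable that runs the model over the calibration data."
        )
    if not callable(forward_loop):
        raise TypeError(f"forward_loop must be callable, got {type(forward_loop).__name__}")

# Demonstration: a None loop is rejected at the entry point.
try:
    sequential_calibrate_guard(None)
except ValueError as e:
    print("caught:", e)

sequential_calibrate_guard(lambda model: None)  # a callable passes the guard
```

Placing the check at the top of `sequential_calibrate()` would also let the removed regression test be reinstated unchanged.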
```python
LayerActivationCollector.register_decoder_layer_support(
    is_nemotron_h_model, get_nemotron_h_decoder_layers
)


LayerActivationCollector.register_decoder_layer_support(
    is_homogenous_hf_model, get_homogeneous_hf_decoder_layers
)
```
This does not cover the pattern for Megatron models, which have model.decoder.layers. Maybe a follow up is to add similar support for mcore/
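For reference, a follow-up supporting mcore's `model.decoder.layers` could mirror the PR's registry pattern. The helper names below are hypothetical, and a `SimpleNamespace` stands in for a real Megatron model:

```python
from types import SimpleNamespace


def is_megatron_model(model) -> bool:
    # mcore GPTModel-style models expose their transformer blocks at model.decoder.layers
    return hasattr(model, "decoder") and hasattr(model.decoder, "layers")


def get_megatron_decoder_layers(model):
    return model.decoder.layers


# Registration would then mirror the HF/NemotronH entries, e.g.:
# LayerActivationCollector.register_decoder_layer_support(
#     is_megatron_model, get_megatron_decoder_layers
# )

m = SimpleNamespace(decoder=SimpleNamespace(layers=["layer0", "layer1"]))
print(is_megatron_model(m), len(get_megatron_decoder_layers(m)))  # True 2
```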
```python
    return None


def is_homogenous_hf_model(model: nn.Module) -> bool:
```
Is this a typo? Should it be `is_homogeneous_hf_model`?
yeah! I will fix that. Thanks for the catch!
```python
output_meta: tuple | None = None


class LayerActivationCollector:
```
qq, will this work in TP/EP/PP scenarios, which we usually use for mcore?
It might need additional sync logic to work for these scenarios.
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
🧹 Nitpick comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
1209-1216: Consider extending pattern coverage for other model architectures.

The current implementation only checks `model.model.layers`. As noted in a previous review, this doesn't cover patterns like `model.decoder.layers` (Megatron) or other architectures. While not blocking, consider whether additional patterns should be checked here or in a follow-up PR to improve model coverage.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 1209 - 1216, get_homogeneous_hf_decoder_layers currently only returns model.model.layers; extend its checks to cover other common HF decoder patterns (e.g., model.decoder.layers for Megatron-style models, model.model.decoder.layers, and model.transformer.layers) so more architectures are detected. Inside get_homogeneous_hf_decoder_layers, add checks in order (e.g., hasattr(model, "decoder") and hasattr(model.decoder, "layers"), then hasattr(model, "model") and hasattr(model.model, "decoder") and hasattr(model.model.decoder, "layers"), then hasattr(model, "transformer") and hasattr(model.transformer, "layers")) and return the first matching nn.ModuleList; keep the early return None behavior if none match. Ensure you reference this exact function name when implementing the changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 221748c7-9696-4123-9104-58da5f79313e
📒 Files selected for processing (6)
- modelopt/torch/quantization/activation_collector.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/plugins/huggingface.py
- modelopt/torch/quantization/utils.py
- tests/unit/torch/quantization/plugins/test_huggingface.py
- tests/unit/torch/quantization/test_sequential_calibrate.py
🚧 Files skipped from review as they are similar to previous changes (2)
- modelopt/torch/quantization/activation_collector.py
- tests/unit/torch/quantization/plugins/test_huggingface.py
@sugunav14 can we make `modelopt/torch/quantization/utils` a folder and move this file to the utils folder? We should preserve backward compatibility (so we should set up `__init__` properly)
Okay, I can do that!
```python
# ------------------------------------------------------------------
# Patched forward
# ------------------------------------------------------------------
```

obvious comment, can we remove it
```python
    return tuple(LayerActivationCollector._zeros_from_meta(m) for m in meta[1])
if tag == "list":
    return [LayerActivationCollector._zeros_from_meta(m) for m in meta[1]]
return copy.deepcopy(meta[1])
```

nit:

```suggestion
return meta[1]
```
```python
# ------------------------------------------------------------------

@staticmethod
def _patched_forward(self, *args, **kwargs):
```

nit: we could define this in `_patch_all_layers`
@sugunav14 have you tested an end-to-end flow with it? As a follow-up later, we can add a registry-based plugin for saving the model so far (to enable resume during calibration)
What does this PR do?
Type of change: New feature
The current sequential calibration support has O(N^2) complexity for collecting updated activations for a decoder layer. To solve this, we adopted a modular, plugin-based approach which involves hooks to capture the updated activations by running forward on the previous decoder layer using cached previous-layer activations. This leads to an issue with nested modules, i.e. the logic in the parent module might need to be replicated in the lower-level modules to ensure equivalence. For example, in the Nemotron model, the parent module NemotronHModel has logic to create and select the appropriate mask based on the decoder layer type (mamba vs attention).
This PR implements a more generic solution for sequential calibration by choosing to collect activations using model forward, thereby ensuring that all the parent module logic is preserved. We use an attribute "state" on the modules to indicate whether to recompute or skip the layer while running module forward. This helps us avoid redundant computations when getting updated activations.
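A minimal sketch of that state-machine idea follows. Layer, mode, and exception names here are illustrative; the PR's collector similarly sets a per-layer mode and raises an early-stop exception once the target layer's input is captured:

```python
class _EarlyStop(Exception):
    pass


class Layer:
    def __init__(self, name):
        self.name = name
        self.mode = "run"  # one of: run / skip / capture
        self.captured = None

    def forward(self, x):
        if self.mode == "skip":
            return 0  # zero stand-in built from cached output meta
        if self.mode == "capture":
            self.captured = x  # record the target layer's input ...
            raise _EarlyStop()  # ... and stop the model forward early
        return x + 1  # normal computation


def run_model(layers, x):
    for layer in layers:
        x = layer.forward(x)
    return x


layers = [Layer("l0"), Layer("l1"), Layer("l2")]
layers[2].mode = "capture"  # target layer; earlier layers stay in "run"
try:
    run_model(layers, 10)
except _EarlyStop:
    pass
print(layers[2].captured)  # 12
```

Because the full model forward is what drives each layer, any parent-module logic (mask creation, layer-type dispatch, etc.) runs unchanged.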
The overall flow is as follows
Usage
# Add a code snippet demonstrating how to use this

Testing
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, using `torch.load(..., weights_only=True)`, avoiding `pickle`, etc.).

Additional Information
Summary by CodeRabbit
New Features
Bug Fixes / Reliability
Tests