63 changes: 50 additions & 13 deletions examples/compress/README.md
@@ -9,7 +9,7 @@ The supported modifications are:

To use the Puzzle algorithm effectively, we need to specify the target number of parameters and/or the target memory. The final stage uses a Mixed-Integer Programming (MIP) algorithm to find the optimal combination of layer modifications that satisfies the target requirements.
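
For example, these targets are expressed as MIP "human constraints" in the compression config; a minimal sketch (the values here are illustrative):

```yaml
mip:
  human_constraints:
    target_memory: 78_000        # GPU memory budget in MiB
    num_params: 7_000_000_000    # upper bound on the number of parameters
```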

In this example, we compress the [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model, reducing GPU memory usage from 113 GiB to 96 GiB (a 15% reduction) with less than 1% regression in the token_accuracy_top_10 metric.
In this example, we compress the [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model, reducing GPU memory usage from 113 GiB to 96 GiB (a 15% reduction) with less than 1% regression in the token_accuracy_top_10 metric.

## Environment

@@ -21,28 +21,40 @@ pip install -e .[hf,compress]

- For this example, we use 2x NVIDIA H100 80GB HBM3 GPUs to show the multi-GPU steps. You can also use a single GPU.

## Compress the Model

1. Specify the `puzzle_dir`, `input_hf_model_path`, `dataset_path`, `intermediate_size_list`, and `target_memory` arguments in the [llama-3_1-8B_pruneffn_memory.yaml](./configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml) configuration file.
- To make use of [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2), you need to accept the terms and conditions of the corresponding model and dataset on the Hugging Face Hub. Log in to the Hugging Face Hub and enter your HF token.

**_NOTE:_**
How to choose `intermediate_size_list`?
The list specifies the candidate FFN sizes to search over. It is recommended to choose several pruning sizes (e.g., 15%, 20%, or 30% of the original size). Note that the values must be hardware-friendly (divisible by 256) to avoid issues with tensor operations in subsequent steps.
```bash
hf auth login
```

Let's first aim for a 32% GPU memory reduction by setting `target_memory = 78_000` MiB. This means that the algorithm will choose the candidates with the highest accuracy that also meet the specified requirements.
## Compress the Model

2. Download and prepare the [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2).
1. Download and prepare the [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2).

Dataset splits: "code", "math", "stem", and "chat", excluding reasoning samples (2.62 GB).

```bash
python -m modelopt.torch._compress.dataset.prepare_dataset --dataset_name nvidia/Nemotron-Post-Training-Dataset-v2 --output_dir path/to/Nemotron-Post-Training-Dataset-v2
```

2. Specify the `puzzle_dir`, `input_hf_model_path`, `dataset_path`, `intermediate_size_list`, and `target_memory` arguments in the [llama-3_1-8B_pruneffn_memory.yaml](./configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml) configuration file.

- `puzzle_dir` specifies a new directory where the compression results are saved.
- `input_hf_model_path` points to the local directory containing the input model checkpoint.
- `dataset_path` points to the directory containing the dataset downloaded in the previous step.

**_NOTE:_**
How to choose `intermediate_size_list`?
The list specifies the candidate FFN sizes to search over. It is recommended to choose several pruning sizes (e.g., 15%, 20%, or 30% of the original size). Note that the values must be hardware-friendly (divisible by 256) to avoid issues with tensor operations in subsequent steps.

Let's first aim for a 32% GPU memory reduction by setting `target_memory = 78_000` MiB. This means that the algorithm will choose the candidates with the highest accuracy that also meet the specified requirements.

We can also set the target size of the resulting model using `num_params = 7_000_000_000`. This will be used as an upper bound for the number of parameters of the model.
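
For example, the relevant configuration entries might look like the following sketch (the paths and FFN sizes are placeholders, and the exact nesting of `intermediate_size_list` may differ in the shipped YAML); combine them with the `mip.human_constraints` shown earlier:

```yaml
puzzle_dir: /workspace/puzzle_dir                       # new directory for the compression results
input_hf_model_path: /models/Llama-3.1-8B-Instruct      # local checkpoint of the input model
dataset_path: /data/Nemotron-Post-Training-Dataset-v2   # dataset prepared in step 1

# FFN intermediate sizes to search over (each divisible by 256)
pruning:
  intermediate_size_list: [10240, 11520, 12800]
```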

3. Run the compression script.

```bash
torchrun --nproc_per_node 2 examples/compress/main.py --config path/to/llama-3_1-8B_pruneffn_memory.yaml 2>&1 | tee ./log.txt | grep "Compress Progress"
torchrun --nproc_per_node 2 examples/compress/main.py --config examples/compress/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml 2>&1 | tee ./log.txt | grep "Compress Progress"
```

This will save the full output to `log.txt` and display the following progress on screen:
@@ -110,7 +122,7 @@
Average losses = {'lm_loss': 1.7577573340386152, 'token_accuracy_top_1': 0.6225490570068359, 'token_accuracy_top_5': 0.846257209777832, 'token_accuracy_top_10': 0.8987817764282227}
```

A 30% GPU memory reduction leads to nearly a 5% regression in the token_accuracy_top_10 metric (0.898 vs. 0.942). Let's rerun the MIP search aiming for a 15% memory reduction.
A 30% GPU memory reduction leads to nearly a 5% regression in the token_accuracy_top_10 metric (0.898 vs. 0.942).

## Re-run MIP Search with different constraints
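
To target the 15% reduction (96 GiB) mentioned in the introduction, the memory constraint can be relaxed before re-running the MIP search; a minimal sketch:

```yaml
mip:
  human_constraints:
    target_memory: 96_000   # 96 GiB, roughly a 15% reduction from the original 113 GiB
```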

@@ -194,6 +206,31 @@ lm_eval --model hf \
--batch_size 4
```

## Advanced usage
## Inference Performance Benchmarking

Now let's evaluate how much speedup we get with the compressed model in terms of throughput and latency.

- Install [vLLM from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source).
- Rearrange the model safetensors so they can be loaded by vLLM.

```bash
cd path/to/model
# Move the per-subblock shards into the model root
mv subblocks_safetensors/* .
# Update the weight index so it no longer references the subblocks_safetensors/ prefix
sed -i 's+subblocks_safetensors/++g' model.safetensors.index.json
```

- Benchmark latency

```bash
vllm bench latency --model path/to/model --load-format safetensors --trust-remote-code
```

- Benchmark throughput

```bash
vllm bench throughput --model path/to/model --input-len 2000 --output-len 100 --load-format safetensors --trust-remote-code
```

## Advanced Usage

Modify `path/to/Llama-3_1-8B yaml` file for advanced compression scenarios.
Modify the `llama-3_1-8B_pruneffn_memory.yaml` file for advanced compression scenarios.
@@ -9,7 +9,7 @@ defaults:
puzzle_dir: ???
teacher_dir: ${puzzle_dir}/ckpts/teacher/
replacement_library_path: ${puzzle_dir}/replacement_library.json
dataset_path: ??? # path to v0.4_mini
dataset_path: ??? # path to Nemotron-Post-Training-Dataset-v2

skip_realize_model: false

@@ -40,7 +40,7 @@ scoring:
teacher_dir: ${to_path:${teacher_dir}}
output_dir: ${puzzle_dir}/single_sequence_replacement_solutions--validation

eval_samples: 10 # default is 128
eval_samples: 128
micro_batch_size: 1
seed: 42
shuffle_seed: 444
@@ -77,6 +77,7 @@ mip:

human_constraints:
target_memory: 78_000
num_params: 7_000_000_000

mip_constraints:
metric_overrides:
@@ -14,7 +14,7 @@ puzzle_dir: /workspace/puzzle_dir
# MIP memory constraint (in MiB)
mip:
human_constraints:
target_memory: 96_000 # 96 GiB
target_memory: 78_000 # 78 GiB

# FFN intermediate sizes to search over (heterogeneous architecture)
pruning:
@@ -14,4 +14,4 @@ write_results: false
calc_losses_on_cpu: false
activations_log_dir:
model_name_or_path:
load_dataset_fn: ${get_object:utils.data.dataloaders.load_from_disk_fn}
load_dataset_fn: ${get_object:modelopt.torch._compress.utils.data.dataloaders.load_from_disk_fn}
@@ -1020,6 +1020,9 @@ def __init__(self, config: DeciLMConfig, layer_idx: int | tuple[int, ...]):
self.ffn_config = self.block_config.ffn
self.layer_idx = layer_idx

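# Fall back to the eager attention implementation when none is configured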
if not config._attn_implementation:
config._attn_implementation = "eager"

if not self.attention_config.no_op:
self.input_layernorm = DeciLMRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
if self.attention_config.replace_with_linear:
@@ -471,7 +471,6 @@ def create_child_state_dict(
copy_start_time = time.time()
keys_to_copy_from_orig_model = set(keys.values()) - ignored_keys
for key in keys_to_copy_from_orig_model:
aprint(f"copying {key} from original_state_dict")
# Memory optimization: avoid unnecessary copies
tensor = original_state_dict[key]
if not tensor.is_contiguous():
@@ -877,7 +876,6 @@ def _cache_activations_log(mlp_init_config: dict[str, Any]) -> None:
if len(ACTIVATIONS_LOG) == 0:
assert "activations_log_dir" in mlp_init_config
activations_log_dir = mlp_init_config["activations_log_dir"]
print(f"Loading activations_log from {activations_log_dir}")
ACTIVATIONS_LOG.update(
{
module_name: module_log
@@ -158,8 +158,8 @@ def validate_puzzle_solutions(args: DictConfig) -> None:
list(zip(args.solutions_to_validate, puzzle_solutions)), desc="Validating solutions"
):
layer_replacements = _extract_layer_replacements_from_puzzle_solution(puzzle_solution)
realizable_as_symlinks = can_realize_as_symlinks(layer_replacements)
# realizable_as_symlinks = False
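# Force realizable_as_symlinks to False so every solution is fully materialized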
# realizable_as_symlinks = can_realize_as_symlinks(layer_replacements)
realizable_as_symlinks = False
model_config = replacement_library.create_model_config(layer_replacements)
if (args.save_models and not realizable_as_symlinks) or (not args.skip_validation):
model = replacement_library.load_model(layer_replacements)
6 changes: 5 additions & 1 deletion modelopt/torch/nas/plugins/megatron_hooks/__init__.py
@@ -14,6 +14,10 @@
# limitations under the License.
"""Forward hooks for estimating importance scores for pruning."""

from modelopt.torch.utils import import_plugin

from .base_hooks import *
from .base_hooks_analysis import *
from .megatron_hooks import *
Collaborator:

We don't want to remove it but rather guard the import so that if Megatron is not present, this does not raise an error. Like so:

```python
from modelopt.torch.utils import import_plugin

with import_plugin("megatron_hooks"):
    from .megatron_hooks import *
```

Author:

Fixed, thanks


with import_plugin("megatron_hooks"):
    from .megatron_hooks import *