63 changes: 50 additions & 13 deletions examples/compress/README.md
@@ -9,7 +9,7 @@ The supported modifications are:

To use the Puzzle algorithm effectively, we need to specify the target number of parameters and/or the target memory. The final stage uses a Mixed-Integer Programming (MIP) algorithm to find the optimal combination of layer modifications that satisfies the target requirements.
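
For example, these targets are expressed as MIP "human constraints" in the compression config; a minimal sketch (the values here are illustrative):

```yaml
mip:
  human_constraints:
    target_memory: 78_000        # GPU memory budget in MiB
    num_params: 7_000_000_000    # upper bound on the number of parameters
```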

In this example, we compress the [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model, reducing GPU memory usage from 113 GiB to 96 GiB (a 15% reduction) with less than 1% regression in the token_accuracy_top_10 metric.
In this example, we compress the [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model, reducing GPU memory usage from 113 GiB to 96 GiB (a 15% reduction) with less than 1% regression in the token_accuracy_top_10 metric.

## Environment

@@ -21,28 +21,40 @@ pip install -e .[hf,compress]

- For this example, we use 2x NVIDIA H100 80GB HBM3 GPUs to show the multi-GPU steps. You can also use a single GPU.

## Compress the Model

1. Specify the `puzzle_dir`, `input_hf_model_path`, `dataset_path`, `intermediate_size_list`, and `target_memory` arguments in the [llama-3_1-8B_pruneffn_memory.yaml](./configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml) configuration file.
- To make use of [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2), you need to accept the terms and conditions of the corresponding model and dataset on the Hugging Face Hub. Log in to the Hugging Face Hub and enter your HF token.

**_NOTE:_**
How to choose `intermediate_size_list`?
The list specifies the candidate FFN sizes to search over. It is recommended to choose several pruning sizes (e.g., 15%, 20%, or 30% of the original size). Note that the values must be hardware-friendly (divisible by 256) to avoid issues with tensor operations in subsequent steps.
```bash
hf auth login
```

Let's first aim for a 32% GPU memory reduction by setting `target_memory = 78_000` MiB. This means that the algorithm will choose the candidates with the highest accuracy that also meet the specified requirements.
## Compress the Model

2. Download and prepare the [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2).
1. Download and prepare the [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2).

Dataset splits: "code", "math", "stem", and "chat", excluding reasoning samples (2.62 GB).

```bash
python -m modelopt.torch._compress.dataset.prepare_dataset --dataset_name nvidia/Nemotron-Post-Training-Dataset-v2 --output_dir path/to/Nemotron-Post-Training-Dataset-v2
```

2. Specify the `puzzle_dir`, `input_hf_model_path`, `dataset_path`, `intermediate_size_list`, and `target_memory` arguments in the [llama-3_1-8B_pruneffn_memory.yaml](./configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml) configuration file.

- `puzzle_dir` specifies a new directory where the compression results are saved.
- `input_hf_model_path` points to the local directory containing the input model checkpoint.
- `dataset_path` points to the directory containing the dataset downloaded in the previous step.

**_NOTE:_**
How to choose `intermediate_size_list`?
The list specifies the candidate FFN sizes to search over. It is recommended to choose several pruning sizes (e.g., 15%, 20%, or 30% of the original size). Note that the values must be hardware-friendly (divisible by 256) to avoid issues with tensor operations in subsequent steps.

Let's first aim for a 32% GPU memory reduction by setting `target_memory = 78_000` MiB. This means that the algorithm will choose the candidates with the highest accuracy that also meet the specified requirements.

We can also set the target size of the resulting model using `num_params = 7_000_000_000`. This will be used as an upper bound for the number of parameters of the model.
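
For example, the relevant configuration entries might look like the following sketch (the paths and FFN sizes are placeholders, and the exact nesting of `intermediate_size_list` may differ in the shipped YAML); combine them with the `mip.human_constraints` shown earlier:

```yaml
puzzle_dir: /workspace/puzzle_dir                       # new directory for the compression results
input_hf_model_path: /models/Llama-3.1-8B-Instruct      # local checkpoint of the input model
dataset_path: /data/Nemotron-Post-Training-Dataset-v2   # dataset prepared in step 1

# FFN intermediate sizes to search over (each divisible by 256)
pruning:
  intermediate_size_list: [10240, 11520, 12800]
```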

3. Run the compression script.

```bash
torchrun --nproc_per_node 2 examples/compress/main.py --config path/to/llama-3_1-8B_pruneffn_memory.yaml 2>&1 | tee ./log.txt | grep "Compress Progress"
torchrun --nproc_per_node 2 examples/compress/main.py --config examples/compress/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml 2>&1 | tee ./log.txt | grep "Compress Progress"
```

This will save the full output to `log.txt` and display the following progress on screen:
@@ -110,7 +122,7 @@
Average losses = {'lm_loss': 1.7577573340386152, 'token_accuracy_top_1': 0.6225490570068359, 'token_accuracy_top_5': 0.846257209777832, 'token_accuracy_top_10': 0.8987817764282227}
```

A 30% GPU memory reduction leads to nearly a 5% regression in the token_accuracy_top_10 metric (0.898 vs. 0.942). Let's rerun the MIP search aiming for a 15% memory reduction.
A 30% GPU memory reduction leads to nearly a 5% regression in the token_accuracy_top_10 metric (0.898 vs. 0.942).

## Re-run MIP Search with different constraints
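
To target the 15% reduction (96 GiB) mentioned in the introduction, the memory constraint can be relaxed before re-running the MIP search; a minimal sketch:

```yaml
mip:
  human_constraints:
    target_memory: 96_000   # 96 GiB, roughly a 15% reduction from the original 113 GiB
```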

@@ -194,6 +206,31 @@ lm_eval --model hf \
--batch_size 4
```

## Advanced usage
## Inference Performance Benchmarking

Now let's evaluate how much speedup we get with the compressed model in terms of throughput and latency.

- Install [vLLM from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source).
- Rearrange the model safetensors so they can be loaded by vLLM.

```bash
cd path/to/model
# Move the per-subblock shards into the model root
mv subblocks_safetensors/* .
# Update the weight index so it no longer references the subblocks_safetensors/ prefix
sed -i 's+subblocks_safetensors/++g' model.safetensors.index.json
```

- Benchmark latency

```bash
vllm bench latency --model path/to/model --load-format safetensors --trust-remote-code
```

- Benchmark throughput

```bash
vllm bench throughput --model path/to/model --input-len 2000 --output-len 100 --load-format safetensors --trust-remote-code
```

## Advanced Usage

Modify `path/to/Llama-3_1-8B yaml` file for advanced compression scenarios.
Modify the `llama-3_1-8B_pruneffn_memory.yaml` file for advanced compression scenarios.
@@ -9,7 +9,7 @@ defaults:
puzzle_dir: ???
teacher_dir: ${puzzle_dir}/ckpts/teacher/
replacement_library_path: ${puzzle_dir}/replacement_library.json
dataset_path: ??? # path to v0.4_mini
dataset_path: ??? # path to Nemotron-Post-Training-Dataset-v2

skip_realize_model: false

@@ -40,7 +40,7 @@ scoring:
teacher_dir: ${to_path:${teacher_dir}}
output_dir: ${puzzle_dir}/single_sequence_replacement_solutions--validation

eval_samples: 10 # default is 128
eval_samples: 128
micro_batch_size: 1
seed: 42
shuffle_seed: 444
@@ -77,6 +77,7 @@ mip:

human_constraints:
target_memory: 78_000
num_params: 7_000_000_000

mip_constraints:
metric_overrides:
@@ -14,7 +14,7 @@ puzzle_dir: /workspace/puzzle_dir
# MIP memory constraint (in MiB)
mip:
human_constraints:
target_memory: 96_000 # 96 GiB
target_memory: 78_000 # 78 GiB

# FFN intermediate sizes to search over (heterogeneous architecture)
pruning:
@@ -14,4 +14,4 @@ write_results: false
calc_losses_on_cpu: false
activations_log_dir:
model_name_or_path:
load_dataset_fn: ${get_object:utils.data.dataloaders.load_from_disk_fn}
load_dataset_fn: ${get_object:modelopt.torch._compress.utils.data.dataloaders.load_from_disk_fn}
@@ -1020,6 +1020,9 @@ def __init__(self, config: DeciLMConfig, layer_idx: int | tuple[int, ...]):
self.ffn_config = self.block_config.ffn
self.layer_idx = layer_idx

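# Fall back to the eager attention implementation when none is configured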
if not config._attn_implementation:
config._attn_implementation = "eager"

if not self.attention_config.no_op:
self.input_layernorm = DeciLMRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
if self.attention_config.replace_with_linear:
@@ -471,7 +471,6 @@ def create_child_state_dict(
copy_start_time = time.time()
keys_to_copy_from_orig_model = set(keys.values()) - ignored_keys
for key in keys_to_copy_from_orig_model:
aprint(f"copying {key} from original_state_dict")
# Memory optimization: avoid unnecessary copies
tensor = original_state_dict[key]
if not tensor.is_contiguous():
@@ -877,7 +876,6 @@ def _cache_activations_log(mlp_init_config: dict[str, Any]) -> None:
if len(ACTIVATIONS_LOG) == 0:
assert "activations_log_dir" in mlp_init_config
activations_log_dir = mlp_init_config["activations_log_dir"]
print(f"Loading activations_log from {activations_log_dir}")
ACTIVATIONS_LOG.update(
{
module_name: module_log
@@ -158,8 +158,8 @@ def validate_puzzle_solutions(args: DictConfig) -> None:
list(zip(args.solutions_to_validate, puzzle_solutions)), desc="Validating solutions"
):
layer_replacements = _extract_layer_replacements_from_puzzle_solution(puzzle_solution)
realizable_as_symlinks = can_realize_as_symlinks(layer_replacements)
# realizable_as_symlinks = False
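# Force realizable_as_symlinks to False so every solution is fully materialized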
# realizable_as_symlinks = can_realize_as_symlinks(layer_replacements)
realizable_as_symlinks = False
model_config = replacement_library.create_model_config(layer_replacements)
if (args.save_models and not realizable_as_symlinks) or (not args.skip_validation):
model = replacement_library.load_model(layer_replacements)
6 changes: 5 additions & 1 deletion modelopt/torch/nas/plugins/megatron_hooks/__init__.py
@@ -14,6 +14,10 @@
# limitations under the License.
"""Forward hooks for estimating importance scores for pruning."""

from modelopt.torch.utils import import_plugin

from .base_hooks import *
from .base_hooks_analysis import *
from .megatron_hooks import *
Collaborator:

We don't want to remove it but rather guard the import so that if Megatron is not present, this does not raise an error. Like so:

```python
from modelopt.torch.utils import import_plugin

with import_plugin("megatron_hooks"):
    from .megatron_hooks import *
```

Author:

Fixed, thanks


with import_plugin("megatron_hooks"):
    from .megatron_hooks import *