
FMS Model Optimizer


Introduction

FMS Model Optimizer is a framework for developing reduced-precision neural network models. It supports quantization techniques such as quantization-aware training (QAT) and post-training quantization (PTQ), along with several other optimization techniques, on popular deep learning workloads.

Highlights

  • Python API to enable model quantization: with the addition of a few lines of code, module-level and/or function-level operations are replaced with their quantized counterparts (see the sketch after this list).
  • Robust: verified for INT 8- and 4-bit quantization on important vision, speech, NLP, object-detection, and LLM workloads.
  • Flexible: options to analyze the network with PyTorch Dynamo and to apply best practices during quantization, such as clip_val initialization, layer-level precision settings, and optimizer param group settings.
  • State-of-the-art INT and FP quantization techniques for weights and activations, such as SmoothQuant, SAWB+, and PACT+.
  • Supports key compute-intensive operations such as Conv2d, Linear, LSTM, MM, and BMM.
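
To give a feel for the API, here is a minimal sketch that quantizes a toy model. The import path fms_mo, the names qconfig_init and qmodel_prep, and the config keys are assumptions patterned after the project's examples; consult the docs for the exact API.

import torch
from fms_mo import qconfig_init, qmodel_prep  # assumed import path

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)
example_input = torch.randn(1, 128)  # example data used to trace the model

qcfg = qconfig_init()   # assumed helper: default quantization config
qcfg["nbits_w"] = 8     # assumed key: weight precision in bits
qcfg["nbits_a"] = 8     # assumed key: activation precision in bits

# Swaps supported modules/functions (Linear, Conv2d, ...) for quantized
# counterparts in place.
qmodel_prep(model, example_input, qcfg)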

Supported Models

               GPTQ   FP8   PTQ   QAT
Granite         ✅     ✅    ✅    🔲
Llama           ✅     ✅    ✅    🔲
Mixtral         ✅     ✅    ✅    🔲
BERT/Roberta    ✅     ✅    ✅    ✅

Note: Direct QAT on LLMs is not recommended

Getting Started

Requirements

  1. 🐧 Linux system with Nvidia GPU (V100/A100/H100)

  2. Python 3.9 to Python 3.11

    📋 Python 3.12 is currently not supported due to a PyTorch Dynamo constraint

  3. CUDA >=12

Optional packages, depending on the optimization functionality required:

  • GPTQ, a popular compression method for LLMs.
  • INT8 deployment for the QAT and PTQ examples, which requires:
    • Nvidia GPU with compute capability >= 8.0 (A100 family or higher)
    • Ninja
    • A clone of the CUTLASS repository
    • PyTorch 2.3.1 (newer versions cause issues with the custom CUDA kernel used in these examples)
  • FP8, a reduced-precision format similar to INT8.
  • Compute graph plotting (mostly for troubleshooting purposes).

Note

PyTorch version should be < 2.4 if you would like to experiment with deployment using the external INT8 kernel.
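
A quick sanity check that your environment meets these requirements can be done in a few lines of plain PyTorch (no FMS Model Optimizer APIs involved):

import sys
import torch

assert (3, 9) <= sys.version_info[:2] <= (3, 11), "Python 3.9-3.11 required"
assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
print("PyTorch:", torch.__version__)  # keep < 2.4 for the external INT8 kernel
print("CUDA:", torch.version.cuda)    # should be >= 12
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")  # >= 8.0 for the INT8 examples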

Installation

We recommend using a Python virtual environment with Python 3.9+. Here is how to set one up using Python venv:

python3 -m venv fms_mo_venv
source fms_mo_venv/bin/activate

Tip

If you use pyenv, Conda Miniforge or other such tools for Python version management, create the virtual environment with that tool instead of venv. Otherwise, you may have issues with installed packages not being found as they are linked to your Python version management tool and not venv.

There are two ways to install FMS Model Optimizer:

From Release

To install from a release (PyPI package):

python3 -m venv fms_mo_venv
source fms_mo_venv/bin/activate
pip install fms-model-optimizer

From Source

To install from source (GitHub repository):

python3 -m venv fms_mo_venv
source fms_mo_venv/bin/activate
git clone https://github.com/foundation-model-stack/fms-model-optimizer
cd fms-model-optimizer
pip install -e .
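
Whichever method you use, you can verify the installation by importing the package. A minimal check, assuming the package imports as fms_mo (verify against the docs):

import fms_mo  # import name is an assumption

# Confirm which installation was picked up.
print("fms-mo loaded from:", fms_mo.__file__)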

Try It Out!

To help you get up and running as quickly as possible with the FMS Model Optimizer framework, check out the following resources, which demonstrate how to use the framework with different quantization techniques:

  • Jupyter notebook tutorials (the recommended starting point):
    • Quantization tutorial:
      • Visualizes a random Gaussian tensor step by step through the quantization process
      • Builds a quantizer and a quantized convolution module based on that process
  • Python script examples
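
If you would like a feel for what the quantization tutorial covers before opening it, the sketch below performs a generic per-tensor affine quantize/dequantize round trip on a random Gaussian tensor in plain PyTorch. It illustrates the concept only; it is not the tutorial's actual code.

import torch

def quantize_dequantize(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    qmin, qmax = 0, 2**n_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)  # map value range to int range
    zero_point = qmin - torch.round(x.min() / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale              # dequantize back to float

x = torch.randn(1000)                            # random Gaussian tensor
x_hat = quantize_dequantize(x, n_bits=4)
print("mean abs error:", (x - x_hat).abs().mean().item())

Lowering n_bits makes the staircase effect of quantization, and hence the reconstruction error, more pronounced.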

Docs

Dive into the design document to get a better understanding of the framework motivation and concepts.

Contributing

Check out our contributing guide to learn how to contribute.