Pylomin

Pylomin (PYtorch LOw-Memory INference) is a deep learning optimization library for low-memory inference in PyTorch.

Motivation

The scale of deep learning models has grown exponentially in recent years, which has greatly increased the difficulty of product deployment.

(Figure: exponential growth of model sizes over time. Image source: Microsoft Research Blog)

The goal of this library is to enable low-cost deployment of deep learning models:

  • Extremely low memory requirements
    • For example, the peak memory needed to run inference with a BERT-like model (1.6 GiB of parameters) can be reduced to 46 MiB.
  • Minimize memory requirements while maintaining model throughput
    • Eliminate the time spent waiting for parameters to load by prefetching them (under development)
    • TODO: add a number here after development

Peak memory is the maximum amount of memory needed to store model parameters and hidden states at any point during inference.
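
To check a number like this yourself, peak host memory during inference can be tracked by sampling the process RSS from a background thread. The sketch below is illustrative and not part of pylomin; it assumes CPU inference, uses the third-party psutil package, and peak_rss_during is a made-up helper name:

import threading
import time

import psutil


def peak_rss_during(fn, interval=0.01):
    """Run fn() while sampling this process's RSS; return (result, peak_bytes)."""
    proc = psutil.Process()
    peak = proc.memory_info().rss
    done = threading.Event()

    def sampler():
        nonlocal peak
        while not done.is_set():
            peak = max(peak, proc.memory_info().rss)
            time.sleep(interval)

    thread = threading.Thread(target=sampler)
    thread.start()
    try:
        result = fn()
    finally:
        done.set()
        thread.join()
    return result, peak

# Example: _, peak = peak_rss_during(lambda: model(inputs))
#          print(f'peak memory: {peak / 2**20:.1f} MiB')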

Installation

From the root of the repository:

pylomin$ python3 -m pip install -e .

Getting Started

1. Lazy-loading

Load model parameters only when needed and release them immediately after use.

model = pylomin.lazy_loading(model)

Alternatively, provide a list of target_classes or target_modules to restrict which modules are converted to lazy-loading mode. When using target_classes, you can also provide skip_modules, a list of modules to leave untouched.

# Use target_classes
model = pylomin.lazy_loading(model, target_classes=[nn.Linear, nn.Embedding],
                             skip_modules=[model.embeddings.word_embeddings])

# Use target_modules
target_modules = [module for module in model.modules() if some_condition]
model = pylomin.lazy_loading(model, target_modules=target_modules)
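
Putting it together, here is a minimal end-to-end sketch. The toy model and shapes are invented for illustration; only pylomin.lazy_loading and its arguments come from the snippets above, and a real deployment may need additional configuration (e.g., where offloaded weights live):

import torch
import torch.nn as nn

import pylomin

# A toy stand-in for a real network (hypothetical, for illustration only).
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
model.eval()

# Convert every nn.Linear layer to lazy-loading mode.
model = pylomin.lazy_loading(model, target_classes=[nn.Linear])

with torch.no_grad():
    output = model(torch.randn(1, 1024))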

2. Chunked-embedding

Attempts to split a torch.nn.Embedding layer into multiple chunks, each with num_embeddings equal to chunk_size except possibly the last.

model = pylomin.chunked_embedding(model,
                                  target_module_name='embeddings.word_embeddings',
                                  chunk_size=2048)
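
To illustrate the underlying idea (a hand-rolled sketch, not pylomin's actual implementation; NaiveChunkedEmbedding is a made-up name), a large vocabulary can be split into fixed-size row chunks so that each lookup only touches the chunks its indices fall into:

import torch
import torch.nn as nn


class NaiveChunkedEmbedding(nn.Module):
    """Illustrative only: split one embedding table into chunk_size-row pieces."""

    def __init__(self, weight, chunk_size):
        super().__init__()
        self.chunk_size = chunk_size
        # Every chunk holds chunk_size rows, except possibly the last one.
        self.chunks = nn.ModuleList(
            nn.Embedding.from_pretrained(weight[i:i + chunk_size])
            for i in range(0, weight.size(0), chunk_size)
        )

    def forward(self, input_ids):
        out = torch.empty(*input_ids.shape, self.chunks[0].embedding_dim,
                          dtype=self.chunks[0].weight.dtype)
        for idx, chunk in enumerate(self.chunks):
            low = idx * self.chunk_size
            mask = (input_ids >= low) & (input_ids < low + self.chunk_size)
            if mask.any():
                out[mask] = chunk(input_ids[mask] - low)
        return out

Combined with lazy-loading, chunks whose rows are never indexed by a given input presumably never need their weights resident, which is what keeps the embedding layer's peak memory low.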

Examples

See examples/.
