Pylomin (PYtorch LOw-Memory INference) is a deep learning optimization library for low-memory inference in PyTorch.
The scale of deep learning models has grown exponentially in recent years, which has greatly increased the difficulty of product deployment.
*(Figure: growth of deep learning model sizes over time; image source: Microsoft Research Blog)*
The goal of this library is to enable low-cost deployment of deep learning models:
- Extremely low memory requirement
  - For example, we can reduce the peak memory requirement for inference of a BERT-like model (with 1.6 GiB of parameters) to 46 MiB.
- Minimize memory requirements while maintaining model throughput
  - Eliminate the time spent waiting for parameters to load by prefetching (under development)
  <!-- TODO: add a number here after development -->
Peak memory is the maximum amount of memory needed to store model parameters and hidden states at any point during model inference.
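For reference, peak memory can be checked empirically. Below is a minimal sketch (not part of pylomin) that uses the process's peak resident set size as a proxy; note that `resource.getrusage` reports it in KiB on Linux.

```python
# Minimal sketch (not part of pylomin): approximate peak memory during
# inference via the process's peak RSS (Linux reports ru_maxrss in KiB).
import resource

import torch

model = torch.nn.Linear(4096, 4096)  # stand-in for a real model
with torch.inference_mode():
    model(torch.randn(1, 4096))

peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f'peak memory: {peak_kib / 1024:.1f} MiB')
```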
Installation:

```bash
pylomin$ python3 -m pip install -e .
```
Load model parameters only when needed and release them immediately after use.
```python
model = pylomin.lazy_loading(model)
```
Or provide a list of `target_classes` or `target_modules` to be converted to lazy-loading mode. In addition, when using `target_classes`, you can also provide a list of modules to be skipped.
```python
# Use target_classes
model = pylomin.lazy_loading(model, target_classes=[nn.Linear, nn.Embedding],
                             skip_modules=[model.embeddings.word_embeddings])

# Use target_modules
target_modules = [module for module in model.modules() if some_condition]
model = pylomin.lazy_loading(model, target_modules=target_modules)
```
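To illustrate the idea (a hypothetical sketch, not pylomin's actual implementation), lazy loading can be built from PyTorch forward hooks: a pre-hook attaches a module's weights from disk just before it runs, and a post-hook drops them right after. The `make_lazy` helper and `weight_path` below are illustrative names.

```python
import torch
import torch.nn as nn

def make_lazy(module: nn.Module, weight_path: str) -> nn.Module:
    """Load `module`'s parameters before each forward, free them after."""

    def load(mod, inputs):
        state = torch.load(weight_path)
        for name, param in mod.named_parameters():
            param.data = state[name]      # attach real storage just before use

    def release(mod, inputs, output):
        for param in mod.parameters():
            param.data = torch.empty(0)   # drop the storage right after use

    module.register_forward_pre_hook(load)
    module.register_forward_hook(release)
    return module

# Usage: persist the weights once, then keep only empty placeholders in RAM.
layer = nn.Linear(4096, 4096)
torch.save(layer.state_dict(), 'layer.pt')
make_lazy(layer, 'layer.pt')
for p in layer.parameters():
    p.data = torch.empty(0)               # start with parameters released
out = layer(torch.randn(1, 4096))         # loaded, used, and freed transparently
```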
Attempts to split a `torch.nn.Embedding` layer into multiple chunks, each having `num_embeddings` equal to `chunk_size`, except possibly the last one.
```python
model = pylomin.chunked_embedding(model,
                                  target_module_name='embeddings.word_embeddings',
                                  chunk_size=2048)
```
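Conceptually, chunking replaces one large embedding table with several small ones and routes each token id to the chunk that owns it, so that (combined with lazy loading) only the chunks a batch actually references need to be in memory. A hypothetical sketch of this idea (not pylomin's actual code; `ChunkedEmbedding` is an illustrative name):

```python
import torch
import torch.nn as nn

class ChunkedEmbedding(nn.Module):
    """Split one nn.Embedding into chunks of `chunk_size` rows each."""

    def __init__(self, embedding: nn.Embedding, chunk_size: int):
        super().__init__()
        self.chunk_size = chunk_size
        self.embedding_dim = embedding.embedding_dim
        self.chunks = nn.ModuleList()
        for start in range(0, embedding.num_embeddings, chunk_size):
            rows = embedding.weight.data[start:start + chunk_size]
            chunk = nn.Embedding(rows.size(0), self.embedding_dim)
            chunk.weight.data = rows      # the last chunk may be smaller
            self.chunks.append(chunk)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Which chunk owns each token id.
        owner = torch.div(input_ids, self.chunk_size, rounding_mode='floor')
        out = torch.empty(*input_ids.shape, self.embedding_dim)
        for i, chunk in enumerate(self.chunks):
            mask = owner == i
            if mask.any():                # look up only within the owning chunk
                out[mask] = chunk(input_ids[mask] - i * self.chunk_size)
        return out
```

For example, `ChunkedEmbedding(nn.Embedding(30522, 768), chunk_size=2048)` behaves like the original table while exposing per-chunk submodules that a lazy-loading wrapper can load independently.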
See `examples/`.