This repo contains implementations of some popular approaches for efficient deep learning training and inference:
- `batching.py` implements different ways to batch a dataset for LLM training and inference, from simple padding of every batch to a FlashAttention-based method.
- `scaler.py` implements a gradient scaler for mixed precision training, similar to PyTorch's `GradScaler`.
- `profiler.py` implements a simple profiler that exports data in the Perfetto trace event format and exposes a PyTorch-style interface.
- `sync_batchnorm.py` implements PyTorch's `SyncBatchNorm`: BatchNorm that uses AllReduce to aggregate statistics across all GPUs.
- `offloading_linear.py` implements a linear layer with CPU offloading and prefetching.
- `tensor_parallel_llama.py` implements LLaMA with tensor parallelism. As a bonus, each TP linear layer supports gradient checkpointing. The file also contains a TP LLaMA built on PyTorch `DTensor`.
- `fsdp.py` is a simple implementation of Fully Sharded Data Parallel.
- `speculative_decoding.py` is a simple implementation of speculative decoding for faster LLM inference.
- `w8a8_matrix_mul.py` contains Triton kernels for quantization to int8 and int8 matrix multiplication with dequantization.
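For illustration, the simplest strategy covered in `batching.py`, padding every sequence in a batch to the length of the longest one, can be sketched in plain Python (the `pad_batch` helper below is illustrative, not the repo's API):

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one.

    Returns the padded batch and a parallel attention mask
    (1 = real token, 0 = padding).
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


batch, mask = pad_batch([[5, 3, 9], [7], [2, 4]])
# batch: [[5, 3, 9], [7, 0, 0], [2, 4, 0]]
# mask:  [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

Padding wastes compute on the `0` positions, which is exactly what the more advanced batching schemes in the file avoid.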
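The loss-scaling logic behind a `GradScaler`-style scaler can be sketched without torch: multiply the loss by a scale before backward, divide gradients by it before the optimizer step, skip the step and shrink the scale when non-finite gradients appear. `ToyGradScaler` and its parameters are illustrative, not the API of `scaler.py`:

```python
import math


class ToyGradScaler:
    """Minimal loss-scaling sketch mirroring the idea behind GradScaler."""

    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def scale_loss(self, loss):
        # Scale the loss so small fp16 gradients do not underflow.
        return loss * self.scale

    def step(self, grads, apply_step):
        """Unscale grads; call apply_step(unscaled) only if all are finite."""
        unscaled = [g / self.scale for g in grads]
        if any(not math.isfinite(g) for g in unscaled):
            self.scale *= self.backoff_factor  # overflow: shrink the scale
            self._good_steps = 0
            return False
        apply_step(unscaled)
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= self.growth_factor   # long stable run: grow it back
            self._good_steps = 0
        return True
```

The backoff/growth dance keeps the scale as large as fp16 range allows without overflowing, which is the same policy PyTorch's `GradScaler` documents.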
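The trace event format that `profiler.py` targets is plain JSON (the Chrome/Perfetto Trace Event format), so a minimal profiler fits in a few lines. `TraceProfiler` below is a hypothetical stand-in, not the repo's class:

```python
import json
import time
from contextlib import contextmanager


class TraceProfiler:
    """Tiny profiler emitting Chrome/Perfetto trace-event JSON."""

    def __init__(self):
        self.events = []

    @contextmanager
    def record(self, name):
        # Each profiled region becomes one complete ("ph": "X") event
        # with microsecond timestamps.
        start = time.perf_counter_ns() // 1000
        try:
            yield
        finally:
            self.events.append({
                "name": name, "ph": "X", "pid": 0, "tid": 0,
                "ts": start,
                "dur": time.perf_counter_ns() // 1000 - start,
            })

    def export(self):
        """Serialize to JSON that ui.perfetto.dev can open directly."""
        return json.dumps({"traceEvents": self.events})
```

Usage is just `with profiler.record("forward"): ...`, then write `profiler.export()` to a file and drop it into the Perfetto UI.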
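The prefetching idea behind `offloading_linear.py` is to overlap the host-to-device copy of the next layer's weights with the current layer's compute. A sketch with a background thread, where all names are illustrative and `fetch` simulates the copy:

```python
from concurrent.futures import ThreadPoolExecutor


def run_layers_with_prefetch(layers, x, fetch):
    """Run (name, fn) layers whose weights live off-device.

    While layer i computes, layer i+1's weights are fetched on a
    background thread, hiding transfer latency behind compute.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch, layers[0][0])  # prefetch the first layer
        for i, (name, fn) in enumerate(layers):
            weights = future.result()              # wait for this layer's copy
            if i + 1 < len(layers):
                future = pool.submit(fetch, layers[i + 1][0])  # prefetch next
            x = fn(x, weights)
    return x
```

In the real layer the "fetch" would be an asynchronous CPU-to-GPU tensor copy on a side CUDA stream rather than a Python thread, but the overlap structure is the same.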
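Speculative decoding can be sketched greedily: a cheap draft model proposes a few tokens, the target model verifies them, and the longest agreeing prefix is accepted. The sketch below uses toy next-token functions and is not the repo's implementation; in particular, real implementations verify all draft tokens in a single target forward pass rather than one call per position:

```python
def speculative_decode(target_next, draft_next, prompt, k=4, n_new=8):
    """Greedy speculative decoding sketch.

    target_next / draft_next map a token sequence to the next token
    (stand-ins for the large and the small model). The draft proposes
    k tokens at a time; the target accepts the longest matching prefix.
    """
    seq = list(prompt)
    while len(seq) < len(prompt) + n_new:
        # Draft proposes k tokens autoregressively (cheap model).
        proposal = []
        ctx = list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies the proposal token by token.
        for t in proposal:
            expected = target_next(seq)
            if t == expected:
                seq.append(t)         # draft guessed right: token is free
            else:
                seq.append(expected)  # mismatch: keep the target's token
                break
        else:
            # All k accepted; the verification pass also yields one bonus token.
            seq.append(target_next(seq))
    return seq[:len(prompt) + n_new]
```

When the draft agrees with the target often, each expensive verification step emits several tokens instead of one, which is where the inference speedup comes from.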
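The W8A8 scheme can be sketched without Triton: symmetric per-tensor int8 quantization of both operands, integer accumulation, then dequantization by the product of the two scales. Function names below are illustrative, not the kernels in `w8a8_matrix_mul.py`:

```python
def quantize_int8(mat):
    """Symmetric per-tensor int8 quantization of a list-of-lists matrix.

    scale = max(|x|) / 127, q = round(x / scale), clamped to [-127, 127].
    """
    amax = max(abs(v) for row in mat for v in row) or 1.0
    scale = amax / 127.0
    q = [[max(-127, min(127, round(v / scale))) for v in row] for row in mat]
    return q, scale


def int8_matmul_dequant(a, b):
    """Quantize both operands, multiply in integers, dequantize the result."""
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    rows, inner, cols = len(qa), len(qb), len(qb[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0  # int32 accumulator in a real kernel
            for k in range(inner):
                acc += qa[i][k] * qb[k][j]
            out[i][j] = acc * sa * sb  # dequantize: C = (Aq @ Bq) * sa * sb
    return out
```

The point of the scheme is that the inner loop is pure int8-by-int8 multiplication with integer accumulation, which is what makes the Triton kernels fast; the float scales only touch the output once.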