This repo contains implementations of some popular approaches for efficient deep learning training and inference:
- `batching.py` implements different ways to batch a dataset for LLM training and inference, from simple padding of every batch to a FlashAttention-based method.
- `scaler.py` implements a gradient scaler for mixed precision training, similar to PyTorch's `GradScaler`.
- `profiler.py` implements a simple profiler that exports data in the Perfetto trace event format and exposes a PyTorch-style interface.
- `sync_batchnorm.py` implements PyTorch's `SyncBatchNorm`: BatchNorm that uses AllReduce to aggregate statistics across all GPUs.
- `offloading_linear.py` implements a linear layer with CPU offloading and prefetching.
- `tensor_parallel_llama.py` implements LLaMA with tensor parallelism. As a bonus, each TP linear layer supports gradient checkpointing. The file also contains a TP LLaMA built on PyTorch `DTensor`.
- `fsdp.py` is a simple implementation of Fully Sharded Data Parallel.
- `speculative_decoding.py` is a simple implementation of speculative decoding for faster LLM inference.
- `w8a8_matrix_mul.py` contains Triton kernels for quantization to int8 and int8 matrix multiplication with dequantization.
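For illustration, the simplest strategy covered in `batching.py`, padding every sequence in a batch to the length of the longest one, can be sketched in plain Python (the `pad_batch` helper below is illustrative, not the repo's API):

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one.

    Returns the padded batch and a parallel attention mask
    (1 = real token, 0 = padding).
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


batch, mask = pad_batch([[5, 3, 9], [7], [2, 4]])
# batch: [[5, 3, 9], [7, 0, 0], [2, 4, 0]]
# mask:  [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

Padding wastes compute on the `0` positions, which is exactly what the more advanced batching schemes in the file avoid.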
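The loss-scaling logic behind a `GradScaler`-style scaler can be sketched without torch: multiply the loss by a scale before backward, divide gradients by it before the optimizer step, skip the step and shrink the scale when non-finite gradients appear. `ToyGradScaler` and its parameters are illustrative, not the API of `scaler.py`:

```python
import math


class ToyGradScaler:
    """Minimal loss-scaling sketch mirroring the idea behind GradScaler."""

    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def scale_loss(self, loss):
        # Scale the loss so small fp16 gradients do not underflow.
        return loss * self.scale

    def step(self, grads, apply_step):
        """Unscale grads; call apply_step(unscaled) only if all are finite."""
        unscaled = [g / self.scale for g in grads]
        if any(not math.isfinite(g) for g in unscaled):
            self.scale *= self.backoff_factor  # overflow: shrink the scale
            self._good_steps = 0
            return False
        apply_step(unscaled)
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= self.growth_factor   # long stable run: grow it back
            self._good_steps = 0
        return True
```

The backoff/growth dance keeps the scale as large as fp16 range allows without overflowing, which is the same policy PyTorch's `GradScaler` documents.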
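The trace event format that `profiler.py` targets is plain JSON (the Chrome/Perfetto Trace Event format), so a minimal profiler fits in a few lines. `TraceProfiler` below is a hypothetical stand-in, not the repo's class:

```python
import json
import time
from contextlib import contextmanager


class TraceProfiler:
    """Tiny profiler emitting Chrome/Perfetto trace-event JSON."""

    def __init__(self):
        self.events = []

    @contextmanager
    def record(self, name):
        # Each profiled region becomes one complete ("ph": "X") event
        # with microsecond timestamps.
        start = time.perf_counter_ns() // 1000
        try:
            yield
        finally:
            self.events.append({
                "name": name, "ph": "X", "pid": 0, "tid": 0,
                "ts": start,
                "dur": time.perf_counter_ns() // 1000 - start,
            })

    def export(self):
        """Serialize to JSON that ui.perfetto.dev can open directly."""
        return json.dumps({"traceEvents": self.events})
```

Usage is just `with profiler.record("forward"): ...`, then write `profiler.export()` to a file and drop it into the Perfetto UI.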
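The prefetching idea behind `offloading_linear.py` is to overlap the host-to-device copy of the next layer's weights with the current layer's compute. A sketch with a background thread, where all names are illustrative and `fetch` simulates the copy:

```python
from concurrent.futures import ThreadPoolExecutor


def run_layers_with_prefetch(layers, x, fetch):
    """Run (name, fn) layers whose weights live off-device.

    While layer i computes, layer i+1's weights are fetched on a
    background thread, hiding transfer latency behind compute.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch, layers[0][0])  # prefetch the first layer
        for i, (name, fn) in enumerate(layers):
            weights = future.result()              # wait for this layer's copy
            if i + 1 < len(layers):
                future = pool.submit(fetch, layers[i + 1][0])  # prefetch next
            x = fn(x, weights)
    return x
```

In the real layer the "fetch" would be an asynchronous CPU-to-GPU tensor copy on a side CUDA stream rather than a Python thread, but the overlap structure is the same.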
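Speculative decoding can be sketched greedily: a cheap draft model proposes a few tokens, the target model verifies them, and the longest agreeing prefix is accepted. The sketch below uses toy next-token functions and is not the repo's implementation; in particular, real implementations verify all draft tokens in a single target forward pass rather than one call per position:

```python
def speculative_decode(target_next, draft_next, prompt, k=4, n_new=8):
    """Greedy speculative decoding sketch.

    target_next / draft_next map a token sequence to the next token
    (stand-ins for the large and the small model). The draft proposes
    k tokens at a time; the target accepts the longest matching prefix.
    """
    seq = list(prompt)
    while len(seq) < len(prompt) + n_new:
        # Draft proposes k tokens autoregressively (cheap model).
        proposal = []
        ctx = list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies the proposal token by token.
        for t in proposal:
            expected = target_next(seq)
            if t == expected:
                seq.append(t)         # draft guessed right: token is free
            else:
                seq.append(expected)  # mismatch: keep the target's token
                break
        else:
            # All k accepted; the verification pass also yields one bonus token.
            seq.append(target_next(seq))
    return seq[:len(prompt) + n_new]
```

When the draft agrees with the target often, each expensive verification step emits several tokens instead of one, which is where the inference speedup comes from.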
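The W8A8 scheme can be sketched without Triton: symmetric per-tensor int8 quantization of both operands, integer accumulation, then dequantization by the product of the two scales. Function names below are illustrative, not the kernels in `w8a8_matrix_mul.py`:

```python
def quantize_int8(mat):
    """Symmetric per-tensor int8 quantization of a list-of-lists matrix.

    scale = max(|x|) / 127, q = round(x / scale), clamped to [-127, 127].
    """
    amax = max(abs(v) for row in mat for v in row) or 1.0
    scale = amax / 127.0
    q = [[max(-127, min(127, round(v / scale))) for v in row] for row in mat]
    return q, scale


def int8_matmul_dequant(a, b):
    """Quantize both operands, multiply in integers, dequantize the result."""
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    rows, inner, cols = len(qa), len(qb), len(qb[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0  # int32 accumulator in a real kernel
            for k in range(inner):
                acc += qa[i][k] * qb[k][j]
            out[i][j] = acc * sa * sb  # dequantize: C = (Aq @ Bq) * sa * sb
    return out
```

The point of the scheme is that the inner loop is pure int8-by-int8 multiplication with integer accumulation, which is what makes the Triton kernels fast; the float scales only touch the output once.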