Highlights
- AMD GPU: We have partnered with Embedded LLM to adjust the Triton configuration to fully support AMD! With version 0.4.0, you can run multi-GPU training with 26% higher speed and 60% lower memory usage on AMD. See the full blog post at https://embeddedllm.com/blog/cuda-to-rocm-portability-case-study-liger-kernel. @Edenzzzz @DocShotgun @tjtanaa
- Technical Report: We have published a technical report on arXiv (https://arxiv.org/pdf/2410.10989) with extensive detail.
- Modal CI: We have moved our entire GPU CI stack to Modal! Thanks to intelligent Docker layer caching and blazingly fast container startup time and scheduling, we have reduced the CI overhead by over 10x (from minutes to seconds).
- LLaMA 3.2-Vision Model: We have added kernel support for the LLaMA 3.2-Vision model. You can easily use `liger_kernel.transformers.apply_liger_kernel_to_mllama` to patch the model. @tyler-romero @shivam15s
- JSD Kernel: We have added the JSD kernel for distillation, which also comes with a chunked version! @Tcc0403 @yundai424 @qingquansong
- HuggingFace Gradient Accumulation Fixes: We have fixed the notorious HuggingFace gradient accumulation issue (huggingface/transformers#34191) by carefully adjusting the cross entropy scalar. You can now safely use v0.4.0 with the latest HuggingFace gradient accumulation fixes (transformers>=4.46.2)!
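
To illustrate why the loss scaling matters under gradient accumulation, here is a minimal arithmetic sketch in plain Python. It is not Liger Kernel or HuggingFace code: the per-token losses, the micro-batch split, and the function names `mean_of_means` and `token_normalized` are all hypothetical. It assumes the issue is the commonly described one, where averaging each micro-batch's mean loss differs from a single mean over all tokens when micro-batches contain different numbers of tokens.

```python
def mean_of_means(micro_batches):
    """Average each micro-batch's mean loss (the naive accumulation scheme)."""
    per_batch = [sum(batch) / len(batch) for batch in micro_batches]
    return sum(per_batch) / len(per_batch)


def token_normalized(micro_batches):
    """Sum every token loss, then divide once by the total token count."""
    total_loss = sum(sum(batch) for batch in micro_batches)
    total_tokens = sum(len(batch) for batch in micro_batches)
    return total_loss / total_tokens


# Hypothetical per-token losses: the two micro-batches hold 1 and 3 tokens.
micro_batches = [[4.0], [1.0, 1.0, 2.0]]

# What a single large batch containing all four tokens would report.
full_batch = [loss for batch in micro_batches for loss in batch]
reference = sum(full_batch) / len(full_batch)

naive = mean_of_means(micro_batches)
fixed = token_normalized(micro_batches)

print("naive:", naive)
print("token-normalized:", fixed)
print("full-batch reference:", reference)
```

In this toy setup the token-normalized value matches the full-batch reference, while the mean of means overweights the short micro-batch.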
What's Changed
- Acknowledgement in NOTICE file by @momochen in #287
- Add JSD kernel by @Tcc0403 in #264
- Cancel in-progress but out-of-date GPU actions by @tyler-romero in #289
- Fix assert_verbose_allclose bugs by @Tcc0403 in #261
- fix qwen2-vl: create correct rope position_ids when position_ids is None by @Sanster in #276
- Add missing Qwen2-VL monkey patch test by @tyler-romero in #283
- FIX: tl.program_id() does indeed not have a cast method in triton2.3.1 by @wizyoung in #274
- RMSNorm aggregation by @Tcc0403 in #255
- FEAT Adding experimental feature : Triton mm int8xint2 by @MekkCyber in #195
- Add beta support for jsd by @Tcc0403 in #290
- chore: update cross_entropy.py by @eltociear in #293
- Apache and MIT license reference by @momochen in #294
- Monkeypatch for Llama 3.2-Vision by @tyler-romero in #282
- Add FusedLinearJSD by @Tcc0403 in #300
- Move `logits.float()` call by @ringohoffman in #308
- Added contributors and back to top by @barbarian360 in #304
- Add ignore_index and label to jsd and fl-jsd by @Tcc0403 in #306
- Monkey patch layer norm in mllama by @shivam15s in #302
- Introducing Liger Kernel Guru on Gurubase.io by @kursataktas in #316
- Update citation and add tech report by @ByronHsu in #317
- fix FLCE AMP issue by @yundai424 in #318
- fix fused JSD with ignore index by @yundai424 in #330
- Add missing ignore_index tests by @Tcc0403 in #310
- docs(CONTRIBUTING): fix typo by @novanish in #331
- Fix huggingface GA issue for llama by @ByronHsu in #333
- Fix incorrect training of first and last Medusa heads by @chiwanpark in #325
- Fix FusedLinearJSD precision issue when using AMP by @yundai424 in #336
- Fix llama forward patch by @hiyouga in #339
- [AMD] [ROCm] Pick `num_warps` based on platform by @tjtanaa in #326
- set up modal ci by @ByronHsu in #344
- avoid duplicate ci by @ByronHsu in #345
- Aggressively trim unit test bloat by @ByronHsu in #346
- Trim conv test by @ByronHsu in #348
- merge two tests into one by @ByronHsu in #349
- broadcast grad acc fix to all models by @ByronHsu in #354
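
For readers unfamiliar with the quantity behind the JSD entries above (#264 and the beta support in #290), the following is a small reference sketch of the textbook generalized Jensen-Shannon divergence in plain Python. It is not the Triton kernel and does not call Liger Kernel. The function names `kl_div` and `generalized_jsd` are hypothetical, and whether the kernel uses exactly this parameterization of `beta`, or which argument plays the teacher or student role, is an assumption not confirmed by these notes.

```python
import math


def kl_div(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def generalized_jsd(p, q, beta=0.5):
    """beta * KL(p || m) + (1 - beta) * KL(q || m), with m = beta*p + (1 - beta)*q."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must be strictly between 0 and 1")
    # The mixture m is positive wherever p or q is positive, so kl_div never divides by zero.
    m = [beta * pi + (1.0 - beta) * qi for pi, qi in zip(p, q)]
    return beta * kl_div(p, m) + (1.0 - beta) * kl_div(q, m)


# Two hypothetical distributions over a three-token vocabulary.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]

print("JSD(p, q):", generalized_jsd(p, q))
print("JSD(p, p):", generalized_jsd(p, p))
```

With `beta=0.5` this reduces to the standard symmetric JSD, which is zero for identical distributions and bounded above by ln 2.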
New Contributors
- @Sanster made their first contribution in #276
- @MekkCyber made their first contribution in #195
- @ringohoffman made their first contribution in #308
- @barbarian360 made their first contribution in #304
- @kursataktas made their first contribution in #316
- @novanish made their first contribution in #331
- @hiyouga made their first contribution in #339
- @tjtanaa made their first contribution in #326
Full Changelog: v0.3.1...v0.4.0