Releases: linkedin/Liger-Kernel
v0.5.5: Chunk size fixes for JSD; KTO speed fixes; better metrics tests
What's Changed
- Infer correct device for AMD HIP device by @helloworld1 in #587
- add out of bounds check to cross entropy by @shivam15s in #588
- Monkeypatch for Qwen2.5-VL by @BenasdTW in #552
- KTO changes to return aux outputs by @vaibhavjindal in #589
- [KTO] Only return summed metrics by @vaibhavjindal in #591
- increase chunk size for distillation and add bias to jsd by @shivam15s in #590
- [CI] Add ROCm 6.3 CI by @tjtanaa in #506
- Fix KTO speed issue by @vaibhavjindal in #592
- Compare means of aggregated outputs in KTO tests by @vaibhavjindal in #595
- Fix means of logps and rewards by @vaibhavjindal in #597
- Add chunk_size param to chunked losses by @RichhLi in #599
- Fix DPO/ORPO typo in readme by @tyler-romero in #602
- version bump by @shivam15s in #605
New Contributors
Full Changelog: v0.5.4...v0.5.5
v0.5.4: Granite 3.0 & 3.1, OLMo2, GRPO, TVD loss, and minor fixes
What's Changed
- add GitHub CI for Intel GPU by @faaany in #536
- Add Intel GPU CI to README.md by @hebiao064 in #562
- test split to 16, 32 by @jp1924 in #564
- Clean up workaround introduced in PR #564 by @austin362667 in #566
- Update README.md by @momochen in #567
- Grpo loss by @kashif in #553
- Update Readme with ROCM installation instruction by @zcnrex in #570
- fix qwen2vl and mllama test to pass failing tests by @shivam15s in #571
- KTO: Minor fix and documentation update by @vaibhavjindal in #574
- Add TVD Loss Kernel by @saurabhkoshatwar in #324
- Add KTO Benchmark Data into README by @hebiao064 in #575
- Support Granite 3.0 and 3.1 models by @JamesKunstle in #558
- Improve Hugging Face SFT Script by @ParagEkbote in #539
- Add unit tests for shared prefix masked attention with `torch.FlexAttention` by @austin362667 in #504
- update project readme to include Granite support by @JamesKunstle in #576
- Revert "Improve Hugging Face SFT Script (#539)" and Fix TVD Test for Intel #580 by @shivam15s in #578
- Fix Rope Test by @hebiao064 in #577
- Fix layer norm kernels by @lancerts in #582
- Add OLMO2 model support by @yundai424 in #581
- bump version to 0.5.4 by @yundai424 in #585
New Contributors
- @jp1924 made their first contribution in #564
- @zcnrex made their first contribution in #570
- @vaibhavjindal made their first contribution in #574
- @saurabhkoshatwar made their first contribution in #324
- @JamesKunstle made their first contribution in #558
Full Changelog: v0.5.3...v0.5.4
v0.5.3: Minor fixes for post-training losses and support for KTO Loss
What's Changed
- Add ref_input parameter to support separate inputs for reference model by @xingyaoww in #467
- Revert "Add ref_input parameter to support separate inputs for reference model" by @ByronHsu in #469
- Add dynamic dependency management for CUDA and ROCm by @hebiao064 in #460
- [CI] runtime pip install using uv by @ByronHsu in #471
- modify ref_input in chunked_loss base class and fix tests by @shivam15s in #470
- Add more post training in readme by @ByronHsu in #472
- align post training loss at the center by @ByronHsu in #473
- [Transformer] fix ORPO loss for MOE models by @kashif in #479
- fix: correct typos in docstrings by @shivam15s in #482
- fix chosen_nll_loss in chunked losses by @kashif in #486
- Revert "fix chosen_nll_loss in chunked losses (#486)" by @shivam15s in #489
- fix dpo tests: reduce tolerance and change default compute_nll_loss false by @shivam15s in #490
- CPO & SimPO add label_smoothing by @Mecoli1219 in #493
- Fix Preference Loss and Refactor for Readability by @austin362667 in #484
- annotate tl constexpr values by @winglian in #497
- Fix Rope Compatibility with Cos/Sin Position Embedding for Batch Size > 1 by @wizyoung in #477
- Move the checkstyle to Ruff by @shivam15s in #483
- Fix/liger fused linear cross entropy function does not support reduction=none by @ryankert01 in #496
- Fix Dtype Mismatch in torch.addmm within ops/fused_linear_cross_entropy.py in AMP training. by @DandinPower in #502
- Add weight support for LigerCrossEntropy by @Tcc0403 in #420
- Refactor Temperature Scaling in Distillation Loss by @austin362667 in #444
- Fix All `chunked_loss` Benchmark Scripts by @austin362667 in #438
- Set z_loss_1d=None when return_z_loss=False in cross_entropy_loss to avoid tl.store fail when triton_interpret=1 (for tl.device_print etc.) by @wa008 in #508
- Add `aux_outputs` for CPO and SimPO by @Mecoli1219 in #492
- Add `average_log_prob` args for cpo by @Mecoli1219 in #510
- Refactor CrossEntropy and FusedLinearCrossEntropy by @Tcc0403 in #511
- [ORPO] add nll_target for orpo nll loss by @kashif in #503
- Format Benchmark Scripts with Ruff by @austin362667 in #516
- [Tiny] Add QVQ to readme by @tyler-romero in #522
- Add argument `return_z_loss` to flce by @Tcc0403 in #530
- Remove extra print by @apaz-cli in #531
- Fix HF `transformers` Breaking Changes by @austin362667 in #526
- Handle cache_position for transformers 4.47.0 and later (#528) by @BenasdTW in #529
- Create Docs for Liger-Kernel by @ParagEkbote in #485
- Add Mkdocs related dependencies to setup.py by @hebiao064 in #534
- Add KTO Loss by @hebiao064 in #475
- [tests] use a valid hexadecimal string instead of a placeholder by @faaany in #535
- [tests] skip failed tests for xpu by @faaany in #498
- Format files by @austin362667 in #541
- Fix Broken Links by @ParagEkbote in #547
- [Fix] Fix the type hint of `test_utils::concatenated_forward` by @hongpeng-guo in #549
- Add JSD Loss for Distillation by @austin362667 in #425
- [DPO] add reference log-prob outputs in DPO by @kashif in #521
- Fix DPO unit test fail and refactor by @Tcc0403 in #554
New Contributors
- @xingyaoww made their first contribution in #467
- @kashif made their first contribution in #479
- @Mecoli1219 made their first contribution in #493
- @winglian made their first contribution in #497
- @DandinPower made their first contribution in #502
- @wa008 made their first contribution in #508
- @apaz-cli made their first contribution in #531
- @BenasdTW made their first contribution in #529
- @ParagEkbote made their first contribution in #485
Full Changelog: v0.5.2...v0.5.3
v0.5.2: Fix Qwen2VL mrope for transformers>=4.47
What's Changed
- Disable Qwen2 VL test for with logits conv test by @ByronHsu in #463
- Fix Qwen2VL mrope for transformers 4.47.0 by @li-plus in #464
- Revert Workaround of Disabling QWEN2_VL in Convergence Tests by @austin362667 in #466
Full Changelog: v0.5.1...v0.5.2
v0.5.1: Patch Fix Import Error
What's Changed
Full Changelog: v0.5.0...v0.5.1
v0.5.0: First open source optimized Post Training Loss, AMD CI, XPU Support
Highlights
- Post Training Loss: Introducing the first open-source optimized post-training losses in Liger Kernel with ~80% memory reduction, featuring DPO, CPO, ORPO, SimPO, JSD, and more. No more OOM nightmares for post-training ML researchers! (A minimal usage sketch follows this list.)

- AMD CI: With AMD's generous sponsorship of MI300s, we've integrated them into our CI. Special thanks to Embedded LLM for building the AMD CI infrastructure. #428
- XPU Support: In collaboration with Intel, we now support XPU, demonstrating comparable performance gains with other vendors. #407
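
For readers who want to try the new post-training losses, here is a minimal sketch of calling one of the chunked losses directly on hidden states. The import path, class name, and argument order are assumptions based on these notes rather than the exact published API, so check the repository docs before copying it.

```python
# Hypothetical sketch of using a chunked post-training loss (ORPO here).
# The import path, class name, and call signature are assumptions, not verified API.
import torch
from liger_kernel.chunked_loss import LigerFusedLinearORPOLoss  # assumed import path

batch, seq_len, hidden_size, vocab_size = 4, 128, 4096, 32000

# The chunked losses fuse the lm_head projection with the loss computation, so
# they take the projection weight and pre-projection hidden states rather than
# materializing the full logits tensor; that is where the memory saving comes from.
lm_head_weight = torch.randn(vocab_size, hidden_size, device="cuda", requires_grad=True)
# Preference losses pair chosen and rejected sequences, concatenated on the batch dim.
hidden_states = torch.randn(batch * 2, seq_len, hidden_size, device="cuda", requires_grad=True)
targets = torch.randint(0, vocab_size, (batch * 2, seq_len), device="cuda")

orpo_loss = LigerFusedLinearORPOLoss()
outputs = orpo_loss(lm_head_weight, hidden_states, targets)  # assumed argument order; may also return aux outputs
```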
What's Changed
- Adds the CPO Alignment Loss Function by @pramodith in #382
- Qwen2-VL Training Example w/ Liger by @tyler-romero in #389
- Support Qwen2-VL's multimodal RoPE implementation by @li-plus in #384
- add xpu device support for `rms_norm` by @faaany in #379
- fix qwen2 import failure in test by @ByronHsu in #394
- Add Chunked SimPO Loss by @pramodith in #386
- Add script to reproducibly run examples on Modal by @tyler-romero in #397
- add nn.module support for chunked loss function by @shivam15s in #402
- Generalize JSD to FKL/RKL by @yundai424 in #393
- Enable keyword arguments for liger functional by @hongpeng-guo in #400
- add reference model logps to chunkedloss interface and fix dpo loss fn by @shivam15s in #405
- Optimize CE Loss by casting dtype to float32 inside kernel by @pramodith in #406
- Xpu support by @mgrabban in #407
- Fix `get_batch_loss_metrics` comments by @austin362667 in #413
- Add rebuild to CI by @ByronHsu in #415
- Fix os env by @ByronHsu in #416
- Adjust QWEN2 VL Loss `rtol` by @austin362667 in #412
- [tiny] Add QwQ to readme (same arch as Qwen2) by @tyler-romero in #424
- Enhance Cross Entropy Softcap Unit Test by @austin362667 in #423
- Add ORPO Trainer + support HF metrics directly from chunked loss functions + fixes to avoid torch compile recompilations by @shivam15s in #429
- Add Build Success/Fail Badge by @hebiao064 in #431
- Switch amd-ci to use MI300X runner. by @saienduri in #428
- [CI] rename ci and add cron job for amd by @ByronHsu in #433
- [CI] shorten ci name by @ByronHsu in #434
- update ci icon on readme by @bboyleonp666 in #440
- Introduce Knowledge Distillation Base by @austin362667 in #432
- [AMD] [CI] Clean up `amd-ci` by @tjtanaa in #436
- Add xpu in env report by @abhilash1910 in #443
- Specify scheduled CI in AMD badge by @ByronHsu in #446
- improve code quality for chunk loss by @ByronHsu in #448
- Add paper link and formula for preference loss by @ByronHsu in #449
- Make kernel doc lean by @ByronHsu in #450
- Fix LigerCrossEntropyLoss Reduction Behavior for "None" Mode by @hebiao064 in #435
- add eng blog by @ByronHsu in #452
- add chunked loss to readme by @shivam15s in #453
- change chunked readme by @shivam15s in #454
- add sponsorship and collab by @ByronHsu in #457
- version bump to 0.5.0 by @shivam15s in #455
- Add HIP (ROCm) and Liger Kernel to env report by @Comet0322 in #456
New Contributors
- @li-plus made their first contribution in #384
- @faaany made their first contribution in #379
- @hongpeng-guo made their first contribution in #400
- @mgrabban made their first contribution in #407
- @hebiao064 made their first contribution in #431
- @saienduri made their first contribution in #428
- @bboyleonp666 made their first contribution in #440
- @abhilash1910 made their first contribution in #443
- @Comet0322 made their first contribution in #456
v0.4.2: Fix 'RMSNorm' object has no attribute 'in_place'
Highlights
What's Changed
- modify readmes and create license/acknowledgement docs by @shivam15s in #377
- Add Chunked ORPO Loss by @shivam15s in #362
- Refactor `LigerFusedLinearPreferenceBase` by @pramodith in #381
- Support Chunked DPO Loss Kernel by @austin362667 in #378
- Fix flce not being patched after reverting in convergence test by @Tcc0403 in #385
- Qwen2-VL Bug / Incompatibility Fixes by @tyler-romero in #388
- Fix incomplete RMSNorm patch by @Tcc0403 in #392
Full Changelog: v0.4.1...v0.4.2
v0.4.1: Gemma 2 Support, CrossEntropy Patching Fix, and GroupNorm
Highlights
- Gemma 2 Support: The long-pending Gemma 2 is finally supported thanks to @Tcc0403! He implemented the nasty softcapping in fused linear cross entropy (#320) and discovered the convergence issue, which was later fixed by @ByronHsu and @Tcc0403 together (#376).
- CrossEntropy Patching Fix: If you use the monkey patch for `CrossEntropy` (not FLCE), it is actually not patched after transformers 4.46.1, because `CrossEntropy` was replaced with `F.cross_entropy` in the model code. We fixed the issue in PR #375. (A usage sketch follows this list.)
- GroupNorm Kernel: Our new contributor @pramodith implemented a GroupNorm kernel (#353) with a 2x speedup.
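
As a companion to the CrossEntropy patching fix above, the sketch below shows one way to opt into the patched CrossEntropy path when patching a Llama model. The keyword flags and model id are assumptions about the patching API, not verbatim from this release; treat it as illustrative.

```python
# Illustrative sketch: enabling the Liger CrossEntropy patch (rather than FLCE)
# for Llama. The keyword names below are assumptions; see the project README for
# the exact patching options.
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_llama

# Apply the monkey patch before instantiating the model so the patched
# cross-entropy path (the one fixed in #375 for transformers >= 4.46.1) is used.
apply_liger_kernel_to_llama(
    cross_entropy=True,                # assumed flag: patch the plain CrossEntropy path
    fused_linear_cross_entropy=False,  # assumed flag: keep FLCE disabled in this example
)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # illustrative model id
```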
What's Changed
- BUG: Fix bug in layer norm tests. by @pramodith in #359
- Support Z Loss in CE by @Tcc0403 in #239
- Improve compatibility to access the base models by @why-in-Shanghaitech in #340
- poke test again by @ByronHsu in #360
- Kernels for GroupNorm by @pramodith in #353
- Remove trailing newline. by @ckckjw in #364
- Fix typo in the description of FusedLinearJSD by @Tcc0403 in #366
- Updates Readme to add GroupNorm by @pramodith in #365
- Support FusedLinearCrossEntropy for Gemma2 by @Tcc0403 in #320
- Rotate modal and pypi tokens by @ByronHsu in #372
- Fix release password by @ByronHsu in #373
- Support CE after grad acc fix by @ByronHsu in #375
- Support out-of-place RMSNorm to fix gemma2 by @ByronHsu in #376
New Contributors
- @pramodith made their first contribution in #359
- @why-in-Shanghaitech made their first contribution in #340
- @ckckjw made their first contribution in #364
Full Changelog: v0.4.0...v0.4.1
v0.4.0: Full AMD support, Tech Report, Modal CI, Llama-3.2-Vision!
Highlights
- AMD GPU: We have partnered with Embedded LLM to adjust the Triton configuration to fully support AMD! With version 0.4.0, you can run multi-GPU training with 26% higher speed and 60% lower memory usage on AMD. See the full blog post at https://embeddedllm.com/blog/cuda-to-rocm-portability-case-study-liger-kernel. @Edenzzzz @DocShotgun @tjtanaa
- Technical Report: We have published a technical report on arXiv (https://arxiv.org/pdf/2410.10989) with abundant details.
- Modal CI: We have moved our entire GPU CI stack to Modal! Thanks to intelligent Docker layer caching and blazingly fast container startup and scheduling, we have reduced CI overhead by over 10x (from minutes to seconds).
- LLaMA 3.2-Vision Model: We have added kernel support for the LLaMA 3.2-Vision model. You can easily use `liger_kernel.transformers.apply_liger_kernel_to_mllama` to patch the model (see the sketch after this list). @tyler-romero @shivam15s
- JSD Kernel: We have added the JSD kernel for distillation, which also comes with a chunked version! @Tcc0403 @yundai424 @qingquansong
- HuggingFace Gradient Accumulation Fixes: We have fixed the notorious HuggingFace gradient accumulation issue (huggingface/transformers#34191) by carefully adjusting the cross entropy scalar. You can now safely use v0.4.0 with the latest HuggingFace gradient accumulation fixes (transformers>=4.46.2)!
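
Since the highlight above names `liger_kernel.transformers.apply_liger_kernel_to_mllama`, here is a minimal sketch of patching a Llama 3.2-Vision checkpoint with it; the checkpoint id and the surrounding transformers calls are illustrative assumptions, not prescribed by this release.

```python
# Minimal sketch: patching Llama 3.2-Vision (mllama) with Liger kernels.
# apply_liger_kernel_to_mllama is named in the release note; the model id and
# the rest of the setup are illustrative assumptions.
from transformers import MllamaForConditionalGeneration
from liger_kernel.transformers import apply_liger_kernel_to_mllama

# Patch first, then load, so the patched modules are picked up by the model.
apply_liger_kernel_to_mllama()
model = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct"  # illustrative checkpoint id
)
```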
What's Changed
- Acknowledgement in NOTICE file by @momochen in #287
- Add JSD kernel by @Tcc0403 in #264
- Cancel in-progress but out-of-date GPU actions by @tyler-romero in #289
- Fix assert_verbose_allclose bugs by @Tcc0403 in #261
- fix qwen2-vl: create correct rope position_ids when position_ids is None by @Sanster in #276
- Add missing Qwen2-VL monkey patch test by @tyler-romero in #283
- FIX: tl.program_id() does indeed not have a cast method in triton2.3.1 by @wizyoung in #274
- RMSNorm aggregation by @Tcc0403 in #255
- FEAT Adding experimental feature : Triton mm int8xint2 by @MekkCyber in #195
- Add beta support for jsd by @Tcc0403 in #290
- chore: update cross_entropy.py by @eltociear in #293
- Apache and MIT license reference by @momochen in #294
- Monkeypatch for Llama 3.2-Vision by @tyler-romero in #282
- Add FusedLinearJSD by @Tcc0403 in #300
- Move `logits.float()` call by @ringohoffman in #308
- Added contributors and back to top by @barbarian360 in #304
- Add ignore_index and label to jsd and fl-jsd by @Tcc0403 in #306
- Monkey patch layer norm in mllama by @shivam15s in #302
- Introducing Liger Kernel Guru on Gurubase.io by @kursataktas in #316
- Update citation and add tech report by @ByronHsu in #317
- fix FLCE AMP issue by @yundai424 in #318
- fix fused JSD with ignore index by @yundai424 in #330
- Add missing ignore_index tests by @Tcc0403 in #310
- docs(CONTRIBUTING): fix typo by @novanish in #331
- Fix huggingface GA issue for llama by @ByronHsu in #333
- Fix incorrect training of first and last Medusa heads by @chiwanpark in #325
- Fix FusedLinearJSD precision issue when using AMP by @yundai424 in #336
- Fix llama forward patch by @hiyouga in #339
- [AMD] [ROCm] Pick `num_warps` based on platform by @tjtanaa in #326
- set up modal ci by @ByronHsu in #344
- avoid duplicate ci by @ByronHsu in #345
- Aggressively trim unit test bloat by @ByronHsu in #346
- Trim conv test by @ByronHsu in #348
- merge two tests into one by @ByronHsu in #349
- broadcast grad acc fix to all models by @ByronHsu in #354
New Contributors
- @Sanster made their first contribution in #276
- @MekkCyber made their first contribution in #195
- @ringohoffman made their first contribution in #308
- @barbarian360 made their first contribution in #304
- @kursataktas made their first contribution in #316
- @novanish made their first contribution in #331
- @hiyouga made their first contribution in #339
- @tjtanaa made their first contribution in #326
Full Changelog: v0.3.1...v0.4.0
v0.3.1: Patch Release
Summary
This patch release brings important updates and fixes to Liger-Kernel. Notable changes include:
- KLDiv calculation fix: KLDiv now functions correctly with larger vocab sizes.
- SwiGLU/GeGLU casting fix: Program IDs are now cast to int64 in SwiGLU/GeGLU kernels to prevent memory errors with larger dimensions.
- AutoLigerKernelForCausalLM fix: The model now properly passes through all original keyword arguments (see the sketch below).
- Post-init model patching fix: Post-init model patching now works correctly, ensuring HF Trainer integration behaves as expected.
- Relaxed transformers dependency: Improves compatibility with a broader range of transformers versions.
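
To illustrate the `AutoLigerKernelForCausalLM` fix noted above, the sketch below passes ordinary `from_pretrained` keyword arguments through the wrapper; the specific model id and kwargs are illustrative assumptions, not prescribed by this release.

```python
# Illustrative sketch of the kwargs pass-through fix: keyword arguments given to
# AutoLigerKernelForCausalLM.from_pretrained are forwarded to the underlying
# transformers call. Model id and kwargs here are examples, not requirements.
import torch
from liger_kernel.transformers import AutoLigerKernelForCausalLM

model = AutoLigerKernelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",    # illustrative model id
    torch_dtype=torch.bfloat16,      # forwarded to the underlying transformers call
    attn_implementation="sdpa",      # forwarded as well after the fix
)
```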
What's Changed
- Remove debug print statement by @EdoardoLuciani in #247
- [Easy] Cast program_id to int64 in SwiGLU/GeGLU kernels by @hansonw in #251
- Fix a comment typo in flce by @Tcc0403 in #256
- Fix AutoLigerKernelForCausalLM to pass through original kwargs by @shimizust in #263
- Update contributing guide for adding a new model by @shivam15s in #260
- chore: Add Qwen2.5 and Phi3.5 to Readme by @tyler-romero in #265
- rename cuda mode to gpu mode by @msaroufim in #267
- Fix sharing a ResBlock layer for each head in Medusa example by @chiwanpark in #269
- Fix/kldiv by @S1ro1 in #262
- Post-init model patching fix by @shimizust in #280
- Relaxed transformers dependency by @shimizust in #270
- Disable gemma2 and qwen2_vl tests by @shimizust in #288
- Release version 0.3.1 by @shimizust in #286
New Contributors
- @EdoardoLuciani made their first contribution in #247
- @msaroufim made their first contribution in #267
Full Changelog: v0.3.0...v0.3.1