Release v0.4.1
What's Changed
- fix: fix the failed sampling unittest on 5090 by @yzh119 in #1886
- Updated to latest docker tag by @nvmbreughe in #1889
- Fix: Prevent race condition in cubin loader when file is being consumed by @yzh119 in #1852
- Improve graph caching of cudnn graph by @Anerudhan in #1887
- misc: Various Updates to Attention Microbenchmark Suite by @bkryu in #1891
- docs: Fix installation instructions for CUDA-specific package URLs by @yzh119 in #1893
- docker image improvements by @nvmbreughe in #1890
- tests: Add batch size 1 cases to test_trtllm_gen_attention.py that fail, marked xfail by @bkryu in #1897
- Ensure docker installs the torch version we need by @nvmbreughe in #1901
- bugfix: exclude `tests/utils/test_load_cubin_compile_race_condition.py` from pytest by @yzh119 in #1907
- ci: use self-hosted runner for building docker containers by @yzh119 in #1908
- feat: Add FP4 TRTLLM-Gen throughput MOE batched gemms by @jiahanc in #1882
- Update Docker CI tags to 20251010-8d072e6 by @github-actions[bot] in #1915
- ci/cd: consolidate release workflow by @yzh119 in #1910
- bugfix: fix cli error when cuda toolkit is not installed by @yzh119 in #1905
- feat: trtllm-gen global scaled FP8 GEMMs by @hypdeb in #1829
- feat: enable fp8 blockscale moe for fused cutlass for sm90 by @djmmoss in #1819
- use `ffi::TensorView` instead of `ffi::Tensor` by @cyx-6 in #1844
- Minor updates to cubin_loader.py download_file to avoid race condition on temporary file by @nvjullin in #1918
- chore: make cache directory flashinfer-version specific by @yzh119 in #1920
- misc: checksum check when downloading artifacts by @jimmyzho in #1761
- release: bump version v0.4.1 by @yzh119 in #1921
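Several of the changes above (#1852, #1918) address a race condition when concurrent processes download the same cubin file. A standard way to avoid exposing partially written files, shown here as a minimal sketch rather than FlashInfer's actual cubin_loader implementation (the `fetch` callable and function name are illustrative), is to write to a unique temporary file in the destination directory and atomically rename it into place:

```python
import os
import tempfile

def download_file_atomic(fetch, dest_path):
    """Write fetched bytes to dest_path without exposing partial files.

    `fetch` is a hypothetical callable returning the file contents as
    bytes. Readers either see the old file or the complete new one,
    because os.replace() is an atomic rename.
    """
    dest_dir = os.path.dirname(dest_path) or "."
    os.makedirs(dest_dir, exist_ok=True)
    # Create the temp file in the same directory so the rename stays on
    # one filesystem (cross-device renames are not atomic).
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(fetch())
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, dest_path)  # atomic swap into place
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file
        raise
```

Because the temporary file has a unique name per process, concurrent downloaders never write to the same file, and the last `os.replace` wins with a complete copy.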
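The artifact checksum check added in #1761 follows a common pattern: hash the downloaded file and compare against a known digest before using it. A minimal sketch, assuming a SHA-256 digest and an illustrative function name (not FlashInfer's actual API):

```python
import hashlib

def verify_sha256(path, expected_hex):
    """Return True if the file at `path` hashes to expected_hex.

    Reads in 1 MiB chunks so large artifacts do not need to fit
    in memory.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_hex
```

A caller would delete and re-download the artifact when this returns False, protecting against truncated or corrupted downloads.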
Full Changelog: v0.4.0...v0.4.1