v0.6.0_rocm
Pre-release
Released by github-actions on 05 Sep 17:10 · 813 commits to main since this release
What's Changed
- Features integration without fp8 by @gshtras in #7
- Layernorm optimizations by @mawong-amd in #8
- Bringing in the latest commits from upstream by @mawong-amd in #9
- Bump Docker to ROCm 6.1, add gradlib for tuned gemm, include RCCL fixes by @mawong-amd in #12
- add mi300 fused_moe tuned configs by @divakar-amd in #13
- Correctly calculating the same value for the required number of cache blocks for all torchrun processes by @gshtras in #15
- [ROCm] adding a missing triton autotune config by @hongxiayang in #17
- make the vllm setup mode configurable and make install mode as defaul… by @hongxiayang in #18
- enable fused topK_softmax kernel for hip by @divakar-amd in #14
- Fix ambiguous fma call by @cjatin in #16
- Rccl dockerfile updates by @mawong-amd in #19
- Dockerfile improvements: multistage by @mawong-amd in #20
- Integrate PagedAttention Optimization custom kernel into vLLM by @lcskrishna in #22
- Updates to custom PagedAttention for supporting context len up to 32k. by @lcskrishna in #25
- Update max_context_len for custom paged attention. by @lcskrishna in #26
- Update RCCL, hipBLASLt, base image in Dockerfile.rocm by @shajrawi in #24
- Adding fp8 gemm computation by @charlifu in #29
- fix the model loading fp8 by @charlifu in #30
- Update linear.py by @gshtras in #32
- Update base docker image with Pytorch 2.3 by @charlifu in #35
- Removed HIP specific matvec logic that is duplicated from tuned_gemm.py and doesn't support bf16 by @gshtras in #23
- Use inp_view for out = F.linear() in TunedGemm by @charlifu in #36
- Fix the symbol not found issue of the new base image by @charlifu in #37
- G42 bias triton fix rocm main by @gshtras in #38
- Update ROCm vLLM to 0.4.3 by @mawong-amd in #40
- Re-applying G42 bias triton fix on 0.4.3 by @gshtras in #41
- Fix RCCL install, linear.py logic, CMake custom extension, update requirement for FP8 compute by @mawong-amd in #42
- Linting main in line with upstream requirements by @mawong-amd in #43
- Include benchmark scripts in container by @mawong-amd in #45
- Adding fp8 to gradlib by @charlifu in #44
- Update fp8_gemm_tuner.py: swap the import order of torch and hipbsolidxgemm by @liligwu in #46
- Supporting quantized weights from Quark by default. by @charlifu in #47
- Update quark quantizer command in fp8 instruction by @charlifu in #49
- Fix LLMM1 kernel by @fxmarty in #28
- Use scaled mm for untuned fp8 gemm by @charlifu in #50
- tuned moe configs v2 by @divakar-amd in #33
- Revert "Tune fused_moe_kernel for TP 1,2,4,8 and bf16 and fp16, updated moe kern…" by @hthangirala in #51
- Revert "Revert "Tune fused_moe_kernel for TP 1,2,4,8 and bf16 and fp16, updated moe kern…"" by @divakar-amd in #53
- fix init files by @divakar-amd in #52
- adds wvSpltK optimization for skinny gemm. by @amd-hhashemi in #54
- Fix 8K decode latency jump issue. by @lcskrishna in #55
- Adding quantization_weights_path for fp8 weights by @charlifu in #57
- Refactor custom gemm heuristics by @gshtras in #56
- wvSpltK fix for 10GB+ output tensors by @amd-hhashemi in #61
- uint64_t instead of unsigned long for clarity by @mawong-amd in #62
- fix for oob LDS fill in wvSpltK slm version by @amd-hhashemi in #63
- [Kernel] Enable custom AR on ROCm by @wenkaidu in #27
- Fix the Runtime Error When Loading kv cache scales by @charlifu in #65
- Fix numpy and XGMI 1-hop detection by @mawong-amd in #67
- Fix XGMI linting by @mawong-amd in #68
- Merging fp8_gemm_tuner.py to gemm_tuner.py by @charlifu in #66
- Workaround for SWDEV-470361 by @gshtras in #69
- [1/2] Fix up ROCm 6.2 tests correctly in main by @mawong-amd in #72
- [2/2] Using xfail instead of skip for ROCm 6.2 tests by @mawong-amd in #70
- Dockerfile updates: base image, preemptive uninstalls; restore ROCm 6.2 metrics test by @mawong-amd in #73
- Return int64 dtype for solidx in tuning results by @charlifu in #74
- [Build/CI] tests for rocm/vllm:main as of 2024-06-28 by @Alexei-V-Ivanov-AMD in #77
- Fix gradlib fp8 output by @charlifu in #76
- Allocate workspace for hipblaslt fp8 gemm. by @charlifu in #78
- Mixtral moe tuning for mi308 by @divakar-amd in #80
- Remove elementwise kernel before each fp8 gemm by @charlifu in #81
- Charlifu/avoid tensor creation before each gemm by @HaiShaw in #82
- TP=1 moe tuning for mixtral-8x7B by @divakar-amd in #84
- Mixtral-8x22B tuning mi308x by @divakar-amd in #85
- moe tuning for larger input lens by @divakar-amd in #86
- Reduce csv writes by @charlifu in #92
- fix the type error due to the misuse of the logging module by @liligwu in #105
- Update Dockerfile.rocm by @shajrawi in #107
- Greg/fast server by @gshtras in #106
- converts wvSpltK reduce to pure dpp for further perf uplift. by @amd-hhashemi in #64
- Revert "Fix 8K decode latency jump issue." by @mawong-amd in #108
- adding a simple model invocation involving fp8 calculation/storage by @Alexei-V-Ivanov-AMD in #109
- Adding bf16 output dtype for fp8 gemm by @charlifu in #111
- Running server and LLM in different processes by @gshtras in #110
- Fixed single GPU issue without setting up mp. Added toggles for server request batching parameters by @gshtras in #114
- Add distributed executor backend to benchmark scripts by @mawong-amd in #118
- Add weight padding for moe by @charlifu in #119
- [BugFix] Fix navi build after many custom for MI kernels added by @maleksan85 in #116
- add empty_cache() after each padding by @charlifu in #120
- [FIX] Gradlib OOM on Navi and sometimes on MI by @maleksan85 in #124
- Save shape when fp8 solution not found by @charlifu in #123
- Fix unit test for moe by adding padding by @charlifu in #128
- Llama3.1 by @gshtras in #129
- chat/completions endpoint by @gshtras in #121
- Optimize custom all reduce by @iotamudelta in #130
- Add BF16 support to custom PA by @sanyalington in #133
- Making the output-match check use the original dtypes, which saves some memory by @maleksan85 in #135
- Make CAR ROCm 6.1 compatible. by @iotamudelta in #137
- Car revert by @gshtras in #140
- Using the correct datatypes for streaming non-chat completions by @gshtras in #134
- Adding UNREACHABLE_CODE macro for non MI300 and MI250 cards by @maleksan85 in #138
- [FIX] gfx90a typo fix by @maleksan85 in #142
- wvsplitk templatized and better tuned for MI300 by @amd-hhashemi in #132
- [Bugfix] Dockerfile.rocm by @zstreet87 in #141
- Update test-template.j2 by @okakarpa in #145
- Adding Triton implementations awq_dequantize and awq_gemm to ROCm by @rasmith in #136
- Adding fp8 padding by @charlifu in #144
- [Int4-AWQ] Torch Int-4 AWQ Dequantization and Configuration Options by @hegemanjw4amd in #146
- buildkit requirement for building docker images by @hongxiayang in #149
- cupy build fix for SWDEV-475036 by @hongxiayang in #147
- fix outdated env for turning off triton flash attention by @hongxiayang in #151
- Nccl env for performance by @hongxiayang in #152
- Render experiments by @okakarpa in #159
- Workaround PyTorch IPC handle issue by @wenkaidu in #161
- rocm6.3 fix for docker build and debug option for gpu code by @maleksan85 in #157
- Miscellaneous cosmetic changes by @mawong-amd in #166
- V5.5 upstream merge rc by @gshtras in #167
- fnuz support for fbgemm fp8 by @gshtras in #169
- Fixing mypy after a rushed merge by @gshtras in #171
New Contributors
- @gshtras made their first contribution in #7
- @hongxiayang made their first contribution in #17
- @cjatin made their first contribution in #16
- @lcskrishna made their first contribution in #22
- @shajrawi made their first contribution in #24
- @liligwu made their first contribution in #46
- @fxmarty made their first contribution in #28
- @hthangirala made their first contribution in #51
- @amd-hhashemi made their first contribution in #54
- @wenkaidu made their first contribution in #27
- @HaiShaw made their first contribution in #82
- @maleksan85 made their first contribution in #116
- @iotamudelta made their first contribution in #130
- @zstreet87 made their first contribution in #141
- @okakarpa made their first contribution in #145
- @rasmith made their first contribution in #136
- @hegemanjw4amd made their first contribution in #146
Full Changelog: v0.6.0...v0.6.0_rocm