v0.4.0
What's Changed
- Feature integration without fp8 by @gshtras in #7
- Layernorm optimizations by @mawong-amd in #8
- Bringing in the latest commits from upstream by @mawong-amd in #9
- Bump Docker to ROCm 6.1, add gradlib for tuned gemm, include RCCL fixes by @mawong-amd in #12
- Add MI300 fused_moe tuned configs by @divakar-amd in #13
- Correctly calculating the same required number of cache blocks for all torchrun processes by @gshtras in #15
- [ROCm] adding a missing triton autotune config by @hongxiayang in #17
- make the vllm setup mode configurable and make install mode as default… by @hongxiayang in #18
- Enable fused topK_softmax kernel for HIP by @divakar-amd in #14
- Fix ambiguous fma call by @cjatin in #16
- RCCL Dockerfile updates by @mawong-amd in #19
- Dockerfile improvements: multistage by @mawong-amd in #20
- Integrate PagedAttention Optimization custom kernel into vLLM by @lcskrishna in #22
- Updates to custom PagedAttention to support context lengths up to 32k by @lcskrishna in #25
- Update max_context_len for custom paged attention by @lcskrishna in #26
- Update RCCL, hipBLASLt, base image in Dockerfile.rocm by @shajrawi in #24
- Adding fp8 GEMM computation by @charlifu in #29
- Fix fp8 model loading by @charlifu in #30
- Update linear.py by @gshtras in #32
- Update base Docker image with PyTorch 2.3 by @charlifu in #35
New Contributors
- @divakar-amd made their first contribution in #13
- @hongxiayang made their first contribution in #17
- @cjatin made their first contribution in #16
- @lcskrishna made their first contribution in #22
- @shajrawi made their first contribution in #24
Full Changelog: v0.3.3...v0.4.0