[common] Add max_t support for KV in THD #1370

cyanguwa · 2024-12-12T11:43:32Z

Description

This PR adds max_t support for the KV tensors when qkv_format=thd, which reduces memory usage for MQA/GQA with THD inputs.
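As a rough illustration of where the saving could come from, here is a minimal sketch. It assumes that max_t denotes an upper bound on the total number of packed tokens across the batch, which this description does not spell out. It compares the K/V element count when sized by that packed total against sizing by batch * max_seqlen. The sequence lengths, head count, head dimension, and the helper kv_elements are all hypothetical and are not taken from the PR.

```python
# Sketch only: compares K/V buffer sizes for a padded layout versus a
# packed (THD-style) layout. All numbers and names are illustrative.

def kv_elements(tokens: int, num_kv_heads: int, head_dim: int) -> int:
    """Return the number of elements in K plus V for a given token count."""
    return 2 * tokens * num_kv_heads * head_dim

# Hypothetical variable-length KV sequences in one batch.
seqlens_kv = [128, 512, 64, 1024]
num_kv_heads, head_dim = 8, 128

batch = len(seqlens_kv)
max_seqlen_kv = max(seqlens_kv)

# Padded sizing: every sequence is treated as if it had max_seqlen_kv tokens.
padded_tokens = batch * max_seqlen_kv

# Packed sizing: only the real tokens are counted (assumed meaning of max_t).
packed_tokens = sum(seqlens_kv)

padded = kv_elements(padded_tokens, num_kv_heads, head_dim)
packed = kv_elements(packed_tokens, num_kv_heads, head_dim)

print(f"padded tokens: {padded_tokens}, K+V elements: {padded}")
print(f"packed tokens: {packed_tokens}, K+V elements: {packed}")
print(f"packed / padded ratio: {packed / padded:.3f}")
```

In this made-up batch the packed layout needs well under half the elements of the padded one; the actual saving in the library depends on which buffers are sized by max_t, which is not stated here.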

Documentation change (change only to the documentation, either a fix or new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactor

Please list the changes introduced in this PR:

Signed-off-by: Charlene Yang <[email protected]>

cyanguwa · 2024-12-12T11:44:15Z

/te-ci pytorch L0

add max_t for KV

ae8c960

Signed-off-by: Charlene Yang <[email protected]>

cyanguwa requested a review from zlsh80826 December 13, 2024 03:46

cyanguwa added the 1.14.0 label Dec 13, 2024

zlsh80826 approved these changes Dec 16, 2024

cyanguwa merged commit f4f35c2 into NVIDIA:main Dec 17, 2024
29 checks passed