[Dtype] Low-precision Blackwell Datatype Support #18027
Open
This PR focuses on supporting FP4/FP8 data types introduced in Blackwell architectures (sm_100).
TVM's nd array stores sub-byte data types in a compact format, so two FP4 values are packed into 1 byte. The size calculation in the array allocator is modified accordingly.
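As a rough illustration of the size calculation for compactly stored sub-byte types, here is a minimal Python sketch. It is not the allocator code changed in this PR: the function name `packed_nbytes` is hypothetical, and rounding a trailing partial byte up to a whole byte is an assumption.

```python
def packed_nbytes(shape, bits):
    """Bytes needed for a tensor whose elements are `bits` wide and tightly packed.

    Hypothetical helper for illustration only; not a TVM API.
    """
    num_elements = 1
    for dim in shape:
        num_elements *= dim
    # Assumption: a trailing partial byte is rounded up to a full byte.
    return (num_elements * bits + 7) // 8


if __name__ == "__main__":
    print(packed_nbytes((4, 8), 4))  # 32 FP4 elements, two per byte
    print(packed_nbytes((3,), 4))    # odd element count, last byte half used
    print(packed_nbytes((4, 8), 8))  # 32 FP8 elements, one per byte
```

With 4-bit elements the byte count is half the element count, rounded up, which is why an allocator that assumes at least one byte per element would over-allocate.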
Subtype arithmetic
The type `__nv_fp4_e2m1` from `<cuda_fp4.h>` is a tag type and does not support pointer arithmetic. Accordingly, the compiler does not support index operations on an array declared with `__nv_fp4_e2m1` directly. If index operations such as `arr[0] + arr[1]` are desired, the user should declare the array with a vector type such as `__nv_fp4x2_e2m1`.

For example, suppose the user creates an array A of type `__nv_fp4_e2m1` with values `[-1 2 0.5 -6 -6 -2 2 3 4 1 -3 4 -2 2...]`.
Printing out the values of A[0], A[1], ... does not reproduce this sequence.
This is because `__nv_fp4_e2m1` is only a tag type. When a pointer to it is advanced, it moves 1 byte at a time, yielding the upper 4 bits in the packed memory buffer. As a result, we should avoid indexing `__nv_fp4_e2m1` directly for arithmetic operations.

If the user passes in a `__nv_fp4_e2m1` nd array and performs indexing, we can convert it to `__nv_fp4x2_e2m1` and recalculate the indices where possible, but this requires more careful handling in the lowering process.

Thus, the original corresponding test case in `test_target_codegen_cuda_fp4.py` is removed.
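The indexing problem and the index recalculation described above can be simulated on the host in plain Python. This is only a sketch, not the lowering logic of this PR. All function names are hypothetical, and the nibble order (even elements in the low 4 bits, odd elements in the upper 4 bits) is an assumption made for illustration. The E2M1 magnitude table follows the standard 2-bit exponent, 1-bit mantissa layout.

```python
# Magnitudes selected by the low 3 bits of an E2M1 code
# (2 exponent bits, 1 mantissa bit); bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def encode_e2m1(value):
    """Encode an exactly representable value as a 4-bit E2M1 code."""
    sign = 0x8 if value < 0 else 0x0
    return sign | E2M1_MAGNITUDES.index(abs(value))


def decode_e2m1(code):
    """Decode a 4-bit E2M1 code to a Python float."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def pack_fp4(values):
    """Pack two codes per byte.

    ASSUMPTION: even-indexed element in the low nibble,
    odd-indexed element in the upper nibble.
    """
    codes = [encode_e2m1(v) for v in values]
    if len(codes) % 2:
        codes.append(0)  # pad to a whole byte
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def read_bytewise(buf, i):
    """Mimic indexing that advances 1 byte per element and reads the upper 4 bits."""
    return decode_e2m1(buf[i] >> 4)


def read_packed(buf, i):
    """Recalculated index: logical element i lives in byte i // 2, nibble i % 2."""
    byte = buf[i // 2]
    code = (byte >> 4) if i % 2 else (byte & 0xF)
    return decode_e2m1(code)


values = [-1, 2, 0.5, -6, -6, -2, 2, 3, 4, 1, -3, 4, -2, 2]
buf = pack_fp4(values)

if __name__ == "__main__":
    print([read_bytewise(buf, i) for i in range(len(buf))])
    print([read_packed(buf, i) for i in range(len(values))])
```

Under this assumed layout, the byte-wise read visits only one nibble per byte and so returns every other element, while the recalculated index recovers the full sequence. The same `i // 2` and `i % 2` split is what a lowering pass would need to emit when rewriting accesses onto a two-element vector type.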