forked from NVIDIA/cutlass
Add U8 copy operation for K16 MMA #374
Open: aacostadiaz wants to merge 28 commits into codeplaysoftware:sycl-develop from aacostadiaz:aacosta/packed-copy
+288 −126

28 commits:
a6c8e53 spirv APIs (jiyang1011)
73bef6e mma spirv api (jiyang1011)
6e12cb6 Merge branch 'sycl-develop' into jiyang/spirv_api (jiyang1011)
626fd13 Merge branch 'sycl-develop' into jiyang/spirv_api (jiyang1011)
cf6a41b Merge branch 'sycl-develop' into jiyang/spirv_api (jiyang1011)
d9f8303 remove -1 from OCL API (jiyang1011)
c1cddb6 Merge branch 'sycl-develop' into jiyang/spirv_api (aacostadiaz)
5537fd7 rebase (aacostadiaz)
c89a875 Disable spirv functions for PVC (aacostadiaz)
5e26dd3 move spirv definitions (aacostadiaz)
8c67947 fix (aacostadiaz)
1af7011 Merge branch 'sycl-develop' into jiyang/spirv_api (aacostadiaz)
879eb35 Refactor (aacostadiaz)
9864ab2 Fix cmake (aacostadiaz)
39e549d Re-enable test (aacostadiaz)
d6c9358 Fix mma builtin (aacostadiaz)
ec9d0a7 Fix copy builtin (aacostadiaz)
7144422 Revert minor changes (aacostadiaz)
3d30536 Merge branch 'sycl-develop' into jiyang/spirv_api (aacostadiaz)
4bbaaa6 Use builtin for prefetch (aacostadiaz)
304de17 Remove FP16 MMA with FP16 accumulator (aacostadiaz)
a2c45b1 Add U8 copy operation for K16 MMA (aacostadiaz)
1e2595a Merge remote-tracking branch 'codeplay/sycl-develop' into aacosta/pac… (aacostadiaz)
b962239 fix merge conflict (aacostadiaz)
d8e855e Revert changes in the tests (aacostadiaz)
d0e2c94 Update GEMM FP8 example (aacostadiaz)
d346207 Merge branch 'sycl-develop' into aacosta/packed-copy (aacostadiaz)
ba60f3a Merge branch 'sycl-develop' into aacosta/packed-copy (joeatodd)
Hi @aacostadiaz,
Could you help resolve a couple of questions?
The `DstLayout` in the atom traits for this copy atom is `Layout<Shape<_16, Shape<_8, _2, _32>>, Stride<_16, Stride<_1, _128, _256>>>`, which seems to correspond to a plain layout. Does this mean that, when the data is first copied from global memory, it is transformed into VNNI layout before being written to the registers, and is later converted to `DstLayout`? If so, could you point out where and how this is handled in the code?
Also, I don't see any `shfl`-based instructions in the generated assembly dump. Is it possible that the shuffle (for the VNNI -> plain layout conversion) happens not directly via `lane registers -> lane registers` (I understand this isn't possible on Nvidia GPUs, but it seems to be possible on Intel GPUs, based on the documentation) but via `lane registers -> shared local memory -> lane registers`?
Thanks!
cc @pengzhao-intel @yuankuns
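To make the quoted `DstLayout` concrete: a CuTe layout is a function from coordinates to offsets, computed as the inner product of the coordinate with the strides. The sketch below evaluates this particular layout by hand in plain C++, without CuTe. It assumes, as in CuTe, that mode 0 is the thread index and mode 1 the value index, and that a flat value index is split with the leftmost sub-mode varying fastest. The thread does not state the unit of the offsets (upstream CuTe copy traits map (thread, value) to bit offsets, but that is not confirmed for this atom), so they are treated as abstract offsets. `dst_offset` and the constants are hypothetical names.

```cpp
#include <cassert>

// Standalone model (not CuTe) of the DstLayout quoted above:
//   Layout<Shape <_16, Shape <_8, _2, _32>>,
//          Stride<_16, Stride<_1, _128, _256>>>
// Mode 0 indexes the thread (work-item); mode 1 indexes the values it holds.
constexpr int kThreads = 16;
constexpr int kThreadStride = 16;
constexpr int kValShape[3] = {8, 2, 32};
constexpr int kValStride[3] = {1, 128, 256};
constexpr int kValues = 8 * 2 * 32;  // 512 values per thread

// Evaluate the layout at (thread, value): split the flat value index into
// (v0, v1, v2) with the first sub-mode varying fastest (CuTe's
// colexicographic order), then take the inner product with the strides.
int dst_offset(int thread, int value) {
  assert(0 <= thread && thread < kThreads);
  assert(0 <= value && value < kValues);
  int offset = thread * kThreadStride;
  int rest = value;
  for (int m = 0; m < 3; ++m) {
    offset += (rest % kValShape[m]) * kValStride[m];
    rest /= kValShape[m];
  }
  return offset;
}
```

For example, `dst_offset(2, 9)` splits value 9 into (1, 1, 0) and returns 2*16 + 1*1 + 1*128 = 161. Evaluating the layout like this only shows which offset each (thread, value) pair is described as holding; it says nothing about how the hardware moves the data.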
The copy trait describes how a copy operation works so that the rest of the code can reason about it. It does not change how the actual copy is performed.
In this case, for the VNNI copies, the transformation happens inside the builtin/SPIR-V function. CUTLASS does not perform any transformation of its own: it simply calls these builtin/SPIR-V functions, and the copy traits describe what those functions do.
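The reply places the VNNI transformation inside the builtin/SPIR-V function rather than in CUTLASS. As a point of reference for what such a transformation does to the data, here is a host-side model of VNNI-style packing for 8-bit elements and its inverse (the VNNI -> plain conversion the question asks about). It is not the builtin's implementation and not a CUTLASS or SYCL API: `vnni_pack_u8` and `vnni_unpack_u8` are hypothetical, and the packing factor of 4 (four consecutive K elements sharing one 32-bit group for 8-bit data) is an assumption about the VNNI format.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed VNNI packing factor for 8-bit data: 32 bits / 8 bits per element.
constexpr int kPack = 4;

// Hypothetical reference: `src` is a row-major K x N matrix. Every kPack
// consecutive rows along K are interleaved so that, for each column n, the
// elements (k..k+3, n) become adjacent. Result is row-major (K/4) x (N*4).
std::vector<std::uint8_t> vnni_pack_u8(const std::vector<std::uint8_t>& src,
                                       int K, int N) {
  assert(K % kPack == 0);
  assert(static_cast<int>(src.size()) == K * N);
  std::vector<std::uint8_t> dst(src.size());
  for (int k = 0; k < K; ++k)
    for (int n = 0; n < N; ++n)
      dst[(k / kPack) * (N * kPack) + n * kPack + (k % kPack)] = src[k * N + n];
  return dst;
}

// Inverse of vnni_pack_u8: restores the plain row-major K x N order.
std::vector<std::uint8_t> vnni_unpack_u8(const std::vector<std::uint8_t>& packed,
                                         int K, int N) {
  assert(K % kPack == 0);
  assert(static_cast<int>(packed.size()) == K * N);
  std::vector<std::uint8_t> dst(packed.size());
  for (int k = 0; k < K; ++k)
    for (int n = 0; n < N; ++n)
      dst[k * N + n] = packed[(k / kPack) * (N * kPack) + n * kPack + (k % kPack)];
  return dst;
}
```

For a 4 x 2 row-major input {0, 1, 2, 3, 4, 5, 6, 7}, `vnni_pack_u8(src, 4, 2)` yields {0, 2, 4, 6, 1, 3, 5, 7}, so each column's four K values become adjacent, and `vnni_unpack_u8` restores the original order.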