[SYCL Spec][Joint Matrix] Add a new overload for joint_matrix_apply to be able to return result into a different matrix #13153
base: sycl
Conversation
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc
@gmlueck, can you please review this?
…able to return result into a different matrix (#13151)

Currently, CUDA code that uses this pattern:

```
for (int i = 0; i < c_frag.num_elements; i++) {
  c_frag.x[i] = alpha * acc_frag.x[i] + beta * c_frag.x[i];
}
```

cannot be migrated to SYCL joint matrix. This added overload addresses this. The spec API is added in #13153.
```
joint_matrix_apply(sg, C, D, [=](const T &x, T &y) {
  y = x * alpha;
});
```
I don't see a problem with this API, but I wonder if it should be more general. For example:

- Is there a reason that one matrix must be the "source" and the other the "destination"? Is it hard to support the case where the lambda writes to elements in both matrices?
- I think it makes sense that `Rows` and `Cols` must be the same for both matrices. What about `Use` and `Layout`, though? Is there a reason these can't be different for the two matrices? What about the type `T`? Could that be different?
- Is there a reason to limit this to just two matrices? You could imagine this API taking a parameter pack, allowing an arbitrary number of matrices.
- I relaxed the definition to not require read or write assumptions. Both matrices can be either read or written into.
- I think it is safer to require the same `use` and `layout`. To change the `use` or `layout`, `joint_matrix_copy` can be used: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc#copy. `T` should be the same. This is because there are different casts that can be used depending on `T`. It is better to have the user specify the right cast in the lambda.
- Using a parameter pack will require the lambda to also use a parameter pack. Also, how do we pack the wi_data elements and pass them to the lambda (see line 22 in https://godbolt.org/z/E1sT5oY3P)? I don't think passing a variadic number of matrices will be useful, as the user can implement any number of arguments with the two-argument version. As an optimization, we can also add a three-argument version to avoid using intermediate matrices in the case of a ternary operation like C = A * B.

What do you think?
> - I think it is safer to require the same `use` and `layout`. To change the `use` or `layout`, `joint_matrix_copy` can be used: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc#copy. `T` should be the same. This is because there are different casts that can be used depending on `T`. It is better to have the user specify the right cast in the lambda.
I don't understand why it is "safer" to require `use`, `layout`, and `T` to be the same for the two matrices. In what sense is it safer? What use cases do you have in mind for the form that takes two matrices? For these use cases, will it always be the case that `use`, `layout`, and `T` are the same?

Maybe there is an implementation problem. The current implementation probably assumes that the distribution of matrix elements across work-items will be the same for the two matrices. For example, I suspect your implementation assumes that `wi_data[j]` for work-item `I` has the same Row / Col coordinate in both the A and B matrices. Does this assumption become invalidated if `use`, `layout`, or `T` are different for A and B?
Correct, if `use`, `layout`, and `T` are different, there is no guarantee that work-items will own the same number of elements.

Yes, the CUDA use cases suggest `use`, `layout`, and `T` are the same.
OK.

You should clarify the description:

- Clarify that `x` is an element from `jm0` and `y` is an element from `jm1`.
- Clarify that `x` and `y` are guaranteed to have identical coordinates in their respective matrices.
@gmlueck, as a correction: if `use` and `layout` are different, there is no guarantee of ownership. So changing the type should be fine. I will relax the type restriction to enable the use cases where a conversion with extra arguments is needed.

@AlexeySachkov, yes, it is on my TODO list. I will give it higher priority and work on it this week.
> `jm0` and `jm1` that have the same `use`, number of rows, number of
> columns, and `layout`. `jm0` and `jm1` can be read-only, write-only,
> or read and write arguments. The callable object must be invocable
> with two parameters `x` and `y` of type `T&` where `x` is an element
Suggested change:

```diff
-with two parameters `x` and `y` of type `T&` where `x` is an element
+with two parameters `x` and `y` of types `T0&` and `T1&`, where `x` is an element
```
Currently, CUDA code that uses this pattern:

```
for (int i = 0; i < c_frag.num_elements; i++) {
  c_frag.x[i] = alpha * acc_frag.x[i] + beta * c_frag.x[i];
}
```

cannot be migrated to SYCL joint matrix. This added overload addresses this limitation.
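A hedged sketch of how that loop might look after migration, using the overload added by this PR (assumes a sub-group `sg` and `joint_matrix` objects `acc_frag` and `c_frag` declared per the extension spec; this is a fragment against an experimental API, not a complete compilable kernel):

```
// Element-wise: y (an element of c_frag) becomes
// alpha * x (the corresponding element of acc_frag) + beta * y.
joint_matrix_apply(sg, acc_frag, c_frag, [=](const T &x, T &y) {
  y = alpha * x + beta * y;
});
```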