Currently a reduction across columns (as opposed to rows) is slow, since it naively takes in a transpose operator and indexes it with a random-access iterator. This causes adjacent threads to make strided accesses with a stride equal to the second dimension.

We experimented with an `einsum` transpose followed by a reduction, and measured the results on an A30 (benchmark table not captured here; the two variants compared were `permute` + `sum` and `einsum` + `sum`).

The `einsum` version is quite a bit faster since it hits SoL (speed of light, i.e. peak memory bandwidth) on the transpose. However, there is still room for improvement: a kernel aware of both the transpose and the reduction can be faster by tiling. This issue is to implement that kernel and compare performance.