
Question about ARF #23

Open
grimoire opened this issue Jun 7, 2023 · 0 comments

grimoire commented Jun 7, 2023

Hi,
I am a little confused by the ARF CUDA kernel.

```cuda
) {
  // index
  const int n = blockIdx.x * blockDim.x + threadIdx.x;
  if (n < count) {
    const uint16_t l = n % nEntry;
    const uint16_t j = (n / nEntry) % nInputPlane;
    const uint16_t i = n / nEntry / nInputPlane;
    const scalar_t val = weight_flatten[n];
    for (uint16_t k = 0; k < nRotation; k++) {
      const uint16_t index = (uint16_t)indices_flatten[l][k] - 1;
      output[i][k][j][index] = val;
    }
  }
}
```

Let's say thread 0 and thread 1 have:

i0 == i1
j0 == j1
k0 == k1
index0 == index1

Then both threads write to the same element `output[i][k][j][index]`. I don't think that is expected: different threads writing to the same memory address leads to unpredictable results.

Did I misunderstand the implementation?

PS: Is yzhou.work still available?
