Data corruption with CUDA hijack off #1782
Comments
Here's the driver on the node in question. I guess it's too old to have
$ srun -n 1 -N 1 nvidia-smi
Fri Oct 25 11:28:03 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
That was my last idea. Suggestions?
Edit: actually, I guess the driver is distinct from the CUDA version. Can someone check if this driver has
It does not; you need to compile with at least CUDA 12.5, and the driver needs to be at least r550, I believe. I don't think cuCtxRecordEvent is the problem here. My guess is that there is a race somewhere that the hijack somehow makes less likely. I would look at the address that is causing the failure and trace back where it came from and why it's invalid.
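For anyone checking other nodes, the driver requirement mentioned above can be sketched as a small check on the version string that nvidia-smi prints. This is a minimal sketch, assuming the r550 threshold quoted in the comment is right (it is stated there as a belief, not confirmed). The names `driver_major`, `driver_may_support_ctx_record_event`, and `kAssumedMinDriverBranch` are hypothetical and are not part of CUDA or Realm.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper: extract the leading (major) component of a driver
// version string such as "535.183.06", the format printed by nvidia-smi.
std::optional<int> driver_major(const std::string &version) {
  std::size_t pos = 0;
  int major = 0;
  while (pos < version.size() && version[pos] >= '0' && version[pos] <= '9') {
    major = major * 10 + (version[pos] - '0');
    ++pos;
  }
  if (pos == 0) return std::nullopt;  // no leading digits, cannot parse
  return major;
}

// Assumption taken from the comment above: cuCtxRecordEvent needs a driver
// branch of at least r550. Treat this threshold as unverified.
constexpr int kAssumedMinDriverBranch = 550;

// Returns true only if the version parses and meets the assumed threshold.
bool driver_may_support_ctx_record_event(const std::string &version) {
  const std::optional<int> major = driver_major(version);
  return major.has_value() && *major >= kAssumedMinDriverBranch;
}
```

With the version string from the nvidia-smi output above, `driver_may_support_ctx_record_event("535.183.06")` returns false, which agrees with the answer that this driver does not have it.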
@elliottslaughter Can you run with
@elliottslaughter any updates on this?
Unfortunately, I need this working sooner rather than later, so I will probably hack the build so I can turn the hijack back on. I'll try to find some time to respond to the specific debug suggestion as well, but first priority is to get the app up and running.
This is a follow-on from #1682. I'm building S3D on a variety of machines, and the behavior I currently see is:
In the case with the CUDA hijack off, errors look like:
Because it goes away with hijack (at least on Sapling), this smells like a synchronization issue.
What's notable to me here is that Frontier and Perlmutter (with hijack off) should be following the same code paths. HIP, as you may recall, has never really had a hijack, so users have always been required to query the task HIP stream for kernel launches. I have now gone and unified the code so that, as much as possible, we are running identical code in the CUDA case. It should be difficult or impossible for the HIP code to be obtaining the task stream but not doing so in the CUDA case. However, we still hit the issue above.
Since I suspected a synchronization issue, I went and commented out every instance of set_task_ctxsync_required(false) in the application. If I understand correctly, this should force a synchronization after every task. This is the primary difference between hijack and non-hijack modes, so it seems like the more likely culprit. To be really sure, I also applied the following diff to Realm:
If I understand correctly, this ensures that we do not hit this code path, anywhere in the application. But after rebuilding I'm still hitting the error above.
At this point, I wonder if cuCtxRecordEvent (from #1730 (comment)) is somehow not having the behavior we expect? Again, I don't see what else could be different between the hijack and non-hijack builds. We are running the same application code and Legion version. In the case of Sapling, I can literally rebuild with one flag set.
Is there a way to shut off the cuCtxRecordEvent code path and just do a plain old cuCtxSynchronize? Unless someone else has another suggestion, this seems like the next thing to try.
@muraj for visibility.