Segfault after #3821 #3856
Comments
Will look into this!
Thank you!
Hmm... this bug is a bit strange since the segfault occurs in ucx. Just confirming these python_frontend tests only test single device behavior, right?
That's right!
Hmm.. I have tried to reproduce this multiple times, but the failing test runs fine locally.
What a stubborn bug unfortunately... Just to think aloud: https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/jobs/147631713/viewer#L2777 shows
So apparently a nullptr occurred without being caught immediately, and the code then applied an offset to it. The call stack points to this cloning. So it might help to add more nullptr checks in that function or the functions it calls.
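To make the suggestion about nullptr checks concrete, here is a minimal sketch of a fail-fast check placed in front of a clone. `Node`, `checkedDeref`, and `cloneNode` are hypothetical names used only for illustration; they are not nvFuser's actual classes or functions.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an IR node; not an nvFuser type.
struct Node {
  int id;
  std::string name;
};

// Rejects a null pointer at the point of use. Without this check, a member
// access on a null pointer reads from a small address (null plus the member's
// offset), which is what a segfault at a near-zero address usually means.
const Node& checkedDeref(const Node* node, const char* what) {
  if (node == nullptr) {
    throw std::invalid_argument(std::string("null pointer: ") + what);
  }
  return *node;
}

// Hypothetical clone routine that validates its input before copying.
Node cloneNode(const Node* src) {
  return checkedDeref(src, "clone source");
}

// Small walk-through: a valid node is copied, a null source is rejected.
bool demoNullIsRejected() {
  Node original{7, "tv0"};
  assert(cloneNode(&original).id == 7);
  try {
    cloneNode(nullptr);
  } catch (const std::invalid_argument&) {
    return true;
  }
  return false;
}
```

A check like this only catches null. It cannot detect a pointer to memory that has already been freed, which is where ASan helps.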
I'd also try asan: https://github.com/NVIDIA/Fuser/wiki/Developer-guide#asan
Thanks for the pointers! I'll try out asan.
As expected, there is a memory bug (use of a freed heap pointer) that existed before the PR; the PR just happened to surface it. To reproduce:
With some print statements, it looks like the error happens when we try to clone a TensorView that had fallen out of scope. Since the original data was freed by a unique pointer, there must be a bug where we saved a raw pointer and kept using it.
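The lifetime pattern described here can be sketched in isolation. This is an illustration under assumptions, not the actual fix: `Node` and `cloneIfAlive` are hypothetical, and the sketch swaps in `std::shared_ptr`/`std::weak_ptr` for the `unique_ptr` plus raw pointer pair so that the expired state becomes detectable instead of being undefined behavior.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for an IR node such as a TensorView.
struct Node {
  std::string name;
};

// Clones the observed node if it is still alive and returns nullptr if the
// owner has already destroyed it. A saved raw Node* offers no such check:
// copying through it after the owner is gone would read freed memory.
std::unique_ptr<Node> cloneIfAlive(const std::weak_ptr<Node>& observer) {
  std::shared_ptr<Node> alive = observer.lock();
  if (alive == nullptr) {
    return nullptr;
  }
  return std::make_unique<Node>(*alive);
}

// Walks through the sequence: observe, clone, destroy the owner, clone again.
bool demoDetectsExpiredNode() {
  auto owner = std::make_shared<Node>(Node{"tv0"});
  std::weak_ptr<Node> observer = owner;

  std::unique_ptr<Node> first = cloneIfAlive(observer);
  assert(first != nullptr);
  bool cloned_while_alive = first->name == "tv0";

  owner.reset();  // the node falls out of scope here

  std::unique_ptr<Node> second = cloneIfAlive(observer);
  bool refused_after_free = second == nullptr;

  return cloned_while_alive && refused_after_free;
}
```

Shared ownership has a cost, so in a real codebase the more likely fix is to stop caching the raw pointer or to refresh it whenever the owner replaces the object.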
I merged #3821 too quickly. The CI indeed showed the same error.
To reproduce this,