Error in cuDNN manager constructor #308
Comments
The error seems to show up in |
On Pascal, I've consistently seen this intermittent error when running LBANN, which looks related:
It typically goes away if I just rerun LBANN, so I haven't thought too much about it. |
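Since rerunning usually clears the error, the manual rerun can be automated with a small retry wrapper. This is only a sketch of a workaround, not a fix for the underlying problem; the `retry` helper and the launch command in the usage note are hypothetical and not part of LBANN.

```shell
# retry MAX_ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
    max_attempts=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed" >&2
        attempt=$((attempt + 1))
    done
    return 1
}
```

Usage would look like `retry 3 ./run_lbann.sh`, where `run_lbann.sh` stands in for whatever command launches the job. If the failure is really a driver problem, retries only hide it, so the failed attempts logged to stderr are worth keeping.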
This is probably a driver problem: cbuchner1/CudaMiner#104 |
Tim, this is not specific to data_store; the same thing happens when I run without it. |
What are you running on? Surface gpgpu (at least surface172) is running the current develop build without any problems. |
I'm running on surface with gnu compiler.
This morning I tried compiling with clang and got errors.
I'll do a fresh checkout/clean build and let you know what happens.
|
I can't run a fresh checkout/build until I can get at least one node on surface.
gnu release build went through.
clang failed.
$ build_lbann_lc.sh --verbose --make-processes 12 --compiler clang
COMPILER=clang
COMPILER_VERSION=3.7.0
Typical error (all seem related to hydrogen; and don't know why the error refers to):
In file included from /usr/workspace/wsb/hysom/TESTME2/lbann/build/clang.surface.llnl.gov/hydrogen/src/src/blas_like/level1/HilbertSchmidt.cpp:9:
In file included from /usr/workspace/wsb/hysom/TESTME2/lbann/build/clang.surface.llnl.gov/hydrogen/src/include/El-lite.hpp:13:
In file included from /usr/workspace/wsb/hysom/TESTME2/lbann/build/clang.surface.llnl.gov/hydrogen/src/include/El/core.hpp:18:
/usr/apps/gnu/4.8.5/lib64/gcc/x86_64-unknown-linux-gnu/4.8.5/../../../../include/c++/4.8.5/complex:1148:29: error: no matching member function for call to 'real'
__real__ _M_value += __z.real();
~~~~^~~~
|
You should use at least clang 4.0.0. Also, you can't use clang 4.0 on GPU platforms because nvcc does not support it yet. At least that was the case last time I checked. You can use clang on quartz or catalyst.
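A pre-build check along these lines could catch the clang 3.7.0 case before the compile starts. It is a minimal sketch: the function names and the `gpu`/`cpu` platform labels are made up for illustration and are not part of `build_lbann_lc.sh`, the version comparison assumes GNU `sort -V`, and the GPU rule only reflects the nvcc limitation as described above, which may no longer hold.

```shell
# Returns 0 if version $1 >= version $2 (relies on GNU `sort -V`).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# check_clang_for_build CLANG_VERSION PLATFORM
# PLATFORM is a hypothetical label: "gpu" (e.g. surface) or "cpu" (e.g. quartz, catalyst).
check_clang_for_build() {
    if ! version_at_least "$1" "4.0.0"; then
        echo "too-old"                 # older than the 4.0.0 minimum
    elif [ "$2" = "gpu" ]; then
        echo "unsupported-with-nvcc"   # assumption taken from the comment above
    else
        echo "ok"
    fi
}
```

With the version from the failed build, `check_clang_for_build 3.7.0 gpu` reports `too-old`.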
Somehow, my siamese runs using model-parallel fc and lrn consistently hang on surface. They were fine until recently. They still run on pascal but hang on ray quite often.
Edit: I think my issue might just be from some bad node. Now even my jobs using batchnorm, which were running fine until yesterday morning, hang on surface. But I still cannot explain why they also hang on ray.
|
This is based on @davidHysom's comment in #306: