Error in cuDNN manager constructor #308

Closed
timmoon10 opened this issue Apr 30, 2018 · 8 comments

@timmoon10
Collaborator

This is based on @davidHysom's comment in #306:

This lbann_exception is about to be thrown:/usr/workspace/wsb/hysom/DATA_STORE/lbann/src/utils/cudnn_wrapper.cpp 349 :: CUDA error; status: 30 err string: unknown error

This is in the cudnn_manager ctor. I'd really like to get this working so I can get some additional results to Brian for the LDRD review(s). For pilot1 (merge_features) there's only a very slight execution-time advantage to using data_store. However, that's running without CUDA. I suspect I could show some advantage if I could run with CUDA, since the file-reading/computation ratio would be more favorable.

@timmoon10 timmoon10 added the bug label Apr 30, 2018
@timmoon10
Collaborator Author

timmoon10 commented Apr 30, 2018

The error seems to show up in cudaStreamCreate and error code 30 is cudaErrorUnknown. I'm perplexed by this error since cudnn_manager is initialized before the data store in lbann.cpp, so there should be no difference in how cudnn_manager's constructor is called.

@ndryden
Collaborator

ndryden commented Apr 30, 2018

On Pascal, I've consistently seen this intermittent error when running LBANN, which looks related:

code was compiled with LBANN_HAS_CUDNN, and we are using cudnn
CUDA error: unknown error
Error at /usr/workspace/wsa/dryden1/lbann-ndryden/src/utils/cudnn_wrapper.cpp:349

**************************************************************************
 This lbann_exception is about to be thrown:CUDA error

 Am now attempting to print the stack trace ...
**************************************************************************
rank: 6 :: rank: 6 :: rank: 6 :: lbann::cudnn::cudnn_manager::cudnn_manager(lbann::lbann_comm*, int, bool)
rank: 6 :: dli_sname == NULL for: 0x4064b5 backtrace message was: /usr/workspace/wsa/dryden1/lbann-ndryden/./build/gnu.pascal.llnl.gov/install/bin/lbann() [0x4064b5]
rank: 6 :: demangling failed for: __libc_start_main
rank: 6 :: dli_sname == NULL for: 0x40700b backtrace message was: /usr/workspace/wsa/dryden1/lbann-ndryden/./build/gnu.pascal.llnl.gov/install/bin/lbann() [0x40700b]

It typically goes away if I just rerun LBANN, so I haven't thought too much about it.
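
Since the failure is intermittent and a rerun usually clears it, one stopgap is to retry the failing call a bounded number of times before giving up. The following is only a minimal sketch of that idea, not LBANN code: `retry_on_failure`, `retry_result`, and the two status constants are hypothetical names, and plain `int` codes stand in for `cudaError_t` so the example compiles without the CUDA toolkit. In real code the callable would presumably wrap the `cudaStreamCreate` call.

```cpp
#include <cassert>
#include <functional>

// Stand-in status codes for this sketch only; real code would use cudaError_t.
constexpr int kStatusSuccess = 0;
constexpr int kStatusUnknown = 30;  // the status value reported in this thread

// Hypothetical helper type: outcome of a bounded retry loop.
struct retry_result {
  int status;    // status returned by the last attempt
  int attempts;  // number of attempts actually made
};

// Calls `op` until it returns kStatusSuccess or `max_attempts` is reached.
// `op` is any callable returning an int status (assumption: 0 means success).
inline retry_result retry_on_failure(const std::function<int()>& op,
                                     int max_attempts) {
  assert(max_attempts > 0);
  retry_result result{kStatusUnknown, 0};
  while (result.attempts < max_attempts) {
    result.status = op();
    ++result.attempts;
    if (result.status == kStatusSuccess) {
      break;  // stop as soon as one attempt succeeds
    }
  }
  return result;
}
```

A retry like this only hides the symptom. If the root cause is a driver or build-flag problem, the last failing status should still be reported so it can be investigated.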

@timmoon10
Collaborator Author

timmoon10 commented Apr 30, 2018

This is probably a driver problem: cbuchner1/CudaMiner#104
Or we're not feeding the right flags into nvcc: cbuchner1/CudaMiner#162

@davidHysom
Collaborator

Tim, this is not specific to data_store; the same thing happens when I run without it.

@timmoon10
Collaborator Author

What are you running on? Surface gpgpu (at least surface172) is running the current develop build without any problems.

@davidHysom
Collaborator

davidHysom commented Apr 30, 2018 via email

@davidHysom
Collaborator

davidHysom commented Apr 30, 2018 via email

@JaeseungYeom
Contributor

JaeseungYeom commented May 1, 2018 via email
