Closed
Labels
bug: Something doesn't quite look right
Description
Describe the bug
cuda-memcheck reports a continuous stream of errors when running example-mnist-classification, like this:
========= Invalid __global__ write of size 4
========= at 0x00001780 in void copy_kernel<float>(cublasCopyParams<float>)
========= by thread (191,0,0) in block (0,0,0)
========= Address 0x7fd319043efc is out of bounds
To Reproduce
Steps to reproduce the behaviour:
cargo build
cuda-memcheck target/debug/example-mnist-classification mnist linear
Expected behavior
No errors.
Please complete the following information:
- System: Manjaro Linux
- Version: Git commit 854043d
- Rust: rustc 1.62.1 (e092d0b6b 2022-07-16)
- Environment:
- Backends (if relevant):
- cuda:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.57       Driver Version: 515.57       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   51C    P8     5W /  N/A |      4MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1936      G   /usr/lib/Xorg                       4MiB |
+-----------------------------------------------------------------------------+
Additional context
Note that running example-mnist-classification without cuda-memcheck works fine and the model converges. I only discovered this while working on #159, where training with CUDA crashes with CUDA_ERROR_ILLEGAL_ADDRESS when copying from GPU to host. I'm not sure it's the same issue, but it seems related.