I0111 16:41:57.470412 140120040671040 pipeline.py:742] Splitting batch across 1 devices, with local device count 1.
I0111 16:42:07.193832 140120040671040 utils.py:33] Checkpoint.restore_or_initialize() ...
I0111 16:42:07.193934 140120040671040 utils.py:33] MultihostCheckpoint.get_latest_checkpoint_to_restore_from() ...
I0111 16:42:07.194855 140120040671040 checkpoint.py:508] /tmp/tmpl5pjxa1ptrain_dir-0 not in []
I0111 16:42:07.194904 140120040671040 utils.py:43] MultihostCheckpoint.get_latest_checkpoint_to_restore_from() finished after 0.00s.
I0111 16:42:07.194927 140120040671040 checkpoint.py:346] Storing initial version.
I0111 16:42:07.194949 140120040671040 utils.py:33] Checkpoint.save() ...
I0111 16:42:07.195027 140120040671040 checkpoint.py:304] Storing next checkpoint '/tmp/tmpl5pjxa1ptrain_dir-0/ckpt-1'
I0111 16:42:07.219635 140120040671040 utils.py:43] Checkpoint.save() finished after 0.02s.
I0111 16:42:07.219720 140120040671040 utils.py:43] Checkpoint.restore_or_initialize() finished after 0.03s.
/home/ryanz/projects/ml/source-separation/chirp/chirp/models/metrics.py:188: FutureWarning: The sym_pos argument to solve() is deprecated and will be removed in a future JAX release. Use assume_a='pos' instead.
return scipy.linalg.solve(
2023-01-11 16:42:13.886039: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1032] could not wait stream on event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2023-01-11 16:42:13.886061: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/stream.cc:1112] Error waiting for event in stream: error recording waiting for CUDA event on stream 0x562ab718b7a0; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2023-01-11 16:42:13.886070: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1159] failed to enqueue async memcpy from device to host: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; host dst: 0x7f6d38002560; GPU src: 0x7f67441dfe00; size: 4=0x4
2023-01-11 16:42:13.886077: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/stream.cc:327] Error recording event in stream: Error recording CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2023-01-11 16:42:13.886094: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:614] unable to add host callback: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
E0111 16:42:13.888874 140076977440448 asynclib.py:139] Error in producer thread for AsyncWriter
Traceback (most recent call last):
File "/home/ryanz/miniconda3/envs/new-tf/lib/python3.10/site-packages/clu/asynclib.py", line 135, in trap_errors
return fn(*args, **kwargs)
File "/home/ryanz/miniconda3/envs/new-tf/lib/python3.10/site-packages/clu/metric_writers/logging_writer.py", line 44, in write_scalars
values = [
File "/home/ryanz/miniconda3/envs/new-tf/lib/python3.10/site-packages/clu/metric_writers/logging_writer.py", line 45, in <listcomp>
f"{k}={v:.6f}" if isinstance(v, float) else f"{k}={v}"
File "/home/ryanz/miniconda3/envs/new-tf/lib/python3.10/site-packages/jax/_src/array.py", line 252, in __format__
return format(self._value[()], format_spec)
File "/home/ryanz/miniconda3/envs/new-tf/lib/python3.10/site-packages/jax/_src/array.py", line 487, in _value
self._npy_value = np.asarray(self._arrays[0]) # type: ignore
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: stream did not block host until done; was already in an error state
I0111 16:42:13.892304 140120040671040 utils.py:33] Checkpoint.save() ...
E0111 16:42:13.892458 140076985833152 asynclib.py:139] Error in producer thread for AsyncWriter
After cloning the repo and installing all required dependencies, all unit tests pass on CPU.
However, CUDA_ERROR_ILLEGAL_ADDRESS is thrown when running the chirp tests in a CUDA environment. Here is my CUDA environment:
GPU: Nvidia RTX 3090
Ubuntu: 22.10
Driver version: 525.60.13
Python version: 3.10
cudatoolkit version: 11.8
cudnn version: 8.6
jax version: 0.4.1
jaxlib version: 0.4.1+cuda11.cudnn86
flax version: 0.6.3
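For anyone comparing setups, the Python-side versions listed above can be gathered with a small script. This is only a sketch: `collect_versions` is a hypothetical helper name, the package list is an assumption, and it reports installed package metadata only (driver, cudatoolkit, and cudnn versions have to be collected separately).

```python
import importlib.metadata
import platform


def collect_versions(packages=("jax", "jaxlib", "flax", "tensorflow")):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed are reported as "not installed"
    instead of raising, so the report is always complete.
    """
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Pasting the output into a bug report makes it easier to spot a jax/jaxlib mismatch at a glance.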
The first issue I ran into was an OOM failure, so I followed the instructions in the JAX "GPU memory allocation" documentation and used CPU-only TensorFlow.
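As an alternative to installing CPU-only TensorFlow, the same OOM can usually be avoided from inside the process. The sketch below assumes the standard JAX environment variables `XLA_PYTHON_CLIENT_PREALLOCATE` and `XLA_PYTHON_CLIENT_MEM_FRACTION` and TensorFlow's `tf.config.set_visible_devices`; the helper names are hypothetical, and this is not something the Chirp repo is known to do.

```python
import os


def configure_gpu_memory(preallocate=False, mem_fraction=None):
    """Set JAX GPU memory environment variables.

    Must be called before `import jax`, because the variables are read
    when the GPU backend is initialized.
    """
    os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "true" if preallocate else "false"
    if mem_fraction is not None:
        # Fraction of total GPU memory JAX may use, e.g. 0.50.
        os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = f"{mem_fraction:.2f}"


def hide_gpus_from_tensorflow():
    """Keep TensorFlow off the GPU so it does not compete with JAX.

    Returns True if TensorFlow was importable and configured, else False.
    Must run before TensorFlow initializes its devices.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return False
    tf.config.set_visible_devices([], "GPU")
    return True


if __name__ == "__main__":
    configure_gpu_memory(preallocate=False)
    print("TensorFlow GPUs hidden:", hide_gpus_from_tensorflow())
```

Both calls need to happen at the very top of the entry point, before any JAX or TensorFlow work is done.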
tests/sep_train_test.py TrainSeparationTest.test_eval_one_step works fine, but tests/sep_train_test.py TrainSeparationTest.test_train_one_step consistently reports a CUDA_ERROR_ILLEGAL_ADDRESS error in my CUDA environment.
Error message:
I tried to trace the cause by commenting out lines in this function, and the separator.train() call is what triggers the error.
It could be something wrong with my CUDA environment, but test_eval_one_step works fine, and so does other JAX code.
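Because JAX dispatches GPU work asynchronously, the traceback in the log (raised while the metric writer formats a scalar) may only mark where the host first read a device value, not where the faulty kernel ran. One way to narrow this down is to rerun the single failing test with the CUDA environment variable `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous. The sketch below is an assumption about how to invoke the test (running the test file directly with the test name, as written above); `build_sync_test_run` is a hypothetical helper.

```python
import os
import subprocess
import sys


def build_sync_test_run(
    test_file="tests/sep_train_test.py",
    test_name="TrainSeparationTest.test_train_one_step",
):
    """Build the command and environment for a synchronous-launch test run.

    Returns (cmd, env). The environment is a copy of the current one with
    CUDA_LAUNCH_BLOCKING=1 added, so the caller's environment is untouched.
    """
    env = dict(os.environ)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    cmd = [sys.executable, test_file, test_name]
    return cmd, env


if __name__ == "__main__":
    cmd, env = build_sync_test_run()
    # Run from the repository root so the relative test path resolves.
    result = subprocess.run(cmd, env=env)
    print("exit code:", result.returncode)
```

If the error then appears at a different point in the log, that location is a better starting point than the AsyncWriter traceback.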
It would be great if the Chirp team could share their CUDA environment setup.
Thanks,
Ryan