CUDA_ERROR_ILLEGAL_ADDRESS is thrown when running chirp test in CUDA environment #305

Closed
ryanz22 opened this issue Jan 12, 2023 · 0 comments

ryanz22 commented Jan 12, 2023

After cloning the repo and installing all required dependencies, all unit tests run correctly on CPU, as shown below.

poetry run python -m unittest discover -s chirp/tests -p "*test.py"
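
For reference, one way to reproduce the CPU-only run on a machine where the GPU is visible is to hide the device from the process (a generic CUDA mechanism, nothing Chirp-specific):

CUDA_VISIBLE_DEVICES="" poetry run python -m unittest discover -s chirp/tests -p "*test.py"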

But CUDA_ERROR_ILLEGAL_ADDRESS is thrown when running the chirp tests in a CUDA environment. Here is my CUDA environment:

GPU: Nvidia RTX 3090
Ubuntu: 22.10
Driver version: 525.60.13
Python version: 3.10
cudatoolkit version: 11.8
cudnn version: 8.6
jax version: 0.4.1
jaxlib version: 0.4.1+cuda11.cudnn86
flax version: 0.6.3
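
Other JAX code runs fine in this environment (see the note at the end of this report); as a rough illustration, a minimal sanity check of the kind that passes without any error (a hypothetical snippet, not from the repo):

import jax
import jax.numpy as jnp

# The RTX 3090 should show up here.
print(jax.devices())

# A simple jitted computation on the GPU, forced to completion.
x = jnp.ones((1024, 1024))
y = jax.jit(lambda a: a @ a.T)(x)
y.block_until_ready()
print(float(y[0, 0]))  # 1024.0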

The first issue I ran into was an OOM failure, so I followed the instructions under "GPU memory allocation" to use CPU-only TensorFlow.
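
Roughly, "CPU-only TensorFlow" here means keeping the GPU hidden from TF so the tf.data input pipeline does not compete with JAX for the 3090's memory. A minimal sketch of that idea (my own illustration, not the exact instruction from the repo):

import tensorflow as tf

# Hide all GPUs from TensorFlow; the input pipeline then stays on the CPU
# and leaves GPU memory to JAX/XLA.
tf.config.set_visible_devices([], 'GPU')
assert tf.config.get_visible_devices('GPU') == []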

tests/sep_train_test.py TrainSeparationTest.test_eval_one_step works fine.

PYTHONPATH=. python chirp/tests/sep_train_test.py TrainSeparationTest.test_eval_one_step

but tests/sep_train_test.py TrainSeparationTest.test_train_one_step consistently reports a CUDA_ERROR_ILLEGAL_ADDRESS error in my CUDA environment.

PYTHONPATH=. python chirp/tests/sep_train_test.py TrainSeparationTest.test_train_one_step

Error message:

I0111 16:41:57.470412 140120040671040 pipeline.py:742] Splitting batch across 1 devices, with local device count 1.
I0111 16:42:07.193832 140120040671040 utils.py:33] Checkpoint.restore_or_initialize() ...
I0111 16:42:07.193934 140120040671040 utils.py:33] MultihostCheckpoint.get_latest_checkpoint_to_restore_from() ...
I0111 16:42:07.194855 140120040671040 checkpoint.py:508] /tmp/tmpl5pjxa1ptrain_dir-0 not in []
I0111 16:42:07.194904 140120040671040 utils.py:43] MultihostCheckpoint.get_latest_checkpoint_to_restore_from() finished after 0.00s.
I0111 16:42:07.194927 140120040671040 checkpoint.py:346] Storing initial version.
I0111 16:42:07.194949 140120040671040 utils.py:33] Checkpoint.save() ...
I0111 16:42:07.195027 140120040671040 checkpoint.py:304] Storing next checkpoint '/tmp/tmpl5pjxa1ptrain_dir-0/ckpt-1'
I0111 16:42:07.219635 140120040671040 utils.py:43] Checkpoint.save() finished after 0.02s.
I0111 16:42:07.219720 140120040671040 utils.py:43] Checkpoint.restore_or_initialize() finished after 0.03s.
/home/ryanz/projects/ml/source-separation/chirp/chirp/models/metrics.py:188: FutureWarning: The sym_pos argument to solve() is deprecated and will be removed in a future JAX release. Use assume_a='pos' instead.
  return scipy.linalg.solve(
2023-01-11 16:42:13.886039: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1032] could not wait stream on event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2023-01-11 16:42:13.886061: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/stream.cc:1112] Error waiting for event in stream: error recording waiting for CUDA event on stream 0x562ab718b7a0; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2023-01-11 16:42:13.886070: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1159] failed to enqueue async memcpy from device to host: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; host dst: 0x7f6d38002560; GPU src: 0x7f67441dfe00; size: 4=0x4
2023-01-11 16:42:13.886077: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/stream.cc:327] Error recording event in stream: Error recording CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2023-01-11 16:42:13.886094: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:614] unable to add host callback: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
E0111 16:42:13.888874 140076977440448 asynclib.py:139] Error in producer thread for AsyncWriter
Traceback (most recent call last):
  File "/home/ryanz/miniconda3/envs/new-tf/lib/python3.10/site-packages/clu/asynclib.py", line 135, in trap_errors
    return fn(*args, **kwargs)
  File "/home/ryanz/miniconda3/envs/new-tf/lib/python3.10/site-packages/clu/metric_writers/logging_writer.py", line 44, in write_scalars
    values = [
  File "/home/ryanz/miniconda3/envs/new-tf/lib/python3.10/site-packages/clu/metric_writers/logging_writer.py", line 45, in <listcomp>
    f"{k}={v:.6f}" if isinstance(v, float) else f"{k}={v}"
  File "/home/ryanz/miniconda3/envs/new-tf/lib/python3.10/site-packages/jax/_src/array.py", line 252, in __format__
    return format(self._value[()], format_spec)
  File "/home/ryanz/miniconda3/envs/new-tf/lib/python3.10/site-packages/jax/_src/array.py", line 487, in _value
    self._npy_value = np.asarray(self._arrays[0])  # type: ignore
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: stream did not block host until done; was already in an error state
I0111 16:42:13.892304 140120040671040 utils.py:33] Checkpoint.save() ...
E0111 16:42:13.892458 140076985833152 asynclib.py:139] Error in producer thread for AsyncWriter
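
(As an aside, the FutureWarning above from chirp/models/metrics.py looks unrelated to the crash; assuming the call is jax.scipy.linalg.solve, the change the warning suggests would look roughly like this, with placeholder a/b values:)

import jax.numpy as jnp
from jax.scipy import linalg

# Placeholder symmetric positive-definite system.
a = jnp.array([[2.0, 1.0], [1.0, 2.0]])
b = jnp.array([1.0, 0.0])

# Deprecated spelling that triggers the warning:
#   linalg.solve(a, b, sym_pos=True)
# Replacement suggested by the warning:
x = linalg.solve(a, b, assume_a='pos')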

I tried to trace the cause by commenting out lines in this function, and it is the separator.train() call that causes the error:

  def test_train_one_step(self):
    config = self._get_test_config(use_small_encoder=True)
    ds, _ = self._get_test_dataset(
        'train',
        config,
    )
    model = separator.initialize_model(
        workdir=self.train_dir, **config.init_config)

    separator.train(
        *model, train_dataset=ds, logdir=self.train_dir, **config.train_config)
    ckpt = checkpoint.MultihostCheckpoint(self.train_dir)
    self.assertIsNotNone(ckpt.latest_checkpoint)
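
For what it's worth, one generic way to narrow this down further (a hedged sketch, not something from the repo and not something I have fully verified) is to disable jit before the failing call, so ops run one at a time and the traceback lands closer to the op doing the illegal access:

import jax

# Run ops eagerly, without whole-step XLA compilation, so the failure is
# raised near the offending operation instead of at the next stream sync.
jax.config.update('jax_disable_jit', True)

# model, ds, config and self.train_dir as in the test above.
separator.train(
    *model, train_dataset=ds, logdir=self.train_dir, **config.train_config)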

It could be something wrong with my CUDA environment, but test_eval_one_step works fine, as does other JAX code.

It would be great if the Chirp team could share their CUDA environment setup.

Thanks,

Ryan

sdenton4 closed this as not planned on Dec 13, 2024.