CUDA_ERROR_ILLEGAL_ADDRESS is thrown when running chirp test in CUDA environment #305

Closed
ryanz22 opened this issue Jan 12, 2023 · 0 comments

ryanz22 commented Jan 12, 2023

After cloning the repo and installing all required dependencies, all unit tests run correctly on CPU, as shown below.

poetry run python -m unittest discover -s chirp/tests -p "*test.py"
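
For reference, one way to reproduce the CPU-only run on a machine where the GPU is visible is to hide the device from the process (a generic CUDA mechanism, nothing Chirp-specific):

CUDA_VISIBLE_DEVICES="" poetry run python -m unittest discover -s chirp/tests -p "*test.py"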

But CUDA_ERROR_ILLEGAL_ADDRESS is thrown when running the chirp tests in a CUDA environment. Here is my CUDA environment:

GPU: Nvidia RTX 3090
Ubuntu: 22.10
Driver version: 525.60.13
Python version: 3.10
cudatoolkit version: 11.8
cudnn version: 8.6
jax version: 0.4.1
jaxlib version: 0.4.1+cuda11.cudnn86
flax version: 0.6.3
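
Other JAX code runs fine in this environment (see the note at the end of this report); as a rough illustration, a minimal sanity check of the kind that passes without any error (a hypothetical snippet, not from the repo):

import jax
import jax.numpy as jnp

# The RTX 3090 should show up here.
print(jax.devices())

# A simple jitted computation on the GPU, forced to completion.
x = jnp.ones((1024, 1024))
y = jax.jit(lambda a: a @ a.T)(x)
y.block_until_ready()
print(float(y[0, 0]))  # 1024.0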

The first issue I ran into was an OOM failure, so I followed the instructions under "GPU memory allocation" to use CPU-only TensorFlow.
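
Roughly, "CPU-only TensorFlow" here means keeping the GPU hidden from TF so the tf.data input pipeline does not compete with JAX for the 3090's memory. A minimal sketch of that idea (my own illustration, not the exact instruction from the repo):

import tensorflow as tf

# Hide all GPUs from TensorFlow; the input pipeline then stays on the CPU
# and leaves GPU memory to JAX/XLA.
tf.config.set_visible_devices([], 'GPU')
assert tf.config.get_visible_devices('GPU') == []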

tests/sep_train_test.py TrainSeparationTest.test_eval_one_step works fine.

PYTHONPATH=. python chirp/tests/sep_train_test.py TrainSeparationTest.test_eval_one_step

but tests/sep_train_test.py TrainSeparationTest.test_train_one_step consistently reports a CUDA_ERROR_ILLEGAL_ADDRESS error in my CUDA environment.

PYTHONPATH=. python chirp/tests/sep_train_test.py TrainSeparationTest.test_train_one_step

Error message:

I0111 16:41:57.470412 140120040671040 pipeline.py:742] Splitting batch across 1 devices, with local device count 1.
I0111 16:42:07.193832 140120040671040 utils.py:33] Checkpoint.restore_or_initialize() ...
I0111 16:42:07.193934 140120040671040 utils.py:33] MultihostCheckpoint.get_latest_checkpoint_to_restore_from() ...
I0111 16:42:07.194855 140120040671040 checkpoint.py:508] /tmp/tmpl5pjxa1ptrain_dir-0 not in []
I0111 16:42:07.194904 140120040671040 utils.py:43] MultihostCheckpoint.get_latest_checkpoint_to_restore_from() finished after 0.00s.
I0111 16:42:07.194927 140120040671040 checkpoint.py:346] Storing initial version.
I0111 16:42:07.194949 140120040671040 utils.py:33] Checkpoint.save() ...
I0111 16:42:07.195027 140120040671040 checkpoint.py:304] Storing next checkpoint '/tmp/tmpl5pjxa1ptrain_dir-0/ckpt-1'
I0111 16:42:07.219635 140120040671040 utils.py:43] Checkpoint.save() finished after 0.02s.
I0111 16:42:07.219720 140120040671040 utils.py:43] Checkpoint.restore_or_initialize() finished after 0.03s.
/home/ryanz/projects/ml/source-separation/chirp/chirp/models/metrics.py:188: FutureWarning: The sym_pos argument to solve() is deprecated and will be removed in a future JAX release. Use assume_a='pos' instead.
  return scipy.linalg.solve(
2023-01-11 16:42:13.886039: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1032] could not wait stream on event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2023-01-11 16:42:13.886061: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/stream.cc:1112] Error waiting for event in stream: error recording waiting for CUDA event on stream 0x562ab718b7a0; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2023-01-11 16:42:13.886070: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1159] failed to enqueue async memcpy from device to host: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; host dst: 0x7f6d38002560; GPU src: 0x7f67441dfe00; size: 4=0x4
2023-01-11 16:42:13.886077: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/stream.cc:327] Error recording event in stream: Error recording CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2023-01-11 16:42:13.886094: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:614] unable to add host callback: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
E0111 16:42:13.888874 140076977440448 asynclib.py:139] Error in producer thread for AsyncWriter
Traceback (most recent call last):
  File "/home/ryanz/miniconda3/envs/new-tf/lib/python3.10/site-packages/clu/asynclib.py", line 135, in trap_errors
    return fn(*args, **kwargs)
  File "/home/ryanz/miniconda3/envs/new-tf/lib/python3.10/site-packages/clu/metric_writers/logging_writer.py", line 44, in write_scalars
    values = [
  File "/home/ryanz/miniconda3/envs/new-tf/lib/python3.10/site-packages/clu/metric_writers/logging_writer.py", line 45, in <listcomp>
    f"{k}={v:.6f}" if isinstance(v, float) else f"{k}={v}"
  File "/home/ryanz/miniconda3/envs/new-tf/lib/python3.10/site-packages/jax/_src/array.py", line 252, in __format__
    return format(self._value[()], format_spec)
  File "/home/ryanz/miniconda3/envs/new-tf/lib/python3.10/site-packages/jax/_src/array.py", line 487, in _value
    self._npy_value = np.asarray(self._arrays[0])  # type: ignore
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: stream did not block host until done; was already in an error state
I0111 16:42:13.892304 140120040671040 utils.py:33] Checkpoint.save() ...
E0111 16:42:13.892458 140076985833152 asynclib.py:139] Error in producer thread for AsyncWriter
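
(As an aside, the FutureWarning above from chirp/models/metrics.py looks unrelated to the crash; assuming the call is jax.scipy.linalg.solve, the change the warning suggests would look roughly like this, with placeholder a/b values:)

import jax.numpy as jnp
from jax.scipy import linalg

# Placeholder symmetric positive-definite system.
a = jnp.array([[2.0, 1.0], [1.0, 2.0]])
b = jnp.array([1.0, 0.0])

# Deprecated spelling that triggers the warning:
#   linalg.solve(a, b, sym_pos=True)
# Replacement suggested by the warning:
x = linalg.solve(a, b, assume_a='pos')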

I tried to trace the cause by commenting out lines in this function, and it is the separator.train() call that causes the error:

  def test_train_one_step(self):
    config = self._get_test_config(use_small_encoder=True)
    ds, _ = self._get_test_dataset(
        'train',
        config,
    )
    model = separator.initialize_model(
        workdir=self.train_dir, **config.init_config)

    separator.train(
        *model, train_dataset=ds, logdir=self.train_dir, **config.train_config)
    ckpt = checkpoint.MultihostCheckpoint(self.train_dir)
    self.assertIsNotNone(ckpt.latest_checkpoint)
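
For what it's worth, one generic way to narrow this down further (a hedged sketch, not something from the repo and not something I have fully verified) is to disable jit before the failing call, so ops run one at a time and the traceback lands closer to the op doing the illegal access:

import jax

# Run ops eagerly, without whole-step XLA compilation, so the failure is
# raised near the offending operation instead of at the next stream sync.
jax.config.update('jax_disable_jit', True)

# model, ds, config and self.train_dir as in the test above.
separator.train(
    *model, train_dataset=ds, logdir=self.train_dir, **config.train_config)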

It could be something wrong with my CUDA environment, but test_eval_one_step works fine, as does other JAX code.

It would be great if the Chirp team could share their CUDA environment setup.

Thanks,

Ryan

sdenton4 closed this as not planned on Dec 13, 2024.