Training crashes after 50 epochs #290

Open
peastman opened this issue Feb 22, 2024 · 3 comments

@peastman
Collaborator

My training runs always crash after exactly 50 epochs. Looking at the log, there are many repetitions of this error:

Exception in thread Thread-104 (_pin_memory_loop):
Traceback (most recent call last):
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop
    do_one_step()
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 355, in rebuild_storage_fd
    fd = df.detach()
         ^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
           ^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/multiprocessing/reduction.py", line 164, in recvfds
    raise RuntimeError('received %d items of ancdata' %
RuntimeError: received 0 items of ancdata

and then it finally exits with this error:

  File "/home/peastman/miniconda3/envs/torchmd-net2/bin/torchmd-train", line 33, in <module>
    sys.exit(load_entry_point('torchmd-net', 'console_scripts', 'torchmd-train')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/peastman/workspace/torchmd-net/torchmdnet/scripts/train.py", line 220, in main
    trainer.fit(model, data, ckpt_path=None if args.reset_trainer else args.load_model)
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 136, in run
    self.advance(data_fetcher)
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 202, in advance
    batch, _, __ = next(data_fetcher)
                   ^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/loops/fetchers.py", line 127, in __next__
    batch = super().__next__()
            ^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/loops/fetchers.py", line 56, in __next__
    batch = next(self.iterator)
            ^^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/utilities/combined_loader.py", line 326, in __next__
    out = next(self._iterator)
          ^^^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/lightning/pytorch/utilities/combined_loader.py", line 74, in __next__
    out[i] = next(self.iterators[i])
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1328, in _next_data
    idx, data = self._get_data()
                ^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/torchmd-net2/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1289, in _get_data
    raise RuntimeError('Pin memory thread exited unexpectedly')
RuntimeError: Pin memory thread exited unexpectedly
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1127667, OpType=ALLREDUCE, NumelIn=288321, NumelOut=288321, Timeout(ms)=1800000) ran for 1800800 milliseconds before timing out.

Any idea what could be causing it?
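
For context, the repeated received 0 items of ancdata failures are usually tied to the process hitting its open-file-descriptor limit while DataLoader workers hand tensors to the pin-memory thread. A quick, generic check (a diagnostic sketch, not something taken from this run):

import resource
import torch.multiprocessing as mp

# Soft/hard limits on open file descriptors for this process; tensors shared
# between DataLoader workers and the pin-memory thread consume descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# A commonly cited mitigation (untested here) is to share tensors through
# the file system instead of file descriptors:
mp.set_sharing_strategy("file_system")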

@RaulPPelaez
Collaborator

Some users have started seeing similar behavior, so I added this workaround to the README:

Some CUDA systems might hang during a multi-GPU parallel training. Try export NCCL_P2P_DISABLE=1, which disables direct peer to peer GPU communication.
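
If it helps, a minimal sketch of applying that workaround from a Python launch script instead of the shell (assuming it runs before the Lightning Trainer and the NCCL process group are created):

import os

# NCCL only reads this at initialization, so it has to be set before the
# distributed backend starts up.
os.environ["NCCL_P2P_DISABLE"] = "1"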

Could it be the root of your issue too? I am assuming this is a multi-GPU training run.

I do not remember the error being as consistent as you describe (always exactly 50 epochs), so it might be unrelated.
OTOH the error suggests a relation to pinned memory, which makes me think of this:

dl = DataLoader(
    dataset=dataset,
    batch_size=batch_size,
    num_workers=self.hparams["num_workers"],
    persistent_workers=True,
    pin_memory=True,
    shuffle=shuffle,
)

It would be great if you could try persistent_workers=False and pin_memory=False (separately) and report back.
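
Concretely, the first experiment would look something like this (a sketch with a placeholder dataset and hard-coded hyperparameters, not the exact code in the repo):

from torch.utils.data import DataLoader, TensorDataset
import torch

# Placeholder dataset just to make the snippet self-contained; the real one
# comes from the torchmd-net data module.
dataset = TensorDataset(torch.randn(128, 3))

dl = DataLoader(
    dataset=dataset,
    batch_size=32,
    num_workers=4,
    persistent_workers=False,  # experiment 1: recreate workers every epoch
    pin_memory=True,
    shuffle=True,
)

# Experiment 2 (run separately): keep persistent_workers=True and set
# pin_memory=False instead, so the two flags are tested independently.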

@peastman
Collaborator Author

Thanks! Yes, this is with multiple GPUs. I just started a run with persistent_workers=False. I'll let you know what happens.

@peastman
Collaborator Author

Crossing my fingers, but I think persistent_workers=False fixed it. My latest training run is up to 70 epochs without crashing.
