Hi, we ran into some strange behavior while using DataLoader2. Here are some details about the issue.
We are running a long-running training job on 8 AWS P4 nodes, using the HuggingFace Trainer.
During HuggingFace training, evaluation is called every `training_args.eval_steps` training steps.
I overrode the HF Trainer so that DataLoader2 is used for training, evaluation, and test dataset loading. On the dataset side, I'm using an IterDataPipe with ShardingFilterIterDataPipe, roughly as in the sketch below.
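For context, the override looks roughly like this (simplified; the helper name `build_example_datapipe` and the exact pipe chain are illustrative, not our real code):

```python
from torchdata.datapipes.iter import IterableWrapper
from torchdata.dataloader2 import DataLoader2, DistributedReadingService
from transformers import Trainer


def build_example_datapipe(samples):
    # Shard across ranks/workers, then batch and collate.
    dp = IterableWrapper(samples)
    dp = dp.shuffle()
    dp = dp.sharding_filter()   # ShardingFilterIterDataPipe
    dp = dp.batch(8).collate()  # CollatorIterDataPipe, as seen in the traceback below
    return dp


class DataPipeTrainer(Trainer):
    def get_eval_dataloader(self, eval_dataset=None):
        dp = build_example_datapipe(self.eval_dataset)
        # DistributedReadingService wraps the pipe in FullSyncIterDataPipe to keep
        # ranks in sync; that is the datapipe named in the exception below.
        reading_service = DistributedReadingService(timeout=1800)
        return DataLoader2(dp, reading_service=reading_service)
```

`get_train_dataloader` and `get_test_dataloader` are overridden the same way.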
The issue shown in the log below happens randomly, and most of the time it happens after the job has been running for a long time (e.g., 20+ hours).
Can you help provide some context on what could be the root cause and how to fix this? Thanks!
Log:
  File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 1979, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 2236, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 2932, in evaluate
    output = eval_loop(
  File "/workspace/mfive/mfive/trainer.py", line 236, in evaluation_loop
    for step, inputs in enumerate(dataloader):
  File "/opt/conda/lib/python3.9/site-packages/torchdata/dataloader2/dataloader2.py", line 46, in __next__
    next_val = next(self.dataloader._datapipe_iter)  # type: ignore[arg-type]
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 173, in wrap_generator
    response = gen.send(None)
  File "/opt/conda/lib/python3.9/site-packages/torchdata/datapipes/iter/util/distributed.py", line 178, in __iter__
    self._process_group = dist.new_group(backend="gloo")
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3520, in new_group
    pg = _new_process_group_helper(
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1009, in _new_process_group_helper
    backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:176] bind: Address already in use
This exception is thrown by __iter__ of FullSyncIterDataPipe(datapipe=CollatorIterDataPipe, timeout=1800)
@ejguan I'm only running one DDP job. The DDP job is launched by torchx. And I got these errors while running the job on AWS Batch and on SageMaker, where I believe all the instances are isolated and no other job should be running on them.
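Looking at the traceback again, `FullSyncIterDataPipe.__iter__` calls `dist.new_group(backend="gloo")`, and since the Trainer builds a fresh eval DataLoader2 on every evaluation, a new gloo group (and its TCP sockets) seems to get created every `eval_steps`. As an experiment (not verified yet), I'm thinking of caching the eval DataLoader2 so the same FullSync pipe, and therefore the same gloo process group, is reused for the whole run. Rough sketch, building on the `DataPipeTrainer` example above:

```python
# Unverified idea: reuse one eval DataLoader2 across evaluation calls so
# FullSyncIterDataPipe keeps a single gloo process group instead of
# allocating a new one on every evaluation pass.
class CachedEvalTrainer(DataPipeTrainer):
    _cached_eval_dataloader = None

    def get_eval_dataloader(self, eval_dataset=None):
        if self._cached_eval_dataloader is None:
            self._cached_eval_dataloader = super().get_eval_dataloader(eval_dataset)
        return self._cached_eval_dataloader
```

Not sure yet whether re-iterating the same DataLoader2 across evaluations is fully supported, so treat this as a guess rather than a fix.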