
DataLoader2 with FullSyncIterDataPipe throws error during initialization #1190

Open
chenxingyu-cs opened this issue Jun 19, 2023 · 3 comments


@chenxingyu-cs

🐛 Describe the bug

Hi, we ran into some strange behavior while using DataLoader2. Here are some details about the issue.

  • We are running a long-running training job on 8 AWS P4 nodes, using the HuggingFace Trainer.
  • During training, the HuggingFace Trainer runs evaluation every training_args.eval_steps training steps.
  • I overrode the HF Trainer to use DataLoader2 for loading the training, evaluation, and test datasets. On the dataset side, I'm using an IterDataPipe pipeline with ShardingFilterIterDataPipe (a minimal sketch of this setup is included below).
  • The issue shown in the log below happens randomly, and most of the time only after the job has been running for a long time (e.g. 20+ hours).

Can you help provide some context on what could be the root cause and how to fix this? Thanks!
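
For context, here is a minimal sketch of the setup described above. It is not our actual pipeline: build_dataloader, the dummy samples, and the use of DistributedReadingService are placeholder assumptions, inferred from the FullSyncIterDataPipe(datapipe=CollatorIterDataPipe, timeout=1800) line in the traceback below.

```python
from torchdata.datapipes.iter import IterableWrapper
from torchdata.dataloader2 import DataLoader2, DistributedReadingService

def build_dataloader(samples):
    # Placeholder pipeline: the real one reads the training/eval/test data.
    dp = IterableWrapper(samples)
    dp = dp.sharding_filter()   # ShardingFilterIterDataPipe
    dp = dp.collate()           # CollatorIterDataPipe, as seen in the traceback
    # Assumption: a DistributedReadingService (or an explicit .fullsync())
    # wraps the pipeline in FullSyncIterDataPipe, which is where the error
    # in the log is raised.
    return DataLoader2(dp, reading_service=DistributedReadingService())
```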

Log:



All lines were logged at 2023-06-08T08:51:15.973-07:00:

```
  File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 1979, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 2236, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 2932, in evaluate
    output = eval_loop(
  File "/workspace/mfive/mfive/trainer.py", line 236, in evaluation_loop
    for step, inputs in enumerate(dataloader):
  File "/opt/conda/lib/python3.9/site-packages/torchdata/dataloader2/dataloader2.py", line 46, in __next__
    next_val = next(self.dataloader._datapipe_iter)  # type: ignore[arg-type]
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 173, in wrap_generator
    response = gen.send(None)
  File "/opt/conda/lib/python3.9/site-packages/torchdata/datapipes/iter/util/distributed.py", line 178, in __iter__
    self._process_group = dist.new_group(backend="gloo")
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3520, in new_group
    pg = _new_process_group_helper(
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1009, in _new_process_group_helper
    backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:176] bind: Address already in use
This exception is thrown by __iter__ of FullSyncIterDataPipe(datapipe=CollatorIterDataPipe, timeout=1800)
```
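
The failing frame is FullSyncIterDataPipe.__iter__ calling dist.new_group(backend="gloo"), so a new gloo process group (with its own TCP sockets) is created whenever that iterator is created, which in our case seems to happen on every evaluation pass. Below is a minimal, single-process sketch of just that call pattern; the address, port, and loop count are arbitrary placeholders, and this is only a hypothesis about the failure mode, not a guaranteed reproduction.

```python
import torch.distributed as dist

# Hypothetical illustration of the call that fails in the log above:
# FullSyncIterDataPipe.__iter__ runs `dist.new_group(backend="gloo")`
# (torchdata/datapipes/iter/util/distributed.py, line 178), so each time the
# pipeline is iterated another gloo group is created. The address and port
# here are placeholders; world_size=1 keeps the sketch runnable on one machine.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
)

groups = []
for eval_pass in range(100):
    # Same call as in FullSyncIterDataPipe.__iter__. If groups like these
    # accumulate without being destroyed, the sockets they bind might be the
    # source of the "bind: Address already in use" error (a hypothesis only;
    # this loop will not necessarily reproduce it).
    groups.append(dist.new_group(backend="gloo"))

dist.destroy_process_group()
```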

Versions

Versions of relevant libraries:
[pip3] flake8==6.0.0
[pip3] mypy==0.991
[pip3] mypy-boto3-batch==1.26.103
[pip3] mypy-boto3-ec2==1.26.136
[pip3] mypy-boto3-iam==1.26.97
[pip3] mypy-boto3-s3==1.26.127
[pip3] mypy-boto3-sagemaker==1.26.141
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.3
[pip3] torch==2.0.1
[pip3] torch-tb-profiler==0.4.1
[pip3] torchdata==0.6.1
[pip3] torchmetrics==0.11.4
[pip3] torchsnapshot-nightly==2023.3.15
[pip3] torchvision==0.15.2
[pip3] torchx-nightly==2023.5.25
[pip3] triton==2.0.0
[conda] numpy                     1.24.3                   pypi_0    pypi
[conda] torch                     2.0.1                    pypi_0    pypi
[conda] torch-tb-profiler         0.4.1                    pypi_0    pypi
[conda] torchdata                 0.6.1                    pypi_0    pypi
[conda] torchmetrics              0.11.4                   pypi_0    pypi
[conda] torchsnapshot-nightly     2023.3.15                pypi_0    pypi
[conda] torchvision               0.15.2                   pypi_0    pypi
[conda] torchx-nightly            2023.5.25                pypi_0    pypi
[conda] triton                    2.0.0                    pypi_0    pypi
@chenxingyu-cs
Author

@ejguan Hi, can you share any insights you have on this? Thanks a lot!

@ejguan
Contributor

ejguan commented Jun 20, 2023

Are you running multiple DDP jobs at the same time?

@chenxingyu-cs
Author

@ejguan I'm only running one DDP job, which is launched by torchx. I got these errors while running the job on AWS Batch and on SageMaker, where I believe the instances are isolated and no other jobs should be running on them.
