torch.distributed.elastic.multiprocessing.errors.ChildFailedError (exitcode: -9) keeps occurring at the same progress point #770

a241s opened this issue Jan 21, 2025 · 3 comments

a241s commented Jan 21, 2025

Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in the FAQ?

  • I have searched the FAQ

Current Behavior

During LoRA fine-tuning, the same error is always raised at exactly 40% progress.
The error message is as follows:
torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -9) local_rank: 0 (pid: 1322) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

finetune.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : (omitted)
host : autodl-container-8a9c4bbf6f-3c104f54
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 1322)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 1322
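
For reference, exitcode -9 means the worker process received SIGKILL from the operating system rather than raising a Python exception; on Linux this is most often the kernel OOM killer reclaiming host RAM, which can usually be confirmed in `dmesg` output on the host. Below is a minimal sketch for correlating the crash point with memory growth, assuming `psutil` is installed (it is not part of the original setup):

```python
# Minimal sketch (assumption: psutil is installed; not part of the original repro).
# Logs this process's RSS and the host's available RAM every 30 s so an OOM kill
# can be correlated with memory growth in the training log.
import os
import threading
import time

import psutil


def log_memory(interval_s: int = 30) -> None:
    proc = psutil.Process(os.getpid())
    while True:
        rss_gb = proc.memory_info().rss / 1e9                  # resident memory of this process
        avail_gb = psutil.virtual_memory().available / 1e9     # remaining host RAM
        print(f"[mem] rss={rss_gb:.1f} GB available={avail_gb:.1f} GB", flush=True)
        time.sleep(interval_s)


# Start as a daemon thread before training begins, e.g. at the top of finetune.py.
threading.Thread(target=log_memory, daemon=True).start()
```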

Expected Behavior

Training should continue instead of being stopped by this error.

Steps To Reproduce

No response

Environment

- OS: Ubuntu 22.04.1 LTS
- Python: 3.10.8
- Transformers: 4.40.0
- PyTorch: 2.3.1+cu118
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 11.8

Anything else?

No response

YuzaChongyi (Collaborator) commented:

Which fine-tuning script are you using for training? The current information is not enough to pinpoint the cause; if it consistently fails at the same stage, it looks more like a data problem.
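
If the data is iterated in a fixed order, a single oversized or malformed sample can exhaust memory at the same step of every run. Below is a hypothetical sketch for spotting outlier samples, assuming the training data is a JSON list with a `conversations` field; the file path and field names are placeholders, not the actual schema of the LoRA script:

```python
# Hypothetical sketch for locating an outlier sample in the training data.
# Assumptions: the dataset is a JSON list; "train.json", "conversations" and
# "value" are placeholders to adapt to the real schema.
import json

with open("train.json", "r", encoding="utf-8") as f:   # placeholder path
    data = json.load(f)

sizes = []
for idx, sample in enumerate(data):
    text_len = sum(len(turn.get("value", "")) for turn in sample.get("conversations", []))
    sizes.append((text_len, idx))

# Print the ten largest samples; with a fixed data order, the index near the
# 40% mark of an epoch is the first suspect.
for text_len, idx in sorted(sizes, reverse=True)[:10]:
    print(f"index={idx} total_text_chars={text_len}")
```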

a241s (Author) commented Jan 26, 2025

> Which fine-tuning script are you using for training? The current information is not enough to pinpoint the cause; if it consistently fails at the same stage, it looks more like a data problem.

The LoRA script. It recently got as far as 50%, and the exitcode changed to 1.
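
exitcode 1 usually means the worker raised an ordinary Python exception, but `error_file: <N/A>` in the report shows that the child's traceback is not being captured. Wrapping the script's entry point with the `@record` decorator from `torch.distributed.elastic.multiprocessing.errors` makes torchrun write the worker's traceback into the error report. A sketch, assuming `finetune.py` exposes a `main()` entry point:

```python
# Sketch: capture the worker's own traceback so an exitcode-1 failure reports
# the real error, assuming finetune.py has a main() entry point (adjust to the
# actual structure of the script).
from torch.distributed.elastic.multiprocessing.errors import record


@record
def main():
    ...  # existing training code of finetune.py


if __name__ == "__main__":
    main()
```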

qyc-98 (Collaborator) commented Feb 6, 2025

Hi, you could try training on other data first to test; at the moment this looks like a data problem.
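
One quick way to run that test is to train on a small, shuffled subset: if the crash no longer occurs at a fixed progress point, or disappears entirely, the original data is the likely cause. A hypothetical sketch, with placeholder file names:

```python
# Hypothetical sketch: build a small shuffled subset to check whether the crash
# still appears at a fixed progress point. File names are placeholders.
import json
import random

with open("train.json", "r", encoding="utf-8") as f:
    data = json.load(f)

random.seed(0)
random.shuffle(data)

with open("train_subset.json", "w", encoding="utf-8") as f:
    json.dump(data[:500], f, ensure_ascii=False)
```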
