Unable to train using 1 GPU #234

Open · amira-essawy opened this issue Feb 9, 2023 · 5 comments

amira-essawy commented Feb 9, 2023

I am trying to run train.py on a single GPU with this command:

python tools/train.py configs/soft_teacher/soft_teacher_faster_rcnn_r50_caffe_fpn_coco_full_720k.py --gpus 1 --cfg-options fold=1 percent=10

Training started and ran until iteration 4000, then stopped with the error below. I am facing this problem on both the COCO dataset and my custom dataset.

2023-02-09 13:00:26,292 - mmdet.ssod - INFO - Saving checkpoint at 4000 iterations
2023-02-09 13:00:36,802 - mmdet.ssod - INFO - Exp name: cv3.py
2023-02-09 13:00:36,803 - mmdet.ssod - INFO - Iter [4000/1080000] lr: 1.000e-02, eta: 9598 days, 18:19:02, time: 15.415, data_time: 0.941, memory: 6573, ema_momentum: 0.9990, sup_loss_rpn_cls: 0.0315, sup_loss_rpn_bbox: 0.0125, sup_loss_cls: 0.0654, sup_acc: 97.9980, sup_loss_bbox: 0.0812, loss: 0.1906
Traceback (most recent call last):
  File "tools/train.py", line 198, in <module>
    main()
  File "tools/train.py", line 186, in main
    train_detector(
  File "/root/workspace/amiras/SoftTeacher/ssod/apis/train.py", line 206, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 144, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 70, in train
    self.call_hook('after_train_iter')
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 317, in call_hook
    getattr(hook, fn_name)(self)
  File "/root/workspace/amiras/SoftTeacher/ssod/utils/hooks/submodules_evaluation.py", line 38, in after_train_iter
    self._do_evaluate(runner)
  File "/root/workspace/amiras/SoftTeacher/ssod/utils/hooks/submodules_evaluation.py", line 52, in _do_evaluate
    dist.broadcast(module.running_var, 0)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1399, in broadcast
    default_pg = _get_default_group()
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 584, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Is the code not compatible with a single GPU?
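The failing call in the traceback is dist.broadcast inside SubModulesDistEvalHook. As a rough illustration (standalone sketch, not code from this repo), torch.distributed collectives raise exactly this RuntimeError whenever init_process_group has not been called, which is the case in a plain single-GPU run of tools/train.py:

# Standalone sketch reproducing the same RuntimeError: broadcast() is a
# torch.distributed collective and requires an initialized process group.
import torch
import torch.distributed as dist

running_var = torch.ones(256)  # illustrative tensor, like a BatchNorm buffer
dist.broadcast(running_var, 0)  # RuntimeError: Default process group has not been initialized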

yjcreation commented

@amira-essawy How did you solve it? I ran into the same problem.

amira-essawy (Author) commented

@yjcreation I changed a parameter in the base config file, on this line:

evaluation = dict(type="SubModulesDistEvalHook", interval=4000)
I removed the type key.
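The line then looks like this (presumably, with no explicit type, MMDetection falls back to its default non-distributed evaluation hook, so no torch.distributed calls are made on a single GPU):

# before: evaluation = dict(type="SubModulesDistEvalHook", interval=4000)
# after: no "type" key, default (non-distributed) evaluation hook is used
evaluation = dict(interval=4000)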

ERGOWHO commented Jun 1, 2023

Thank you, this really helps.

tarunsharma1 commented Jun 12, 2023

@amira-essawy @ERGOWHO Were you able to successfully train and obtain good results using a single GPU?

Re-dot-art commented


Can a single-GPU run also achieve good results?
