Multi-gpu problem: Running stuck when -ntasks-per-node larger than 1 #170


Hi - for distributed training, you'll need to launch the script with srun, so the line in your batch file should be changed to "srun python /path/to/run_train.py [args...]". Can you give that a try with --ntasks-per-node set to 2?

If you still have trouble, it may be due to the environment variables you're setting. The distributed environment will be configured automatically, so you shouldn't need to manually set the port/address/world size. (For reference, there's a Slurm template in the multi-GPU branch under scripts/distributed_example.sbatch.)
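For reference, a minimal sketch of what such a batch file could look like is below. This is not the distributed_example.sbatch template from the multi-GPU branch; the job name, resource counts, time limit, and environment activation line are placeholder assumptions you'd adjust for your cluster.

```bash
#!/bin/bash
#SBATCH --job-name=mace-multi-gpu     # placeholder job name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2           # one task per GPU
#SBATCH --gres=gpu:2                  # request 2 GPUs on the node
#SBATCH --cpus-per-task=8             # placeholder, adjust for your cluster
#SBATCH --time=24:00:00               # placeholder time limit

# Activate your own Python environment here (placeholder):
# source activate mace-env

# No need to export MASTER_ADDR / MASTER_PORT / WORLD_SIZE by hand;
# the distributed environment is configured automatically from the Slurm job.

srun python /path/to/run_train.py [args...]
```

The key point is the srun prefix: it launches one Python process per task, so with --ntasks-per-node=2 you get the two ranks needed for the two GPUs.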

Data-wise, both hdf5 and xyz will work - the former might be faster.

Answer selected by ilyes319
This discussion was converted from issue #168 on September 22, 2023 15:24.