Training killed without error shortly after starting #94
Comments
Same question. Have you solved this?
I am training the model on two nodes, each with 4 A100 GPUs, with a batch size of 32. Shortly after training starts, the process gets killed without any error message. Any clue what the reason might be?
I set the number of nodes by adding `num_nodes: 2` to `train_cldm.yaml` and requested a pull afterwards. I then executed the following command:

`python train.py --config configs/train_cldm.yaml`
Here is the output of the GPU and CUDA check I ran right after importing PyTorch:
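For context, a GPU/CUDA sanity check like the one mentioned above can be done with a few standard PyTorch calls. This is a minimal sketch, not the exact script used in the report:

```python
# Minimal GPU/CUDA availability check, run right after importing PyTorch.
# On a healthy multi-GPU node this should report CUDA as available and
# list each visible device.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("Visible GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))
```

If this prints the expected devices but training is still killed with no Python traceback, the process may be terminated externally (for example by the OS or a job scheduler), so checking system logs alongside this output can help narrow things down.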