Lightning Fabric fails in AWS sagemaker jupyter notebook #20178
Unanswered
lrnilingy asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
Dear all, I am a new user of Fabric, and I am trying to make it work in an AWS SageMaker Jupyter notebook.
The instance type is ml.g5.12xlarge, which has 4 GPUs (NVIDIA A10G). It has a preinstalled torch conda environment with CUDA drivers, torch 2.2, etc.
However, after making the changes described in the documentation, I ran all the cells of the Jupyter notebook and got errors from fabric.launch(train).
I am confused, since CUDA, the GPUs, and the instance itself should not be the problem. In fact, I was previously using Hugging Face Accelerate, and that package worked on this instance.
Am I missing something? Thank you very much.
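For reference, the kind of notebook launch described above could look roughly like the sketch below. It is a minimal illustration, not the code from this post: the toy `train` body, the helper names `pick_strategy` and `run_in_notebook`, and the choice of the "ddp_notebook" strategy are all assumptions made for the example.

```python
def pick_strategy(interactive: bool, devices: int) -> str:
    """Hypothetical helper: choose a Fabric strategy name for this session."""
    # With a single device there is nothing to distribute, so let Fabric decide.
    if devices <= 1:
        return "auto"
    # "ddp_notebook" is Fabric's strategy name for interactive sessions;
    # "ddp" is the usual choice when running a plain script.
    return "ddp_notebook" if interactive else "ddp"


def train(fabric):
    """Toy training step; Fabric passes itself as the first argument."""
    import torch  # local import keeps the sketch importable without torch

    model = torch.nn.Linear(8, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    # fabric.setup moves the model to the right device and wraps it for DDP.
    model, optimizer = fabric.setup(model, optimizer)

    batch = torch.randn(16, 8, device=fabric.device)
    loss = model(batch).pow(2).mean()
    fabric.backward(loss)  # used instead of loss.backward()
    optimizer.step()
    fabric.print(f"world_size={fabric.world_size} loss={loss.item():.4f}")


def run_in_notebook(devices: int = 4) -> None:
    """Create Fabric and launch `train` from a notebook cell (assumed setup)."""
    from lightning.fabric import Fabric

    fabric = Fabric(
        accelerator="cuda",
        devices=devices,
        strategy=pick_strategy(interactive=True, devices=devices),
    )
    fabric.launch(train)
```

Calling `run_in_notebook()` from a cell would start the launch. One thing that may be worth checking, offered as a possibility rather than a diagnosis: notebook launches fork the kernel process, and forking generally does not work once CUDA has been initialized in that process, so cells that touch the GPU before `fabric.launch` could interfere.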
Some environment info:
Yesterday I was using lightning v2.3.3, and it failed as well.
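For anyone reproducing this, the relevant versions can be gathered with a short script. This is a minimal sketch using only the standard library; the helper name `collect_env_info` is hypothetical, and it assumes the packages are installed under the distribution names "torch" and "lightning".

```python
import platform
import sys
from importlib import metadata


def collect_env_info(packages=("torch", "lightning")):
    """Return a dict with the Python version, platform, and package versions."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of raising.
            info[name] = "not installed"
    return info


for key, value in collect_env_info().items():
    print(f"{key}: {value}")
```

Running this in the same kernel as the notebook matters, since a SageMaker instance can have several conda environments with different versions.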