How to do checkpointing when optimizer can only use distributed state dict #19374
Unanswered · RuABraun asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
I'm doing multi-node, multi-GPU training with an optimizer that requires using `distributed_state_dict()` and `load_distributed_state_dict()`. I think I could handle the loading by overriding `load_from_checkpoint()`, but for saving it seems I might have to subclass the `Strategy` class. Is that correct, or is there an easier way?
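One possible alternative to subclassing `Strategy` would be the `on_save_checkpoint` / `on_load_checkpoint` hooks that Lightning exposes on `LightningModule`, which receive the checkpoint dict and let you swap in the optimizer state yourself. Below is a minimal, self-contained sketch of that pattern: Lightning itself is stubbed out (a plain class stands in for `pl.LightningModule`, and `FakeDistributedOptimizer` is a hypothetical stand-in for an optimizer that only supports distributed state dicts, with the method names taken from the question). It shows only the hook mechanics, not a verified Lightning integration.

```python
class FakeDistributedOptimizer:
    """Hypothetical stand-in for an optimizer that only exposes
    distributed_state_dict() / load_distributed_state_dict()."""

    def __init__(self):
        self._state = {"step": 0}

    def distributed_state_dict(self):
        return dict(self._state)

    def load_distributed_state_dict(self, sd):
        self._state = dict(sd)


class MyModule:  # in practice: class MyModule(pl.LightningModule)
    def __init__(self):
        self.optimizer = FakeDistributedOptimizer()

    # Lightning calls this hook with the checkpoint dict before it is
    # written to disk, so the default optimizer state can be replaced.
    def on_save_checkpoint(self, checkpoint):
        checkpoint["optimizer_states"] = [self.optimizer.distributed_state_dict()]

    # Lightning calls this hook with the checkpoint dict after loading,
    # so the distributed state can be restored manually.
    def on_load_checkpoint(self, checkpoint):
        self.optimizer.load_distributed_state_dict(checkpoint["optimizer_states"][0])


# Simulated save/load round trip (no Trainer involved):
m = MyModule()
m.optimizer._state["step"] = 42
ckpt = {}
m.on_save_checkpoint(ckpt)

m2 = MyModule()
m2.on_load_checkpoint(ckpt)
print(m2.optimizer.distributed_state_dict()["step"])  # → 42
```

Whether this suffices depends on the optimizer: if `distributed_state_dict()` performs collectives, every rank must reach the hook, which the default checkpointing flow may not guarantee on non-zero ranks.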