How to do checkpointing when optimizer can only use distributed state dict #19374
Unanswered · RuABraun asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
I'm doing multi-node, multi-GPU training with an optimizer that requires using `distributed_state_dict()` and `load_distributed_state_dict()`. I think I could handle the loading by overriding `load_from_checkpoint()`, but for saving it seems I might have to subclass the `Strategy` class. Is that correct, or is there an easier way?
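One possible alternative to subclassing `Strategy` would be the `on_save_checkpoint` / `on_load_checkpoint` hooks that Lightning exposes on `LightningModule`, which receive the checkpoint dict and let you swap in the optimizer state yourself. Below is a minimal, self-contained sketch of that pattern: Lightning itself is stubbed out (a plain class stands in for `pl.LightningModule`, and `FakeDistributedOptimizer` is a hypothetical stand-in for an optimizer that only supports distributed state dicts, with the method names taken from the question). It shows only the hook mechanics, not a verified Lightning integration.

```python
class FakeDistributedOptimizer:
    """Hypothetical stand-in for an optimizer that only exposes
    distributed_state_dict() / load_distributed_state_dict()."""

    def __init__(self):
        self._state = {"step": 0}

    def distributed_state_dict(self):
        return dict(self._state)

    def load_distributed_state_dict(self, sd):
        self._state = dict(sd)


class MyModule:  # in practice: class MyModule(pl.LightningModule)
    def __init__(self):
        self.optimizer = FakeDistributedOptimizer()

    # Lightning calls this hook with the checkpoint dict before it is
    # written to disk, so the default optimizer state can be replaced.
    def on_save_checkpoint(self, checkpoint):
        checkpoint["optimizer_states"] = [self.optimizer.distributed_state_dict()]

    # Lightning calls this hook with the checkpoint dict after loading,
    # so the distributed state can be restored manually.
    def on_load_checkpoint(self, checkpoint):
        self.optimizer.load_distributed_state_dict(checkpoint["optimizer_states"][0])


# Simulated save/load round trip (no Trainer involved):
m = MyModule()
m.optimizer._state["step"] = 42
ckpt = {}
m.on_save_checkpoint(ckpt)

m2 = MyModule()
m2.on_load_checkpoint(ckpt)
print(m2.optimizer.distributed_state_dict()["step"])  # → 42
```

Whether this suffices depends on the optimizer: if `distributed_state_dict()` performs collectives, every rank must reach the hook, which the default checkpointing flow may not guarantee on non-zero ranks.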