fix(deepspeed): chunk ZeRO-3 missing-key param gather to avoid OOM #46918

Open
itxsamad1 wants to merge 1 commit into huggingface:main from itxsamad1:fix/zero3-chunked-missing-keys-init
@itxsamad1

Summary

  • Fixes an out-of-memory (OOM) failure in from_pretrained under DeepSpeed ZeRO-3 when many missing parameters need initialization (e.g. large sparse MoE models like MiniMax-M3).
  • Instead of gathering all uninitialized parameters in one GatheredParameters context, gathers them in chunks so that peak rank-0 memory is bounded by the chunk size.

Context

When loading large sparse MoE models under ZeRO-3, _initialize_missing_keys coalesced every uninitialized parameter into a single GatheredParameters all-gather. For models with packed expert weights, this can re-materialize hundreds of GB on rank 0 and trigger an OOM.
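
For readers who want the idea in code, here is a minimal sketch of chunked gathering. It is not the PR's diff: the names `iter_chunks`, `initialize_in_chunks`, `gather_ctx`, and `chunk_size` are hypothetical, and chunking by parameter count is an assumption (a byte budget per chunk may be a better fit for packed expert weights). The gather context is passed in as a factory so the sketch runs without DeepSpeed installed.

```python
from contextlib import nullcontext
from typing import Callable, ContextManager, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_chunks(items: Sequence[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` holding at most `chunk_size` elements."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


def initialize_in_chunks(
    params: Sequence[T],
    init_fn: Callable[[T], None],
    gather_ctx: Callable[[List[T]], ContextManager] = lambda chunk: nullcontext(),
    chunk_size: int = 64,  # hypothetical default, not taken from the PR
) -> None:
    """Initialize `params` while only one chunk is gathered at a time.

    Under ZeRO-3, `gather_ctx` would presumably be something like
    `lambda chunk: deepspeed.zero.GatheredParameters(chunk, modifier_rank=0)`,
    so each chunk is re-partitioned when its `with` block exits and before
    the next chunk is gathered.
    """
    for chunk in iter_chunks(params, chunk_size):
        with gather_ctx(chunk):
            for param in chunk:
                init_fn(param)
```

With the default no-op `gather_ctx`, the function simply applies `init_fn` to every parameter in order, which keeps it usable outside a ZeRO-3 run. The point of the structure is that the number of parameters materialized at once never exceeds `chunk_size`, instead of growing with the total number of missing keys.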

Test plan

Fixes #46822

When initializing missing weights under DeepSpeed ZeRO-3, gathering all uninitialized parameters in a single GatheredParameters context can OOM on large sparse MoE models. Gather parameters in bounded chunks instead so peak rank-0 memory stays bounded by the chunk size.

Fixes huggingface#46822
@github-actions

Contributor

CI Dashboard: View test results in Grafana

Development

Successfully merging this pull request may close these issues.

ZeRO-3 zero.Init does not partition composite minimax_m3_vl language submodule -> OOM on multi-GPU load