delock (Collaborator) commented Oct 1, 2025

This PR adds a blog/lab studying ZenFlow and ZeRO-Offload performance with DeepSpeed CPU core binding.

delock requested a review from Copilot on October 1, 2025 15:20

Copilot AI left a comment

Pull Request Overview

This PR introduces a comprehensive blog/lab study examining ZenFlow technology and its performance improvements with DeepSpeed CPU core binding. The study investigates how CPU core binding affects both ZeRO Offload and ZenFlow performance, documenting specific optimizations and measurement results.

  • Documents ZenFlow technology and its relationship with DeepSpeed CPU core binding
  • Presents performance testing results comparing different core binding strategies
  • Introduces improvements to ZenFlow's core binding mechanism developed in collaboration with ZenFlow authors
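To make "core binding strategy" concrete, here is a minimal Python sketch of one common strategy, under stated assumptions: the cores the process may use are split into contiguous, non-overlapping slices, one per local rank, and each process pins itself to its slice with Linux's `os.sched_setaffinity`. The helper names `partition_cores` and `bind_current_process` are hypothetical; this is not DeepSpeed's launcher code, nor the improved ZenFlow binding described in the PR.

```python
import os
from typing import List, Sequence


def partition_cores(cores: Sequence[int], num_ranks: int) -> List[List[int]]:
    """Split `cores` into `num_ranks` contiguous, non-overlapping slices.

    Hypothetical helper. When len(cores) is not divisible by num_ranks,
    the leftover cores go one each to the lowest ranks.
    """
    if num_ranks <= 0:
        raise ValueError("num_ranks must be positive")
    if len(cores) < num_ranks:
        raise ValueError("need at least one core per rank")
    base, extra = divmod(len(cores), num_ranks)
    slices, start = [], 0
    for rank in range(num_ranks):
        size = base + (1 if rank < extra else 0)
        slices.append(list(cores[start:start + size]))
        start += size
    return slices


def bind_current_process(local_rank: int, num_ranks: int) -> List[int]:
    """Pin this process to its slice of the currently allowed cores (Linux only)."""
    available = sorted(os.sched_getaffinity(0))
    my_cores = partition_cores(available, num_ranks)[local_rank]
    os.sched_setaffinity(0, my_cores)
    return my_cores


if __name__ == "__main__":
    # Example: 10 cores shared by 4 local ranks.
    print(partition_cores(range(10), 4))  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

Giving each rank its own slice keeps one rank's CPU-side work (for example, an offloaded optimizer step) from competing with another rank's for the same cores; how much that matters in practice is what the study measures.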

delock force-pushed the gma/zenflow_binding_study branch 2 times, most recently from a8a1be5 to 92d4d21, October 1, 2025 15:26

sfc-gh-truwase (Collaborator)

@delock thanks for the PR. Can you please modify https://github.com/deepspeedai/DeepSpeed/blob/master/docs/index.md in order to update the Latest News section of the home page?

delock and others added 19 commits October 3, 2025 11:23
Signed-off-by: Guokai Ma <[email protected]>
This PR adds a blog post for SuperOffload. More specifically, the blog
covers the design and motivation behind SuperOffload, comparisons with
previous approaches, key experiences and insights, and guidance on
enabling and using SuperOffload.

See also:
- [PR#7559](#7559): SuperOffload implementation.
- [PR#990](deepspeedai/DeepSpeedExamples#990): Examples.

---------

Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Guokai Ma <[email protected]>
Since `make format` generates a `venv` directory, it should be added to
`.gitignore`.

Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Guokai Ma <[email protected]>
This PR improves state management for DeepCompile in the engine.

Previously, the system relied only on the config flag indicating whether
DeepCompile was enabled. However, DeepCompile is actually activated only
when `compile()` is called. This meant that if DeepCompile was enabled
in the config but `compile()` was never called, it could lead to invalid
internal states (as shown in #7598).

Since `enabled == True` should be interpreted as an option that modifies
the behavior of `compile()`, this PR introduces clearer state
management:
- If `compile()` is not called, the DeepCompile config has no effect on
behavior; a one-time message is shown instead.
- A new state, "DeepCompile activated", is introduced. It represents the
condition where DeepCompile is both enabled in the config and `compile()`
has been called.

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Guokai Ma <[email protected]>
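The state management described in this commit can be modeled as a small sketch. The class `EngineSketch` and its attribute and method names are hypothetical and only illustrate the rule "activated = enabled in config AND `compile()` called"; this is not the DeepSpeed engine's code.

```python
class EngineSketch:
    """Hypothetical model of the DeepCompile states described above; not DeepSpeed code."""

    def __init__(self, deepcompile_enabled: bool):
        self.deepcompile_enabled = deepcompile_enabled  # value of the config flag
        self.compile_called = False
        self.messages = []  # stand-in for log output
        self._notified = False

    @property
    def deepcompile_activated(self) -> bool:
        # "Activated" requires both the config flag and an explicit compile() call.
        return self.deepcompile_enabled and self.compile_called

    def compile(self) -> None:
        self.compile_called = True

    def forward_mode(self) -> str:
        # Enabled but not compiled: the config has no effect; notify only once.
        if self.deepcompile_enabled and not self.compile_called and not self._notified:
            self.messages.append(
                "DeepCompile is enabled in the config, but compile() was not called; ignoring it.")
            self._notified = True
        if self.deepcompile_activated:
            return "deepcompile"
        return "compiled" if self.compile_called else "eager"
```

With `deepcompile_enabled=True`, calling `forward_mode()` before `compile()` returns `"eager"` and records a single message no matter how often it is called; after `compile()` it returns `"deepcompile"`. Deriving "activated" from the two inputs, rather than storing it separately, keeps the states from drifting apart.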
PR #6993 replaces the flat IPG buffers with a dict maintaining
type-indexed buckets. The member is also renamed from
`_ipg_bucket_flat_buffer` to `ipg_buckets`.

Update the bucket clearing logic in `init_z3` accordingly.

Signed-off-by: Junjie Mao <[email protected]>
Signed-off-by: Guokai Ma <[email protected]>
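As a rough illustration of the data-structure change this commit adapts to, here is a minimal sketch of type-indexed buckets stored in a dict, where clearing visits every per-type bucket instead of resetting one flat buffer. `IPGBucketsSketch` is hypothetical; dtypes are plain strings and gradients are plain floats so the example runs without PyTorch, and it does not reproduce DeepSpeed's actual `ipg_buckets` layout.

```python
from collections import defaultdict
from typing import DefaultDict, List


class IPGBucketsSketch:
    """Hypothetical type-indexed gradient buckets; not DeepSpeed internals."""

    def __init__(self) -> None:
        # One bucket per gradient dtype instead of a single flat buffer.
        self.ipg_buckets: DefaultDict[str, List[float]] = defaultdict(list)

    def add(self, dtype: str, grads: List[float]) -> None:
        self.ipg_buckets[dtype].extend(grads)

    def clear(self) -> None:
        # Reset every per-dtype bucket; the dtype keys are kept for reuse.
        for bucket in self.ipg_buckets.values():
            bucket.clear()
```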
Polish SuperOffload blog post; minor grammar and style fixes

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Guokai Ma <[email protected]>
… when the world size expands (#7599)

When the world size expands from 2 to 4 and the checkpoint is converted
to a universal checkpoint and then loaded, a new rank (for example, rank 3)
tries to load the model file `zero_pp_rank_3_mp_rank_00_model_states.pt`.
But this file was not produced by the previous run.
For ZeRO stage 3, just load the first file, that is,
`zero_pp_rank_0_mp_rank_00_model_states.pt`.
The existing unit test
`TestZeROUniversalCheckpointDP::test_dp_world_size_2to4` reproduces this
problem.

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Guokai Ma <[email protected]>
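The file-selection rule in this commit message ("for stage 3, just load the first file") can be sketched as a small helper. `model_states_file` is a hypothetical name, and the file-name pattern is taken from the commit message; the real loader in DeepSpeed may be organized differently.

```python
import os


def model_states_file(ckpt_dir: str, dp_rank: int, zero_stage: int, mp_rank: int = 0) -> str:
    """Pick the model-states file to load from a checkpoint directory (hypothetical helper)."""
    if zero_stage == 3:
        # Under stage 3, every rank loads the first file, because ranks added
        # by a world-size expansion have no file of their own.
        dp_rank = 0
    name = f"zero_pp_rank_{dp_rank}_mp_rank_{mp_rank:02d}_model_states.pt"
    return os.path.join(ckpt_dir, name)
```

For example, after expanding from 2 to 4 ranks, rank 3 under stage 3 resolves to `zero_pp_rank_0_mp_rank_00_model_states.pt` instead of the nonexistent `zero_pp_rank_3_...` file, while other stages keep using each rank's own file.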
delock and others added 5 commits October 3, 2025 11:23
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Guokai Ma <[email protected]>
delock force-pushed the gma/zenflow_binding_study branch from 891454d to b21cb82, October 3, 2025 03:24
delock requested a review from tohtana as a code owner October 3, 2025 03:24

delock (Collaborator, Author) commented Oct 3, 2025

Hi @sfc-gh-truwase, the link from the index page has been added, along with other wording fixes. Thanks for the suggestions!

@@ -0,0 +1,199 @@
# [LAB] Study of ZenFlow and ZeRO offload performance with DeepSpeed CPU core binding
Collaborator

@delock what does the tag [LAB] stand for? Is it necessary?

delock (Collaborator, Author), Oct 3, 2025

Hi @sfc-gh-truwase, the intention of the LAB tag is to differentiate this post from a blog that describes a new feature or novel work. A LAB focuses more on unique use cases of, and perspectives on, existing DeepSpeed features. I would suggest reproducibility as one of the criteria for a LAB: readers should be able to reproduce LAB results and gain a new understanding of DeepSpeed features. That is the purpose of a LAB. I'd like to hear your thoughts on a proper tag name.

6 participants