
Conversation

@zhengchenyu
Contributor

ds_secondary_tensor may be dirty during model loading or zero checkpointing for zero++.

1. Loading the model

My task is SFT with transformers. In the transformers code, initialization is done with code like the following:

with deepspeed.zero.Init():
    model = xxx  # model construction; parameters are partitioned as they are created

After this, each param is already a DeepSpeed-partitioned tensor, meaning both ds_tensor and ds_secondary_tensor exist. Then load_model is called to reload the model:

with deepspeed.zero.GatheredParameters(params_to_gather, modifier_rank=0):
    if torch.distributed.get_rank() == 0:
        module._load_from_state_dict(*args)

In GatheredParameters.__exit__, params[0].partition is called with has_been_updated set to True, indicating that the data needs to be updated. However, _partition does not pass has_been_updated on to _partition_param_sec. As a result, ds_secondary_tensor becomes dirty.
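To illustrate the mechanism, here is a minimal, self-contained sketch (illustrative names and plain lists, not DeepSpeed's actual implementation) of how a partition call that does not forward has_been_updated leaves the secondary copy stale:

```python
class Param:
    def __init__(self, data):
        self.data = list(data)
        self.ds_tensor = list(data)            # primary partition
        self.ds_secondary_tensor = list(data)  # ZeRO++ secondary partition

    def _partition(self, has_been_updated):
        if has_been_updated:
            self.ds_tensor = list(self.data)   # primary is refreshed
        # BUG: the flag is not forwarded, so the secondary cache keeps old data
        self._partition_param_sec()

    def _partition_param_sec(self, has_been_updated=False):
        if has_been_updated:
            self.ds_secondary_tensor = list(self.data)

p = Param([0.0] * 4)
p.data = [1.0] * 4               # weights modified under GatheredParameters
p._partition(has_been_updated=True)
print(p.ds_tensor)               # [1.0, 1.0, 1.0, 1.0]
print(p.ds_secondary_tensor)     # [0.0, 0.0, 0.0, 0.0] -> dirty
```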

2. Loading the zero checkpoint

The zero checkpoint is loaded into fp16_partitioned_groups_flat, which means param.ds_tensor has been updated. However, the data in param.ds_secondary_tensor has not been updated, and the next allgather will use the dirty param.ds_secondary_tensor.

A dirty ds_secondary_tensor can lead to abnormal loss. After invalidate_secondary_tensor is called in _post_step, the loss returns to normal. This is why the loss anomaly only occurs during the first steps.
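As a sketch of why invalidation repairs this case (simplified stand-ins for DeepSpeed's internals; allgather_source is a hypothetical helper standing in for the allgather path): once the checkpoint overwrites the primary partition, dropping the secondary cache forces the next allgather back onto clean data.

```python
class Param:
    def __init__(self, data):
        self.ds_tensor = list(data)            # primary partition
        self.ds_secondary_tensor = list(data)  # ZeRO++ secondary partition

    def invalidate_secondary_tensor(self):
        self.ds_secondary_tensor = None

    def allgather_source(self):
        # prefer the secondary cache when it is valid, else fall back to primary
        if self.ds_secondary_tensor is not None:
            return self.ds_secondary_tensor
        return self.ds_tensor

p = Param([0.0] * 4)
p.ds_tensor = [1.0] * 4          # checkpoint loaded into the primary partition
print(p.allgather_source())      # [0.0, 0.0, 0.0, 0.0] -> stale secondary is used
p.invalidate_secondary_tensor()  # the fix
print(p.allgather_source())      # [1.0, 1.0, 1.0, 1.0] -> clean primary is used
```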

Related issue: #7606

@zhengchenyu
Contributor Author

This picture shows that the bug has been fixed. The experimental conditions for the fix are exactly the same as those for the bug in #7606; the only difference is that the code under test includes this PR.

[Screenshot 2025-11-27 11:10:55]

@zhengchenyu zhengchenyu deleted the issue-7606 branch November 27, 2025 10:18
…ero checkpoint for zero++.

Signed-off-by: zhengchenyu <[email protected]>
@sfc-gh-truwase
Collaborator

@zhengchenyu thanks for the PR. We are taking a look.

@zhengchenyu
Contributor Author

zhengchenyu commented Nov 28, 2025

The unit test test_compile_zero.py::TestDeepCompile::test[True-1-dtype0] failed. This seems to be unrelated to this PR. Despite multiple runs on my own server, I am still unable to reproduce the failure.

@sfc-gh-truwase
Collaborator

@zhengchenyu thanks for this PR. My opinion is that invalidating the secondary tensor is the correct solution in both of these cases. So I am aligned with your solution for loading zero checkpoints.

For model loading and other cases of deepspeed.zero.GatheredParameters(..., modifier_rank != None), how about calling invalidate_secondary_tensor() here?

What do you think?

For context, ds_secondary_tensor is meant to be a cache of gathered params from a previous forward pass. Therefore, for safety, it should be invalidated whenever model weights change in any way.
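The suggestion can be sketched as follows (simplified stand-in classes, not the real deepspeed.zero.GatheredParameters): on context exit with a modifier_rank set, the weights may have changed, so the secondary cache is dropped for safety.

```python
class Param:
    def __init__(self, data):
        self.data = list(data)
        self.ds_tensor = list(data)
        self.ds_secondary_tensor = list(data)

    def partition(self, has_been_updated):
        if has_been_updated:
            self.ds_tensor = list(self.data)

    def invalidate_secondary_tensor(self):
        self.ds_secondary_tensor = None

class GatheredParameters:
    def __init__(self, params, modifier_rank=None):
        self.params = params
        self.src_rank = modifier_rank

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        updated = self.src_rank is not None
        for p in self.params:
            p.partition(has_been_updated=updated)
            if updated:
                p.invalidate_secondary_tensor()  # weights may have changed

p = Param([0.0] * 2)
with GatheredParameters([p], modifier_rank=0):
    p.data = [1.0] * 2  # rank 0 modifies the weights
print(p.ds_tensor, p.ds_secondary_tensor)  # [1.0, 1.0] None
```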

@zhengchenyu
Contributor Author

zhengchenyu commented Dec 3, 2025

@sfc-gh-truwase
I think your suggestion is correct. In fact, I initially solved the problem by using invalidate_secondary_tensor.

However, I found the root cause was that _partition did not pass the has_been_updated parameter when calling self._partition_param_sec, causing ds_secondary_tensor to become dirty. Since broadcast is used to ensure each parameter is correct, it is safe not to call invalidate_secondary_tensor here.

In fact, we have two solutions to this problem:

  • (1) Invalidate the secondary tensor whenever model weights change.

Your solution maintains consistent logic: if the weights change, invalidate the secondary tensor.

If that's the case, I think this place might also need to invalidate the secondary tensor.

However, the drawback of this approach is that it discards a still-useful secondary tensor.

  • (2) Pass has_been_updated to _partition_param_sec in _partition.

This approach avoids wasting the secondary tensor when the parameter has already been broadcast.

In fact, I think both are OK, but I prefer (2). However, if you think we need to maintain consistent logic, I will change it to (1).
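Option (2) can be sketched like this (again with simplified stand-ins for DeepSpeed's internals, not the actual code): forwarding the flag refreshes both partitions in the same place, and leaves the cache intact when nothing changed.

```python
class Param:
    def __init__(self, data):
        self.data = list(data)
        self.ds_tensor = list(data)            # primary partition
        self.ds_secondary_tensor = list(data)  # ZeRO++ secondary partition

    def _partition(self, has_been_updated):
        if has_been_updated:
            self.ds_tensor = list(self.data)
        # fix: forward the flag instead of letting it default to False
        self._partition_param_sec(has_been_updated=has_been_updated)

    def _partition_param_sec(self, has_been_updated=False):
        if has_been_updated:
            self.ds_secondary_tensor = list(self.data)

p = Param([0.0] * 4)
p.data = [1.0] * 4               # weights modified under GatheredParameters
p._partition(has_been_updated=True)
print(p.ds_secondary_tensor == p.ds_tensor)  # True: both partitions refreshed
```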

@zhengchenyu
Contributor Author

zhengchenyu commented Dec 3, 2025

And do you mean the case deepspeed.zero.GatheredParameters(..., modifier_rank=None)? If so, I agree with adding invalidate_secondary_tensor here. Thanks for your suggestion.
But if the model is updated while modifier_rank is not specified, it won't be broadcast, and the secondary tensor should be invalidated. However, that is actually incorrect usage, and ds_tensor may also be inconsistent.

@sfc-gh-truwase
Collaborator

But if the model is updated while modifier_rank is not specified, it won't be broadcast, and the secondary tensor should be invalidated. However, that is actually incorrect usage, and ds_tensor may also be inconsistent.

Yes, this would be incorrect usage but it is not the API responsibility to detect such cases. So let's not worry about it.

@sfc-gh-truwase
Collaborator

And do you mean the case deepspeed.zero.GatheredParameters(..., modifier_rank=None)? If so, I agree with adding invalidate_secondary_tensor here. Thanks for your suggestion.

Thanks for making the change.

@sfc-gh-truwase
Collaborator

@zhengchenyu I apologize; I realize I gave you misleading information because I didn't read the existing GatheredParameters.__exit__() carefully.

In summary, your current PR is fine as is. I will approve to unblock for merging.

I will explain a bit more below just for the records.

  1. For the if self.src_rank is None: case: the code here is actually correct and does not need an invalidate_secondary_tensor call, because the user has declared that no parameters changed, so the secondary tensor is clean. It is possible for users to misuse this API as you pointed out, but let's not worry about that.
  2. For the if self.src_rank is not None: case: this is when the secondary tensor becomes dirty. One option is to call invalidate_secondary_tensor; the other is to propagate has_been_updated to _partition_param_sec, which is your current solution. I agree with your current solution.

Apologies for the confusion and extra work.

@zhengchenyu
Contributor Author

Thanks very much for your review!

@sfc-gh-truwase sfc-gh-truwase enabled auto-merge (squash) December 3, 2025 21:24
@sfc-gh-truwase sfc-gh-truwase merged commit c069ceb into deepspeedai:master Dec 3, 2025
11 checks passed