
Conversation

@zhengchenyu
Contributor

ds_secondary_tensor may be dirty during model loading or zero checkpointing for zero++.

1. Loading the model

My task is SFT with transformers. In the transformers code, initialization is done with code like the following:

with deepspeed.zero.Init():
    model = xxx  # model construction; parameters are partitioned as they are created

After this, each param is already a DeepSpeed-partitioned tensor, meaning both ds_tensor and ds_secondary_tensor exist. Then load_model is called to reload the model:

with deepspeed.zero.GatheredParameters(params_to_gather, modifier_rank=0):
    if torch.distributed.get_rank() == 0:
        module._load_from_state_dict(*args)

In GatheredParameters.__exit__, params[0].partition is called with has_been_updated set to True, indicating that the data needs to be updated. However, _partition does not pass has_been_updated on to _partition_param_sec. As a result, ds_secondary_tensor becomes dirty.
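To illustrate the mechanism, here is a minimal, self-contained sketch (illustrative names and plain lists, not DeepSpeed's actual implementation) of how a partition call that does not forward has_been_updated leaves the secondary copy stale:

```python
class Param:
    def __init__(self, data):
        self.data = list(data)
        self.ds_tensor = list(data)            # primary partition
        self.ds_secondary_tensor = list(data)  # ZeRO++ secondary partition

    def _partition(self, has_been_updated):
        if has_been_updated:
            self.ds_tensor = list(self.data)   # primary is refreshed
        # BUG: the flag is not forwarded, so the secondary cache keeps old data
        self._partition_param_sec()

    def _partition_param_sec(self, has_been_updated=False):
        if has_been_updated:
            self.ds_secondary_tensor = list(self.data)

p = Param([0.0] * 4)
p.data = [1.0] * 4               # weights modified under GatheredParameters
p._partition(has_been_updated=True)
print(p.ds_tensor)               # [1.0, 1.0, 1.0, 1.0]
print(p.ds_secondary_tensor)     # [0.0, 0.0, 0.0, 0.0] -> dirty
```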

2. Loading the zero checkpoint

The zero checkpoint is loaded into fp16_partitioned_groups_flat, which means param.ds_tensor has been updated. However, the data in param.ds_secondary_tensor has not been updated, and the next allgather will use the dirty param.ds_secondary_tensor.

A dirty ds_secondary_tensor can lead to abnormal loss. After invalidate_secondary_tensor is called in _post_step, the loss returns to normal. This is why the loss anomaly only occurs during the first steps.
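As a sketch of why invalidation repairs this case (simplified stand-ins for DeepSpeed's internals; allgather_source is a hypothetical helper standing in for the allgather path): once the checkpoint overwrites the primary partition, dropping the secondary cache forces the next allgather back onto clean data.

```python
class Param:
    def __init__(self, data):
        self.ds_tensor = list(data)            # primary partition
        self.ds_secondary_tensor = list(data)  # ZeRO++ secondary partition

    def invalidate_secondary_tensor(self):
        self.ds_secondary_tensor = None

    def allgather_source(self):
        # prefer the secondary cache when it is valid, else fall back to primary
        if self.ds_secondary_tensor is not None:
            return self.ds_secondary_tensor
        return self.ds_tensor

p = Param([0.0] * 4)
p.ds_tensor = [1.0] * 4          # checkpoint loaded into the primary partition
print(p.allgather_source())      # [0.0, 0.0, 0.0, 0.0] -> stale secondary is used
p.invalidate_secondary_tensor()  # the fix
print(p.allgather_source())      # [1.0, 1.0, 1.0, 1.0] -> clean primary is used
```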

Related issue: #7606

@zhengchenyu
Contributor Author

This picture shows that the bug has been fixed. The experimental conditions for the fix are exactly the same as those for the bug in #7606; the only difference is that the code under test includes this PR.

[Screenshot 2025-11-27 11:10:55]

@zhengchenyu zhengchenyu deleted the issue-7606 branch November 27, 2025 10:18
…ero checkpoint for zero++.

Signed-off-by: zhengchenyu <[email protected]>
@sfc-gh-truwase
Collaborator

@zhengchenyu thanks for the PR. We are taking a look.

@zhengchenyu
Contributor Author

zhengchenyu commented Nov 28, 2025

The unit test test_compile_zero.py::TestDeepCompile::test[True-1-dtype0] failed. This seems to be unrelated to this PR. Despite multiple runs on my own server, I am still unable to reproduce the failure.

@sfc-gh-truwase
Collaborator

@zhengchenyu thanks for this PR. My opinion is that invalidating the secondary tensor is the correct solution in both of these cases. So I am aligned with your solution for loading zero checkpoints.

For model loading and other cases of deepspeed.zero.GatheredParameters(..., modifier_rank != None), how about calling invalidate_secondary_tensor() here?

What do you think?

For context, ds_secondary_tensor is meant to be a cache of gathered params from a previous forward pass. Therefore, for safety, it should be invalidated whenever model weights change in any way.
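The suggestion can be sketched as follows (simplified stand-in classes, not the real deepspeed.zero.GatheredParameters): on context exit with a modifier_rank set, the weights may have changed, so the secondary cache is dropped for safety.

```python
class Param:
    def __init__(self, data):
        self.data = list(data)
        self.ds_tensor = list(data)
        self.ds_secondary_tensor = list(data)

    def partition(self, has_been_updated):
        if has_been_updated:
            self.ds_tensor = list(self.data)

    def invalidate_secondary_tensor(self):
        self.ds_secondary_tensor = None

class GatheredParameters:
    def __init__(self, params, modifier_rank=None):
        self.params = params
        self.src_rank = modifier_rank

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        updated = self.src_rank is not None
        for p in self.params:
            p.partition(has_been_updated=updated)
            if updated:
                p.invalidate_secondary_tensor()  # weights may have changed

p = Param([0.0] * 2)
with GatheredParameters([p], modifier_rank=0):
    p.data = [1.0] * 2  # rank 0 modifies the weights
print(p.ds_tensor, p.ds_secondary_tensor)  # [1.0, 1.0] None
```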

@zhengchenyu
Contributor Author

zhengchenyu commented Dec 3, 2025

@sfc-gh-truwase
I think your suggestion is correct. In fact, I initially solved the problem by using invalidate_secondary_tensor.

However, I found the root cause was that _partition did not pass the has_been_updated parameter when calling self._partition_param_sec, causing ds_secondary_tensor to become dirty. Since broadcast is used to ensure each parameter is correct, it is safe not to call invalidate_secondary_tensor here.

In fact, we have two solutions to this problem:

  • (1) Invalidate the secondary tensor whenever model weights change.

Your solution maintains consistent logic: if the weights change, invalidate the secondary tensor.

If that's the case, I think this place might also need to invalidate the secondary tensor.

However, the drawback of this approach is that it discards a still-useful secondary tensor.

  • (2) Pass has_been_updated to _partition_param_sec in _partition.

This approach avoids wasting the secondary tensor when the parameter has already been broadcast.

In fact, I think both are OK, but I prefer (2). However, if you think we need to maintain consistent logic, I will change it to (1).
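Option (2) can be sketched like this (again with simplified stand-ins for DeepSpeed's internals, not the actual code): forwarding the flag refreshes both partitions in the same place, and leaves the cache intact when nothing changed.

```python
class Param:
    def __init__(self, data):
        self.data = list(data)
        self.ds_tensor = list(data)            # primary partition
        self.ds_secondary_tensor = list(data)  # ZeRO++ secondary partition

    def _partition(self, has_been_updated):
        if has_been_updated:
            self.ds_tensor = list(self.data)
        # fix: forward the flag instead of letting it default to False
        self._partition_param_sec(has_been_updated=has_been_updated)

    def _partition_param_sec(self, has_been_updated=False):
        if has_been_updated:
            self.ds_secondary_tensor = list(self.data)

p = Param([0.0] * 4)
p.data = [1.0] * 4               # weights modified under GatheredParameters
p._partition(has_been_updated=True)
print(p.ds_secondary_tensor == p.ds_tensor)  # True: both partitions refreshed
```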

@zhengchenyu
Contributor Author

zhengchenyu commented Dec 3, 2025

And do you mean the case deepspeed.zero.GatheredParameters(..., modifier_rank=None)? If so, I agree with adding invalidate_secondary_tensor here. Thanks for your suggestion.
But if the model is updated while modifier_rank is not specified, it won't be broadcast, and the secondary tensor should be invalidated. However, that is actually incorrect usage, and ds_tensor may also be inconsistent.

@sfc-gh-truwase
Collaborator

But if the model is updated while modifier_rank is not specified, it won't be broadcast, and the secondary tensor should be invalidated. However, that is actually incorrect usage, and ds_tensor may also be inconsistent.

Yes, this would be incorrect usage but it is not the API responsibility to detect such cases. So let's not worry about it.

@sfc-gh-truwase
Collaborator

And do you mean the case deepspeed.zero.GatheredParameters(..., modifier_rank=None)? If so, I agree with adding invalidate_secondary_tensor here. Thanks for your suggestion.

Thanks for making the change.

@sfc-gh-truwase
Collaborator

@zhengchenyu I apologize; I realize I gave you misleading information because I didn't read the existing GatheredParameters.__exit__() carefully.

In summary, your current PR is fine as is. I will approve to unblock for merging.

I will explain a bit more below just for the records.

  1. For the if self.src_rank is None: case: the code here is actually correct and does not need an invalidate_secondary_tensor call, because the user has declared that no parameters changed, so the secondary tensor is clean. It is possible for users to misuse this API as you pointed out, but let's not worry about that.
  2. For the if self.src_rank is not None: case: this is when the secondary tensor becomes dirty. One option is to call invalidate_secondary_tensor; the other is to propagate has_been_updated to _partition_param_sec, which is your current solution. I agree with your current solution.

Apologies for the confusion and extra work.

@zhengchenyu
Contributor Author

Thanks very much for your review!

@sfc-gh-truwase sfc-gh-truwase enabled auto-merge (squash) December 3, 2025 21:24
@sfc-gh-truwase sfc-gh-truwase merged commit c069ceb into deepspeedai:master Dec 3, 2025
11 checks passed