
Conversation

@tohtana (Contributor) commented on Oct 3, 2025

Currently, the DeepSpeed engine does not enable the grad scaler for the ZeRO-0 + `torch.autocast` path, even when dtype is set to `fp16`. This leads to test failures when we replace our hard-coded tolerances with PyTorch's [standard tolerances](https://docs.pytorch.org/docs/stable/testing.html#torch.testing.assert_close) (thank you @stas00 for your suggestion on the previous PR).
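
For reference, a minimal sketch of what validating against PyTorch's default tolerances looks like. The tensor names and test structure here are illustrative, not DeepSpeed's actual test code; the fp16 defaults quoted in the comments come from the PyTorch docs linked above:

```python
import torch

# Outputs from a DeepSpeed run and a plain PyTorch baseline (illustrative names).
actual = torch.randn(4, 8, dtype=torch.float16)
expected = actual.clone()

# assert_close picks rtol/atol from the input dtype
# (for float16 the documented defaults are rtol=1e-3, atol=1e-5),
# replacing hand-tuned, hard-coded tolerances in each test.
torch.testing.assert_close(actual, expected)

# Tolerances can still be loosened explicitly when a test needs it:
torch.testing.assert_close(actual, expected, rtol=1e-2, atol=1e-4)
```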

This PR enables the grad scaler for this path to improve accuracy, and refactors the tests to simplify validation with `torch.testing.assert_close`. The tests now rely on PyTorch's standard (and stricter) tolerances, and they still pass.
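
For context, this is the standard PyTorch loss-scaling pattern that the change enables on the ZeRO-0 + `torch.autocast` path. A minimal sketch of generic `torch.amp` usage, not DeepSpeed's engine internals:

```python
import torch

model = torch.nn.Linear(8, 8).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# GradScaler guards against fp16 gradient underflow by scaling the loss
# before backward and unscaling gradients before the optimizer step.
scaler = torch.amp.GradScaler("cuda")

for _ in range(3):
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(torch.randn(4, 8, device="cuda")).sum()
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # skips the step if inf/nan gradients are found
    scaler.update()         # adjusts the scale factor for the next iteration
```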

@tohtana changed the title from "Enable grad scaler for ZeRO-0" to "Enable grad scaler for ZeRO-0 + torch.autocast path" on Oct 3, 2025
@tohtana marked this pull request as ready for review on Oct 3, 2025
Signed-off-by: Masahiro Tanaka <[email protected]>
@sfc-gh-truwase enabled auto-merge (squash) on Oct 4, 2025
@sfc-gh-truwase merged commit 71d077d into deepspeedai:master on Oct 4, 2025
12 checks passed
Liangliang-Ma pushed a commit to Liangliang-Ma/DeepSpeed that referenced this pull request Oct 13, 2025

Signed-off-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Ma, Liangliang <[email protected]>