Enable grad scaler for ZeRO-0 + torch.autocast path #7619

tohtana · 2025-10-03T07:00:28Z

Currently, the DeepSpeed engine does not enable the grad scaler for the ZeRO-0 and torch.autocast path, even when dtype is set to fp16. This leads to errors in tests when we replace our hard-coded tolerances with PyTorch’s standard tolerances (thank you @stas00 for your suggestion regarding the previous PR).
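To illustrate why a missing grad scaler hurts fp16 accuracy, here is a minimal sketch using only the Python standard library. It is not DeepSpeed's or PyTorch's implementation: `to_fp16` is a hypothetical helper that rounds through `struct`'s half-precision format, and `SCALE` is an assumed fixed loss scale chosen for the example (real scalers typically adjust the scale dynamically and skip steps on overflow).

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value.

    Hypothetical helper for illustration; uses struct's 'e' (binary16) format.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Assumed fixed loss scale (2**16), for illustration only.
SCALE = 65536.0

# A gradient smaller than the smallest fp16 subnormal (about 6e-8).
true_grad = 1e-8

# Without scaling, the gradient underflows to zero when stored in fp16.
unscaled = to_fp16(true_grad)

# With scaling, the loss (and by the chain rule every gradient) is multiplied
# by SCALE before the backward pass, so the stored value is representable.
scaled = to_fp16(true_grad * SCALE)

# The scale is divided out again in higher precision before the optimizer step.
recovered = scaled / SCALE

print(f"fp16 without scaling: {unscaled}")
print(f"fp16 with scaling, then unscaled: {recovered}")
```

Without scaling the small gradient is lost entirely, while the scaled and unscaled value stays within fp16 rounding error of the original.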

This PR enables the grad scaler for this path to improve accuracy, and refactors the tests to simplify validation by using torch.testing.assert_close. The tests now rely on PyTorch’s standard (and stricter) tolerances, and they still pass.
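As a rough sketch of what relying on the standard tolerances means, the snippet below reproduces the closeness criterion documented for `torch.testing.assert_close`, `|actual - expected| <= atol + rtol * |expected|`, in plain Python. The `(rtol, atol)` pairs are the per-dtype defaults listed in the PyTorch documentation; `is_close` and the sample values are hypothetical and are not taken from the DeepSpeed tests.

```python
# Default (rtol, atol) pairs documented for torch.testing.assert_close.
DEFAULT_TOLERANCES = {
    "float16": (1e-3, 1e-5),
    "bfloat16": (1.6e-2, 1e-5),
    "float32": (1.3e-6, 1e-5),
}


def is_close(actual: float, expected: float, dtype: str) -> bool:
    """Hypothetical scalar version of the documented closeness check."""
    rtol, atol = DEFAULT_TOLERANCES[dtype]
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Illustrative values only: a 5e-4 deviation around 1.0.
actual, expected = 1.0005, 1.0

for dtype in DEFAULT_TOLERANCES:
    print(dtype, is_close(actual, expected, dtype))
```

With these sample values the deviation is accepted for `float16` and `bfloat16` but rejected for `float32`, which shows how the defaults tighten with the precision of the dtype being compared.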

Signed-off-by: Masahiro Tanaka <[email protected]>

@stas00

Currently, the DeepSpeed engine does not enable the grad scaler for the ZeRO-0 and `torch.autocast` path, even when dtype is set to `fp16`. This leads to errors in tests when we replace our hard-coded tolerances with PyTorch’s [standard tolerances](https://docs.pytorch.org/docs/stable/testing.html#torch.testing.assert_close) (thank you @stas00 for your suggestion regarding the previous PR).

This PR enables the grad scaler for this path to improve accuracy, and refactors the tests to simplify validation by using `torch.testing.assert_close`. The tests now rely on PyTorch’s standard (and stricter) tolerances, and they still pass.

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Ma, Liangliang <[email protected]>

tohtana and others added 5 commits October 2, 2025 16:20

show mismatching values when test fails

43526d6

Signed-off-by: Masahiro Tanaka <[email protected]>

Merge branch 'master' into tohtana/improve_log_dc_test_failure

a448ac1

include mismatch value in exception message

86007af

Signed-off-by: Masahiro Tanaka <[email protected]>

Merge branch 'master' into tohtana/enable_gradscaler_z0

18a65a5

enable gradscaler for z0+torch.autocast

e6bc098

Signed-off-by: Masahiro Tanaka <[email protected]>

tohtana changed the title ~~Enable grad scaler for ZeRO-0~~ Enable grad scaler for ZeRO-0 + torch.autocast path Oct 3, 2025

tohtana marked this pull request as ready for review October 3, 2025 16:39

tohtana requested review from loadams and tjruwase as code owners October 3, 2025 16:39

remove hard-coded tolerances

f779b64

Signed-off-by: Masahiro Tanaka <[email protected]>

sfc-gh-truwase approved these changes Oct 4, 2025

Merge branch 'master' into tohtana/enable_gradscaler_z0

d1cf560

sfc-gh-truwase enabled auto-merge (squash) October 4, 2025 12:59

sfc-gh-truwase merged commit 71d077d into deepspeedai:master Oct 4, 2025
12 checks passed