[BUG] deepspeed fails with torch 2.5 because module._parameters is a dict, no longer an OrderedDict #6961

Open
skydoorkai opened this issue Jan 20, 2025 · 3 comments

@skydoorkai

Describe the bug
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1904, in forward
   module._parameters._in_forward = False
AttributeError: 'dict' object has no attribute

To Reproduce
Steps to reproduce the behavior:
Use torch 2.5 with DeepSpeed.

torch 2.4 uses an OrderedDict for _parameters, and an OrderedDict instance accepts new attributes such as _in_forward.
torch 2.5 uses a plain dict for _parameters, and a plain dict does not support setting arbitrary attributes.
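
The difference described above can be reproduced without torch or DeepSpeed, using only the standard library. The sketch below is illustrative: `can_set_attribute` and `AttrDict` are hypothetical names introduced here, and the dict subclass is just one possible way to get an attribute-capable mapping, not a fix taken from DeepSpeed or PyTorch.

```python
from collections import OrderedDict


def can_set_attribute(container, name="_in_forward"):
    """Return True if an arbitrary attribute can be set on `container`."""
    try:
        setattr(container, name, False)
    except AttributeError:
        # Plain dict instances have no __dict__, so setattr fails here.
        return False
    return True


class AttrDict(dict):
    """Hypothetical dict subclass; its instances carry a __dict__,
    so extra attributes can be attached to them."""


ordered_ok = can_set_attribute(OrderedDict())   # the container type reported for torch 2.4
plain_ok = can_set_attribute({})                # the container type reported for torch 2.5
subclass_ok = can_set_attribute(AttrDict())     # illustrative alternative

print(ordered_ok, plain_ok, subclass_ok)  # True False True
```

Setting `_in_forward` on the plain dict raises the same kind of `AttributeError` as in the traceback above.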

Expected behavior
No failure.

@skydoorkai skydoorkai added bug Something isn't working training labels Jan 20, 2025
@manojks1999

@skydoorkai can I work on this?

@loadams
Contributor

loadams commented Jan 21, 2025

Hi @skydoorkai - some of our CI runs with torch 2.5 without failures. Can you confirm your torch version, or whether you're on the latest nightly build?

Edit: I see our latest tests with torch-nightly are passing as well, so I'm curious about your version, or whether you are hitting something that isn't covered by the unit tests.

@manojks1999, are you looking to contribute to DeepSpeed?
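
For the version question above, a small probe along these lines could help narrow things down. It is only a sketch: `describe_parameters_container` and `FakeModule` are hypothetical names introduced here, and the stand-in class exists so the snippet runs without torch installed.

```python
from collections import OrderedDict


def describe_parameters_container(module):
    """Report the type of module._parameters and whether it accepts new attributes."""
    params = getattr(module, "_parameters", None)
    try:
        params._probe = True   # same kind of assignment that fails in the traceback
        del params._probe      # clean up so the container is left unchanged
        accepts = True
    except AttributeError:
        accepts = False
    return {"type": type(params).__name__, "accepts_attributes": accepts}


class FakeModule:
    """Hypothetical stand-in for a module object; holds only a _parameters container."""

    def __init__(self, container):
        self._parameters = container


print(describe_parameters_container(FakeModule(OrderedDict())))
print(describe_parameters_container(FakeModule({})))

# With torch installed, the same probe could be run on a real module, e.g.:
#   import torch
#   print(torch.__version__, describe_parameters_container(torch.nn.Linear(1, 1)))
```

Posting the printed result together with the torch version would show which container type the failing environment actually uses.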

@loadams loadams self-assigned this Jan 21, 2025
@loadams
Contributor

loadams commented Jan 27, 2025

@skydoorkai - following up on this, could you provide the information from above?

@manojks1999 - are you able to repro this?
