
Remove FSDP wrapping from sub-models. #34452

Open
wants to merge 6 commits into main
Conversation

eljandoubi
Contributor

What does this PR do?

Fixes #34113

Who can review?

Library:

Member

@SunMarc left a comment


Thanks for fixing the issue @eljandoubi! Do you think there is a simpler way to handle this edge case, @muellerzr?

Comment on lines 2263 to 2286
# Remove FSDP wrapping from sub-models.
self.model = extract_model_from_parallel(self.model, recursive=True)

Member

@SunMarc · Oct 28, 2024


You can use the unwrap_model function in transformers instead. Also, why do we need to set recursive to True? And please leave a comment above, since this specific path only exists to make things work with auto_find_batch_size.
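
(For illustration, the kind of annotated change being asked for might look like the sketch below; the comment wording and surrounding context are assumptions, not the actual diff.)

# Note: this path only exists to keep training functional with auto_find_batch_size,
# which can re-run model preparation after an OOM retry; before that happens, the
# FSDP wrapping applied to sub-models by the auto-wrap policy must be removed.
self.model = extract_model_from_parallel(self.model, recursive=True)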

Contributor Author

@eljandoubi · Oct 28, 2024


unwrap_model does not provide access to the recursive argument. Auto-wrap policies wrap submodules with FSDP, and unwrap_model is unable to remove them. You can test this on the toy example from the PyTorch FSDP tutorial for rank=0 and world_size=1, then experiment with the line I provided in a notebook.

import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy
from transformers.modeling_utils import unwrap_model

# Net and rank come from the toy example in the PyTorch FSDP tutorial
# (run with rank=0 and world_size=1).
my_auto_wrap_policy = functools.partial(
    size_based_auto_wrap_policy, min_num_params=20000
)
torch.cuda.set_device(rank)
model = Net().to(rank)
print(model)
fsdp_model = FSDP(model, auto_wrap_policy=my_auto_wrap_policy)
print(fsdp_model)
unwrapped_model = unwrap_model(fsdp_model)
print(unwrapped_model)  # sub-modules wrapped by the auto-wrap policy keep their FSDP wrapper

vs. re-instantiating model and fsdp_model and unwrapping with extract_model_from_parallel:

from accelerate.utils import extract_model_from_parallel

model = Net().to(rank)
fsdp_model = FSDP(model, auto_wrap_policy=my_auto_wrap_policy)

# With recursive=True, the FSDP wrapping that the auto-wrap policy applied to
# sub-modules is removed as well, so the plain Net is recovered.
extracted_model = extract_model_from_parallel(fsdp_model, recursive=True)
print(extracted_model)

Member

@SunMarc · Nov 4, 2024


I'm talking about this function in transformers. It uses extract_model_from_parallel under the hood so it should be comparable.
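
(For context, a rough sketch of how that delegation could be wired, assuming the transformers helper simply forwards to accelerate; illustrative only, not the actual transformers source.)

from accelerate.utils import extract_model_from_parallel

def unwrap_model(model, recursive: bool = False):
    # Illustrative sketch: defer unwrapping to accelerate, forwarding the
    # recursive flag so FSDP wrapping on sub-modules is removed as well.
    return extract_model_from_parallel(model, recursive=recursive)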

Contributor Author


Ah, I see.

@eljandoubi
Contributor Author

@SunMarc @muellerzr Did you get a different result than I did?

Contributor

@muellerzr left a comment


Thanks for the fix! Can you add a test in tests/test_trainer.py for this?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@SunMarc left a comment


Thanks! Left a suggestion for unwrap_model.

@eljandoubi
Contributor Author

@SunMarc I migrated to unwrap_model.
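
(The migrated call presumably ends up looking something like the sketch below; the exact helper import and whether a recursive flag is passed explicitly are assumptions based on the discussion above, not the merged diff.)

# Hypothetical shape of the migrated line inside the Trainer method shown in the
# diff above: use transformers' unwrap_model helper instead of calling
# accelerate's extract_model_from_parallel directly, keeping recursive unwrapping.
self.model = unwrap_model(self.model, recursive=True)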

Member

@LysandreJik left a comment


Let's merge it if you're both ok with it @SunMarc @muellerzr

@ArthurZucker removed their request for review November 5, 2024 12:41
@SunMarc
Member

SunMarc commented Nov 5, 2024

Please rebase this PR on main in order to pass the CI, @eljandoubi!

Development

Successfully merging this pull request may close these issues.

fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP is not working with the Trainer
5 participants