fix "Cannot copy out of meta tensor; no data!" issue for BartForConditionalGeneration model #36572

Open

yao-matrix wants to merge 4 commits into base: main
Conversation


@yao-matrix commented Mar 6, 2025

Fixes #36247.

Cause of this issue:

The bart-large-cnn checkpoint only contains `model.decoder.embed_tokens.weight` at `load_state_dict` time, but in the `BartModel` implementation both `model.decoder.embed_tokens.weight` and `model.encoder.embed_tokens.weight` are tied to `shared.weight`, which has no corresponding real weights in the checkpoint. When `tie_weights` is called, both embeddings end up pointing back at the empty `shared.weight`, which still lives on the meta device, and the subsequent `model.to(...)` then fails.
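As a rough illustration of the failure path (a sketch based on the description above; the `low_cpu_mem_usage` loading path is one way to hit it, see issue #36247 for the exact reproduction):

```python
import torch
from transformers import BartForConditionalGeneration

# Parameters are first created on the meta device, then filled in from the
# checkpoint; shared.weight has no checkpoint entry, so it stays on meta.
model = BartForConditionalGeneration.from_pretrained(
    "facebook/bart-large-cnn",
    low_cpu_mem_usage=True,
)

# After tie_weights(), both embed_tokens point back at the still-meta
# shared.weight, so moving the model raises:
# NotImplementedError: Cannot copy out of meta tensor; no data!
model.to("cuda" if torch.cuda.is_available() else "cpu")
```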

@SunMarc @Rocketknight1, I'm not sure whether this PR meets your standards, since it changes modeling code. Please let me know your thoughts, thanks.


github-actions bot commented Mar 6, 2025

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

@github-actions bot marked this pull request as draft March 6, 2025 06:11
@yao-matrix marked this pull request as ready for review March 6, 2025 06:15
@Rocketknight1
Member

Hi @yao-matrix, this causes several other BART tests to fail! You can see the failures in the CI at the bottom of this post.

Some of the failures are just code style; these are simple and can be fixed with `make style` or `make fixup` after `pip install transformers[quality]`. However, the failures in `tests_torch` indicate that this PR breaks BART weight loading, probably because the old shared tensor doesn't match up with the architecture after your PR. We won't be able to merge the PR unless those errors are resolved, and I'm not sure that will be possible while `self.shared` is removed.

@SunMarc
Member

I don't think we want to change the modeling code to fix one specific checkpoint, even though I know this is one of the most used ones; other checkpoints may ship the `shared` module's weights instead. Maybe we can instead update `_tie_weights` to tie `self.shared` to `self.decoder.embed_tokens` when `self.decoder.embed_tokens` is not on the meta device?
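A sketch of what that suggestion could look like (my reading of it, not the actual diff in this PR; the attribute names follow the existing `BartModel`):

```python
# Hypothetical shape of BartModel._tie_weights after the suggested change.
def _tie_weights(self):
    if self.config.tie_word_embeddings:
        if (
            self.shared.weight.device.type == "meta"
            and self.decoder.embed_tokens.weight.device.type != "meta"
        ):
            # Checkpoints like bart-large-cnn only ship decoder.embed_tokens,
            # so tie shared (and the encoder) to the tensor that has real data.
            self._tie_or_clone_weights(self.shared, self.decoder.embed_tokens)
            self._tie_or_clone_weights(self.encoder.embed_tokens, self.decoder.embed_tokens)
        else:
            # Default direction: everything ties back to self.shared.
            self._tie_or_clone_weights(self.encoder.embed_tokens, self.shared)
            self._tie_or_clone_weights(self.decoder.embed_tokens, self.shared)
```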

@yao-matrix
Author

@Rocketknight1 @SunMarc, Marc's approach is better; I used it to fix the issue. Please help review and comment, thanks.

@SunMarc
Member

Thanks for fixing this! Really appreciate it!

@SunMarc requested a review from ArthurZucker March 7, 2025 14:03
@yao-matrix
Author

@ArthurZucker, please take a look, thanks.
