fix missing world_size in args_to_keep #66

Open
wants to merge 1 commit into
base: multi-query-attention

Conversation

mayank31398

This adds the missing world_size to args_to_keep in the checkpoint saver Python file.
Currently, world_size is picked up from the checkpoint, but we want it to be set to 1.
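
As a rough illustration of the behaviour described above, here is a minimal sketch of a keep-list merge. It is not the saver's actual code: `merge_checkpoint_args`, the contents of `ARGS_TO_KEEP`, and all argument values are hypothetical and only mirror the idea that checkpoint values overwrite the current arguments unless a name is listed in args_to_keep.

```python
from argparse import Namespace
from typing import Iterable

# Hypothetical keep-list. Per the PR title, world_size is the entry being added.
ARGS_TO_KEEP = [
    "tensor_model_parallel_size",
    "pipeline_model_parallel_size",
    "world_size",
]


def merge_checkpoint_args(
    current: Namespace, checkpoint: Namespace, keep: Iterable[str]
) -> Namespace:
    """Copy checkpoint values onto the current args, skipping names in `keep`."""
    keep = set(keep)
    merged = Namespace(**vars(current))
    for name, value in vars(checkpoint).items():
        if name in keep:
            continue  # keep the value chosen by the saver
        setattr(merged, name, value)
    return merged


# Assumed values: the saver wants an unsharded (world_size=1) view,
# while the checkpoint was written by an 8-GPU job.
current = Namespace(world_size=1, tensor_model_parallel_size=1,
                    pipeline_model_parallel_size=1, hidden_size=0)
checkpoint = Namespace(world_size=8, tensor_model_parallel_size=8,
                       pipeline_model_parallel_size=1, hidden_size=4096)

without_fix = merge_checkpoint_args(current, checkpoint, keep=ARGS_TO_KEEP[:2])
with_fix = merge_checkpoint_args(current, checkpoint, keep=ARGS_TO_KEEP)

print(without_fix.world_size)  # 8: taken from the checkpoint
print(with_fix.world_size)     # 1: preserved by the keep-list
```

In this sketch, model hyperparameters such as hidden_size still come from the checkpoint; only the names in the keep-list retain the saver's values.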

@RaymondLi0
Collaborator

This has never caused an issue on my side. Can you provide more explanation? When is this an issue?

@mayank31398
Author

@RaymondLi0 this doesn't cause an issue if you are running a job with just 1 GPU, I guess.
But in my case, I have a dedicated node with 8 GPUs, and the script throws an error saying that the global batch size should be a multiple of the number of GPUs.
world_size is set to 8 in this case, and we want to emulate a world size of 1 to unshard :)
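
The failure mode can be sketched as a simplified divisibility check. This is an assumption about the shape of the check, not the project's actual validation code: `check_global_batch_size` and the batch sizes below are hypothetical.

```python
def check_global_batch_size(
    global_batch_size: int, micro_batch_size: int, world_size: int
) -> int:
    """Return the number of micro-batches per step, or raise if sizes do not divide.

    Simplified assumption: every rank is a data-parallel rank, so the global
    batch must be a multiple of micro_batch_size * world_size.
    """
    step = micro_batch_size * world_size
    if global_batch_size % step != 0:
        raise ValueError(
            f"global batch size ({global_batch_size}) is not divisible by "
            f"micro batch size ({micro_batch_size}) times world size ({world_size})"
        )
    return global_batch_size // step


# Hypothetical batch sizes stored in a checkpoint.
print(check_global_batch_size(12, 4, world_size=1))  # 3: passes when emulating 1 GPU

try:
    check_global_batch_size(12, 4, world_size=8)  # world_size inherited from the checkpoint
except ValueError as err:
    print("error:", err)
```

With world_size forced to 1, any global batch size that is a multiple of the micro batch size passes, which is why the unsharding step works regardless of how many GPUs the original job used.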

I am surprised that this has gone unnoticed.

@mayank31398
Author

@RaymondLi0 any update on this?
