Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix slurm multinode example #3229

Closed

Conversation

ffrancesco94
Copy link

@ffrancesco94 ffrancesco94 commented Nov 8, 2024

What does this PR do?

The SLURM multinode submit script doesn't work for multiple reasons (see also #1239 which is still open). This PR aims at solving some of those issues, namely:

  • Typo in the $CMD command
  • $SLURM_NNODES was recently deprecated in favour of $SLURM_JOB_NUM_NODES
  • If multinode setup involves multiple GPUs, it has to be enforced with --multi-gpu and the rank of each process has to be set with --machine-rank
  • If multiple GPUs are present, the complete_nlp_example doesn't handle distributed evaluation correctly. In particular, either the evaluate GLUE mrpc metric has to be loaded with the --num_process and --process_id flags (which will subsequently fail due to several breakages due to recent datasets  evaluate#542 & related) or metric computation has to happen only on the main process. This PR goes for the second.

Fixes #3206

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings. NA
  • Did you write any new necessary tests? NA

Who can review?

@muellerzr or @SunMarc

Compute metric with evaluate from main process only, avoiding bug in multinode evaluate.
Enforce --multi_gpu on multiple nodes. Moreover, make sure that each rank gets correctly addressed based on the $SLURM_PROCID. $SLURM_NNODES has now been deprecated and replaced by $SLURM_JOB_NUM_NODES. Fixed typo in $CMD as well.
@SunMarc SunMarc requested a review from muellerzr November 15, 2024 16:52
Comment on lines -249 to +256
eval_metric = metric.compute()
# Use accelerator.print to print only on the main process.
accelerator.print(f"epoch {epoch}:", eval_metric)
if accelerator.is_main_process:
# Computing metrics in a distributed manner requires calling evaluate.load() with the
# n_process and process_id arguments. However, the metric.add_batch() step will fail
# due to a bug with datasets and evaluate (see https://github.com/huggingface/evaluate/issues/542)
# and related
eval_metric = metric.compute()
# Use accelerator.print to print only on the main process.
accelerator.print(f"epoch {epoch}:", eval_metric)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaking is a first for me, let me try to repr this

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Jan 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multinode, multigpu example fails
3 participants