Dev global rank #313

Merged
merged 6 commits into from
Nov 7, 2024

@wiederm (Member) commented Nov 7, 2024

Pull Request Summary

This PR fixes a bug introduced in PR #308, which restricted logging of the epoch time to the main process (rank == 0). That restriction led to a synchronization error that froze training after the first epoch. The epoch time is still logged only on the main process, but the logging no longer disrupts the training flow.

Key changes

  • Adjusted Logging Conditions: Specific logging calls now pass rank_zero_only=True where necessary, so they log only on the main process without causing synchronization issues across processes.
  • Removed Redundant Logging Checks: Metrics that were previously wrapped in if self.global_rank == 0 checks are now logged by passing rank_zero_only directly to self.log().
  • Ensured Sync Consistency: Logging operations that require synchronization across distributed processes (e.g., gradients, learning rate) use sync_dist=True for safe distributed handling, while process-specific logging keeps rank_zero_only=True.
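
The pattern behind these changes can be illustrated with a toy model. The sketch below uses only the Python standard library and simulates each rank with a thread. It assumes that the freeze came from a synchronizing log call that only rank 0 reached, which is one plausible mechanism and is not confirmed by this PR. `ToyLogger`, `guarded_step`, `unguarded_step`, and `run` are hypothetical names for illustration; they are not part of the project or of any logging library, and `ToyLogger.log()` only mimics the flag names used above. For brevity, the toy passes both flags in a single call, which may differ from how the PR applies them to individual metrics.

```python
import threading

WORLD_SIZE = 2  # number of simulated ranks (assumption for this toy)


class ToyLogger:
    """Hypothetical stand-in for a distributed logger; not a real library API."""

    def __init__(self, world_size, timeout=0.5):
        self.barrier = threading.Barrier(world_size)
        self.timeout = timeout
        self.lock = threading.Lock()
        self.pending = []   # values contributed to the current sync
        self.recorded = []  # (rank, name, value) entries that were written

    def log(self, rank, name, value, sync_dist=False, rank_zero_only=False):
        if sync_dist:
            with self.lock:
                self.pending.append(value)
            # Collective step: blocks until every rank has called log().
            # A real job would hang here; the toy gives up after a timeout.
            self.barrier.wait(self.timeout)
            value = sum(self.pending) / len(self.pending)
        if rank_zero_only and rank != 0:
            return  # this rank took part in the sync but writes nothing
        with self.lock:
            self.recorded.append((rank, name, value))


def guarded_step(logger, rank):
    # Problematic pattern: only rank 0 reaches the collective step.
    if rank == 0:
        logger.log(rank, "epoch_time", 1.0 + rank, sync_dist=True)


def unguarded_step(logger, rank):
    # Every rank calls log(); the flag decides which rank writes.
    logger.log(rank, "epoch_time", 1.0 + rank,
               sync_dist=True, rank_zero_only=True)


def run(step, world_size=WORLD_SIZE):
    """Run `step` once per simulated rank; return (status, recorded)."""
    logger = ToyLogger(world_size)
    status = {}

    def worker(rank):
        try:
            step(logger, rank)
            status[rank] = "ok"
        except threading.BrokenBarrierError:
            status[rank] = "stalled"  # the toy's version of a freeze

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status, logger.recorded
```

In this toy, `run(guarded_step)` reports rank 0 as "stalled" because it waits for a peer that never arrives, whereas `run(unguarded_step)` completes on both ranks and records a single entry from rank 0.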

Associated Issue(s)

Pull Request Checklist

  • Issue(s) raised/addressed and linked
  • Includes appropriate unit test(s)
  • Appropriate docstring(s) added/updated
  • Appropriate .rst doc file(s) added/updated
  • PR is ready for review

@wiederm wiederm self-assigned this Nov 7, 2024
@wiederm wiederm linked an issue Nov 7, 2024 that may be closed by this pull request
codecov-commenter commented Nov 7, 2024

Codecov Report

Attention: Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.

Project coverage is 85.48%. Comparing base (c1430f2) to head (8b23f18).

@wiederm wiederm merged commit fa4814f into main Nov 7, 2024
2 of 6 checks passed
@wiederm wiederm deleted the dev-global-rank branch November 7, 2024 14:40
Development

Successfully merging this pull request may close these issues.

Synchronization Error with Epoch Time Logging
2 participants