[Feature] Use CUDA event for measuring elapsed time #88

Open
xrsrke opened this issue Mar 2, 2024 · 0 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed)

xrsrke commented Mar 2, 2024

As mentioned in the paper "MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs" (page 8):

we develop a performance analysis tool that records the execution time of critical code segments on each machine rank during a run. In contrast to previous tools such as the torch profiler or the Megatron-LM timer, our tool times events based on the CUDA events method. This approach minimizes the need for CUDA synchronization, thus preventing performance degradation, allowing us to consistently run it in our production training jobs

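Using torch.cuda.Event to measure elapsed time requires less CUDA synchronization than time.time() [link]. CUDA kernels launch asynchronously, so a time.time() measurement is only accurate if the host calls torch.cuda.synchronize() around the timed segment, whereas events are recorded on the stream and their timings can be read back later.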
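
A minimal sketch of what this could look like, to make the proposal concrete. It is not nanotron's existing timer API: the EventTimer class and its start/stop/flush methods are hypothetical names chosen for illustration. Only torch.cuda.Event(enable_timing=True), record(), synchronize() and elapsed_time() are real PyTorch calls. The sketch assumes start and stop for a segment are called on the same CUDA stream, and it falls back to time.perf_counter() when PyTorch or a GPU is unavailable so it can also run on a CPU-only machine.

```python
import time

try:
    import torch
except ImportError:  # assumption: lets the sketch run where PyTorch is missing
    torch = None


class EventTimer:
    """Hypothetical helper (not nanotron's API) that times named segments."""

    def __init__(self):
        self.use_cuda = torch is not None and torch.cuda.is_available()
        self._open = {}     # name -> start marker of a segment in progress
        self._pending = []  # (name, start, end) tuples not yet read back

    def _mark(self):
        if self.use_cuda:
            event = torch.cuda.Event(enable_timing=True)
            event.record()  # enqueued on the current stream, host does not block
            return event
        return time.perf_counter()

    def start(self, name):
        self._open[name] = self._mark()

    def stop(self, name):
        start = self._open.pop(name)
        self._pending.append((name, start, self._mark()))

    def flush(self):
        """Return {name: [elapsed_ms, ...]} for all finished segments.

        This is the only place the host waits on the GPU, so it can be
        called rarely (for example every N steps) instead of per segment.
        """
        results = {}
        for name, start, end in self._pending:
            if self.use_cuda:
                end.synchronize()  # waits for this event only
                elapsed_ms = start.elapsed_time(end)
            else:
                elapsed_ms = (end - start) * 1000.0
            results.setdefault(name, []).append(elapsed_ms)
        self._pending.clear()
        return results


timer = EventTimer()
for _ in range(3):
    timer.start("step")
    if timer.use_cuda:
        x = torch.randn(1024, 1024, device="cuda")
        (x @ x).sum()
    else:
        sum(i * i for i in range(10_000))
    timer.stop("step")

timings = timer.flush()
print({name: [round(ms, 3) for ms in values] for name, values in timings.items()})
```

The design choice here is to defer reading the timings: start and stop only enqueue events, and the host waits on the GPU once, inside flush, rather than around every timed segment.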

xrsrke added the enhancement, good first issue, and help wanted labels Mar 2, 2024
xrsrke changed the title from "[Feature] Replace nanotron's timer with CUDA event?" to "[Feature] Use CUDA event for measuring elapsed time" Mar 2, 2024