Fix: prevent double accumulation of load balancing loss and z-loss wi… #1331

Open

thuwzt wants to merge 1 commit into main
Conversation

@thuwzt commented on Dec 20, 2024

Fix for issue #1330: [BUG] MoE load balancing loss is accumulated twice when using activation checkpointing

@taehwakkwon

Thank you for fixing this issue. I have a question regarding the aux loss.

Although the logged value appears to be doubled, I believe the actual training remains the same: the doubled aux loss is not accumulated into the optimizer. The reason I'm asking is that we previously had recompute turned on for a period, and now we can turn it off, so I'm confused about whether we need to double moe-aux-loss-coeff. I don't think it's necessary, because the loss value is only added to aux_losses_tracker:

tracker[name]["values"][layer_number - 1] += loss.detach()  # Aggregate the loss for the layer.
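
For context, here is a minimal, self-contained sketch in plain PyTorch (not Megatron-LM code; the `moe_layer` and `tracker` names are made up for illustration) of why the logged value doubles under reentrant activation checkpointing: the checkpointed forward is re-run inside backward, so the side-effecting tracker update executes twice, while the loss attached to the autograd graph is only counted once.

```python
# Illustrative sketch only (not Megatron-LM code): a side-effect accumulator inside a
# checkpointed layer is updated on the first forward AND on the recompute pass during
# backward, so the logged value is doubled while gradients stay correct.
import torch
from torch.utils.checkpoint import checkpoint

tracker = {"load_balancing_loss": torch.zeros(1)}    # stand-in for aux_losses_tracker

def moe_layer(x):
    aux_loss = x.pow(2).mean()                       # stand-in for the load balancing loss
    tracker["load_balancing_loss"] += aux_loss.detach()  # logging side effect
    return x * 2 + aux_loss                          # aux loss joins the graph exactly once

x = torch.randn(4, requires_grad=True)
out = checkpoint(moe_layer, x, use_reentrant=True)   # pass 1: forward, activations dropped
out.sum().backward()                                 # pass 2: recompute inside backward

true_aux = x.detach().pow(2).mean()
print(tracker["load_balancing_loss"].item(), (2 * true_aux).item())  # logged value is doubled
print(x.grad)                                        # gradients are unaffected
```

Without --moe-layer-recompute the layer only runs once per step, so the logged value and the true value agree.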

@thuwzt (Author) commented on Jan 8, 2025

@taehwakkwon Yes, you understand correctly. This is merely a display bug related to the tracker and logger. The compute flow remains unaffected, regardless of whether --moe-layer-recompute is used or not.
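
To see the fix conceptually, here is an illustrative guard that continues the hypothetical sketch above; it is not necessarily the exact change in this PR. With Megatron-style reentrant checkpointing the throwaway first pass runs under torch.no_grad(), so updating the tracker only when gradients are enabled makes the logged value count each layer exactly once, with or without --moe-layer-recompute.

```python
def moe_layer(x):
    aux_loss = x.pow(2).mean()
    # Log only on the grad-enabled pass: the ordinary training forward, or the
    # recompute pass when reentrant activation checkpointing is used. The no_grad
    # first pass of the checkpoint is skipped, so the tracker sees each layer's
    # loss exactly once per step.
    if torch.is_grad_enabled():
        tracker["load_balancing_loss"] += aux_loss.detach()
    return x * 2 + aux_loss
```

Note that this relies on the reentrant checkpoint running its first pass under no_grad; non-reentrant checkpointing behaves differently and may need a different guard.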
