【GLM】subbatch performance and weight bug fix #2661
Conversation
Codecov Report

❌ Patch coverage is 5.88%. The patch status failed because the patch coverage (5.88%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

@@            Coverage Diff            @@
##           develop    #2661   +/-   ##
==========================================
  Coverage         ?   29.63%
==========================================
  Files            ?      311
  Lines            ?    54682
  Branches         ?        0
==========================================
  Hits             ?    16204
  Misses           ?    38478
  Partials         ?        0

☔ View full report in Codecov by Sentry.
paddleformers/trainer/trainer.py
Outdated
      saved_signal_path = os.path.join(output_dir, f"saved_signal_{dist.get_rank()}")

-     if self.args.unified_checkpoint and self.args.offload_optim:
+     if self.args.unified_checkpoint and (self.args.offload_optim or self.args.tensorwise_offload_optimizer):
This causes an abnormal rise in GPU memory while the optimizer parameters are being saved; writing it this way is not recommended.
done
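For context, the per-rank signal file in the diff above is a completion marker: each rank touches its own file after writing its checkpoint shard so other processes can tell when the save has finished. Below is a minimal sketch of that pattern, assuming paddle.distributed is available; the file contents and the wait helper are illustrative, not paddleformers API.

```python
import os
import time

import paddle.distributed as dist


def touch_saved_signal(output_dir):
    # Drop a per-rank marker file once this rank's shard is written.
    path = os.path.join(output_dir, f"saved_signal_{dist.get_rank()}")
    with open(path, "w") as f:
        f.write("1")


def wait_all_saved(output_dir, world_size, poll_seconds=1.0):
    # Block until every rank's marker exists, i.e. the checkpoint is complete.
    expected = [os.path.join(output_dir, f"saved_signal_{rank}") for rank in range(world_size)]
    while not all(os.path.exists(path) for path in expected):
        time.sleep(poll_seconds)
```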
examples/run_finetune.py
Outdated
os.environ["USE_CASUAL_MASK"] = "False"


def mock_offload_optimizer():
unified_checkpoint_config: ignore_merge_optimizer
optim: adamw_custom
tensorwise_offload_optimizer: True
Adding the settings above to training also offloads the optimizer and reduces GPU memory. For now, remove the optimizer-related changes so a first version of this PR can be merged.
Deleted.
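For reference, a sketch of what a run_finetune.py-style argument file with the three settings above might look like; every key other than those three is a placeholder assumption, not taken from this PR.

```python
import json

# The three optimizer-offload keys come from the review comment above;
# the remaining keys are illustrative assumptions.
config = {
    "model_name_or_path": "THUDM/glm-4-9b",  # placeholder model name
    "optim": "adamw_custom",
    "tensorwise_offload_optimizer": True,
    "unified_checkpoint": True,
    "unified_checkpoint_config": "ignore_merge_optimizer",
}

with open("sft_argument.json", "w") as f:
    json.dump(config, f, indent=2)
```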
    moe_group=moe_group,
)
if hasattr(dist, "fleet") and dist.is_initialized() and expert_parallel_degree > 1:
    # for p in self.experts.parameters():
Just delete commented-out code that is no longer needed.
Please check the whole file for commented-out code.
done
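As background on the guard in that snippet: under expert parallelism, each rank holds different expert weights, so the data-parallel allreduce must not average them. The commented-out loop likely gestured at a pattern like the sketch below; the no_sync attribute name is an assumption, not confirmed paddleformers behavior.

```python
import paddle.distributed as dist


def mark_expert_params_no_sync(experts, expert_parallel_degree):
    # Hedged sketch: flag expert parameters so a data-parallel gradient hook
    # can skip averaging them when expert parallelism is active.
    if hasattr(dist, "fleet") and dist.is_initialized() and expert_parallel_degree > 1:
        for p in experts.parameters():
            p.no_sync = True  # assumption: attribute read by the DP sync hook
```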
LGTM
Before submitting

Add test cases to the tests folder. If there are codecov issues, please add test cases first.

PR types
PR changes
Description
New configuration:
The local GLM SFT training changes depend on the following PRs:
sp: #2621
fused_loss fix: #2648
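On the "subbatch" part of the title: the general technique is to compute the token-level loss over fixed-size slices of the flattened batch so the peak memory of the softmax/cross-entropy step stays bounded. Below is a minimal sketch of that idea with illustrative names; it makes no claim to match this PR's actual implementation (real code would also handle ignore_index for padding tokens).

```python
import paddle
import paddle.nn.functional as F


def subbatch_cross_entropy(logits, labels, sub_batch_size=1024):
    """Cross entropy computed slice by slice to cap peak activation memory."""
    logits = logits.reshape([-1, logits.shape[-1]])
    labels = labels.reshape([-1])
    losses = []
    total_tokens = 0
    for start in range(0, logits.shape[0], sub_batch_size):
        chunk_logits = logits[start : start + sub_batch_size]
        chunk_labels = labels[start : start + sub_batch_size]
        # Sum per slice, normalize once at the end, so the result matches
        # a single full-batch mean cross entropy.
        losses.append(F.cross_entropy(chunk_logits, chunk_labels, reduction="sum"))
        total_tokens += int(chunk_labels.shape[0])
    return paddle.add_n(losses) / total_tokens
```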