Issues: NVIDIA/Megatron-LM
[QUESTION] How can I load a checkpoint trained by Megatron-LM 0.5 into Megatron-LM 0.7 to resume pretraining?
#1333
opened Dec 22, 2024 by
IgorZan
[BUG] MoE load balancing loss is accumulated twice when using activation checkpointing
#1330
opened Dec 20, 2024 by
thuwzt
[BUG] Megatron-LM with torch.compile: The provided qkv memory layout is not supported!
#1329
opened Dec 20, 2024 by
qingshanxwx
[QUESTION] Why doesn't GPTDataset build a global shuffle index?
#1328
opened Dec 20, 2024 by
dynamicheart
[BUG] Precision issue caused by different token dispatchers in MoE training
#1327
opened Dec 17, 2024 by
qi7kuo
[BUG] FSDP requires torch optimizer, not transformer_engine or apex
#1322
opened Dec 15, 2024 by
prrathi
[QUESTION] Does Megatron support tracing computation graphs with torch.fx?
#1315
opened Dec 7, 2024 by
fy-j
[BUG] When using LLaVA with freeze-LM, training on a text-only sample raises an error.
#1314
opened Dec 6, 2024 by
liveseongho
[QUESTION] How to specify the implementation of Attention?
#1313
opened Dec 6, 2024 by
renyinCheng001
[QUESTION] UnboundLocalError: local variable ‘output tensor’ referenced before assignment
#1311
opened Dec 5, 2024 by
zmtttt
[BUG] The problem of splitting transformer layers when pipeline parallelism cannot be evenly divided.
#1304
opened Nov 27, 2024 by
Baibaifan
[QUESTION] How to split the Transformer layers when the pipeline is uneven?
#1303
opened Nov 27, 2024 by
renyinCheng001
[QUESTION] Why is the initialization of the router and experts different in the MoE part?
#1302
opened Nov 27, 2024 by
mxymxy77
[BUG] an illegal memory access was encountered in MOE-MLP(GroupGemm)
#1301
opened Nov 26, 2024 by
hgdhrt
[BUG] 0.9.0 release version got param_gather_handle error with 3d parallel
#1292
opened Nov 19, 2024 by
SeunghyunSEO
[QUESTION] How to convert torch_dist format checkpoint to torch format?
#1291
opened Nov 19, 2024 by
zhangyilalala