Skip to content

Refactor megatron to mcore_bridge#134

Merged
tastelikefeet merged 47 commits intomodelscope:mainfrom
tastelikefeet:feat/mbridge
Apr 3, 2026
Merged

Refactor megatron to mcore_bridge#134
tastelikefeet merged 47 commits intomodelscope:mainfrom
tastelikefeet:feat/mbridge

Conversation

@tastelikefeet
Copy link
Copy Markdown
Collaborator

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Write the detail information belongs to this PR.

Experiment results

Paste your experiment result here(if needed).

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the Megatron-Core integration by offloading model configuration, creation, and weight loading logic to the mcore_bridge dependency, which allows for the removal of significant internal boilerplate code. However, the review highlights several critical issues introduced during the refactoring: the removal of the _BASE_LAYER_SUFFIXES constant and the self.hf_config attribute will lead to runtime errors since they are still referenced in the codebase. Furthermore, the send_weights method contains a NameError due to the use of an undefined args variable, which should be replaced with values from the strategy configuration.

@tastelikefeet
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the Megatron-Core integration by delegating model configuration and creation to a new MegatronStrategy class and the mcore_bridge library, while removing the internal args.py and the TorchSampler module. The review identifies a critical AttributeError in multi_lora.py where get_target_modules is called on the wrong class. Additionally, the review highlights that the refactored _add_base_layer_suffix is now too generic and may incorrectly modify non-LoRA layer names, and that MultiLoraMegatron initializes parameter configurations without an optimizer, which could prevent the setup of training-specific features like gradient reduction overlap.

@tastelikefeet
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a GRPO training script for the OlympiadBench dataset, updates the lazy dataset implementation, and includes several improvements to the checkpointing and reward computation logic. My feedback highlights an unsafe mutation of _model_keys during iteration, a likely bug in the packing detection logic, inefficient instantiation of reward classes, and excessive logging that could impact performance.

@tastelikefeet tastelikefeet merged commit 4db3c40 into modelscope:main Apr 3, 2026
1 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants