Add documentation describing what modifications you need (head/loss/inputs) to support different parallelisms of transformer block #170

jstjohn commented Sep 17, 2024

Describe everything a "model fine-tuner" would, at a minimum, need to know to:

  1. Implement custom fine-tune architectures that put different heads on a trunk (a rough sketch of points 1 and 2 follows this list)
  2. Implement custom losses
  3. Implement custom datasets (e.g., whether the batch needs special handling for context parallelism (CP) and/or sequence parallelism (SP))

Future:
  4. Everything a user would need to know to implement a custom layer that supports parallelism. This is more advanced, but we can have it on the roadmap.

From @pstjohn:
A part of this I'm still shaky on is what kinds of modifications to the actual models we need in order to support these different model-parallel strategies. Some mention of whether the underlying models need to be written specifically for Megatron would be helpful.

Are there common abstractions that allow a model to use all types of Megatron parallelization? Or do some models only support a subset of tensor / pipeline / sequence / context parallelism?

Are there any docs on how to tune the combination of these parallel choices for maximum throughput? Or is that done under the hood by Megatron?

Originally posted by @pstjohn in #153 (comment)
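For context on the quoted question: the core reason models have to be written with Megatron in mind is that tensor parallelism shards individual weight matrices across ranks, so an ordinary `nn.Linear` must be replaced by a parallel-aware equivalent. Below is a purely conceptual, single-process toy (not the Megatron-Core implementation, and the class name is made up) showing the column-sharding idea that such docs could explain:

```python
import torch
import torch.nn as nn


class ConceptualColumnParallelLinear(nn.Module):
    """Toy illustration of tensor parallelism: each "rank" holds only a slice
    of the output columns of the weight matrix and computes a partial output;
    the shards are then concatenated. A real implementation keeps only the
    local shard per process and communicates via torch.distributed instead of
    looping over shards in one process."""

    def __init__(self, in_features: int, out_features: int, world_size: int):
        super().__init__()
        assert out_features % world_size == 0, "output dim must divide evenly"
        shard_size = out_features // world_size
        # One weight shard per simulated rank.
        self.shards = nn.ModuleList(
            nn.Linear(in_features, shard_size, bias=False)
            for _ in range(world_size)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each shard produces its slice of the output columns; concatenation
        # stands in for the all-gather a distributed implementation would do.
        return torch.cat([shard(x) for shard in self.shards], dim=-1)
```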
