Enabled configurable auto Tensor Parallelism (TP) for the inference of diverse models #6553
base: master
Conversation
Hi @gyou2021, I like the goal of avoiding repetition of the same logic from L296 to L315, but I am also concerned that models enabled by these lines will not be able to run out of the box with this PR. This may not be friendly to self-helping users without access to proper BKC documentation for various models. Could
@loadams let me check with gyou on this PR status.
Sure. I updated the code to enable it to run out of the box. Thank you for your comments.
…pSpeed into configurable_autoTP. Update the latest code.
Auto TP in auto_tp.py handles Linear-type modules in emerging complex models. Three cases arise:
1) The output of some Linear modules in a model must go through an all-reduce operation after running on multiple HPU/GPU cards, but the names of those modules may differ from the ones recognized in tp_parser().
2) The weight of some Linear modules in a model CANNOT be split across multiple HPU/GPU cards.
3) The weight of some Linear modules in a model should NOT be split across multiple HPU/GPU cards, because the subsequent all-gather operation (gathering the result from all cards) would degrade performance.
In case 1) the Linear type should be changed to DeepSpeed's LinearAllreduce type; in cases 2) and 3) the modules should keep the Linear type. Configurable auto TP is proposed to handle these cases easily: tp_parser() adds the Linear modules of case 1) (their module name list is stored in the environment variable 'DS_ALL_REDUCE_LINEAR_ITEMS'), and _replace() adds the Linear modules of cases 2) and 3) (their module name list is stored in the environment variable 'DS_KEEP_LINEAR_ITEMS'). Both environment variables are configurable: they can be set directly in the environment or through a configuration file. A sketch of how such a variable could be parsed is shown below.
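As a rough illustration only (a minimal sketch, not the actual auto_tp.py code; the helper name _load_linear_items is made up for this example), a dict-style environment variable like the ones above could be parsed as follows:

import ast
import os

def _load_linear_items(env_name, model_type):
    # Read a dict-like string such as "{'w2':'mixtral'}" from the environment
    # and return the module names that apply to the given model type.
    raw = os.environ.get(env_name, "")
    if not raw:
        return []
    items = ast.literal_eval(raw)  # e.g. {'w2': 'mixtral'}
    return [name for name, model in items.items() if model == model_type]

# With the Mixtral settings shown further below, these would return ['w2'] and ['gate'].
all_reduce_linears = _load_linear_items("DS_ALL_REDUCE_LINEAR_ITEMS", "mixtral")
keep_linears = _load_linear_items("DS_KEEP_LINEAR_ITEMS", "mixtral")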
Take the Mixtral 8x7B model as an example:
We will add 'w2' to LinearAllreduce and keep 'gate' as Linear; 'o_proj' is already a default DeepSpeed LinearAllreduce layer.
Add the following to your main code:
import os
os.environ["DS_ALL_REDUCE_LINEAR_ITEMS"] = "{'w2':'mixtral'}"
os.environ["DS_KEEP_LINEAR_ITEMS"] = "{'gate':'mixtral'}"
Original Mixtral model:
Mixtral model with auto TP: