Enabled configurable auto Tensor Parallelism (TP) for the inference of diverse models #6553
base: master
Conversation
Hi @gyou2021, I like the goal of avoiding repetition of the same logic from L296 to L315, but I am also concerned that models enabled by these lines will not be able to run out of the box with this PR. This may not be friendly to self-helping users without access to proper BKC documentation for various models. Could
@loadams let me check with gyou on this PR status.
Sure. I updated the code to enable it to run out of the box. Thank you for your comments.
…pSpeed into configurable_autoTP. Update the latest code.
Auto TP in auto_tp.py handles Linear-type modules in emerging complex models. Three cases arise:
1) The output of some Linear modules in a model must go through an all-reduce operation after running on multiple HPU/GPU cards, but the names of those modules may differ from the ones recognized in tp_parser().
2) The weight of some Linear modules in a model CANNOT be split across multiple HPU/GPU cards.
3) The weight of some Linear modules in a model should NOT be split across multiple HPU/GPU cards, because the subsequent all-gather operation (gathering the result from all cards) would degrade performance.
In case 1) the Linear type should be changed to DeepSpeed's LinearAllreduce type; in cases 2) and 3) the modules should keep the Linear type. Configurable auto TP is proposed to handle these cases easily: tp_parser() adds the Linear modules of case 1) (their module name list is stored in the environment variable 'DS_ALL_REDUCE_LINEAR_ITEMS'), and _replace() adds the Linear modules of cases 2) and 3) (their module name list is stored in the environment variable 'DS_KEEP_LINEAR_ITEMS'). Both environment variables are configurable: they can be set directly in the environment or through a configuration file. A sketch of how such a variable could be parsed is shown below.
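As a rough illustration only (a minimal sketch, not the actual auto_tp.py code; the helper name _load_linear_items is made up for this example), a dict-style environment variable like the ones above could be parsed as follows:

import ast
import os

def _load_linear_items(env_name, model_type):
    # Read a dict-like string such as "{'w2':'mixtral'}" from the environment
    # and return the module names that apply to the given model type.
    raw = os.environ.get(env_name, "")
    if not raw:
        return []
    items = ast.literal_eval(raw)  # e.g. {'w2': 'mixtral'}
    return [name for name, model in items.items() if model == model_type]

# With the Mixtral settings shown further below, these would return ['w2'] and ['gate'].
all_reduce_linears = _load_linear_items("DS_ALL_REDUCE_LINEAR_ITEMS", "mixtral")
keep_linears = _load_linear_items("DS_KEEP_LINEAR_ITEMS", "mixtral")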
Take the Mixtral 8x7B model as an example:
We will add 'w2' to LinearAllreduce and keep 'gate' as Linear; 'o_proj' is already a default DeepSpeed LinearAllreduce layer.
Add the following to your main code:
import os
os.environ["DS_ALL_REDUCE_LINEAR_ITEMS"] = "{'w2':'mixtral'}"
os.environ["DS_KEEP_LINEAR_ITEMS"] = "{'gate':'mixtral'}"
Original Mixtral model:
Mixtral model with auto TP: