
[Question] Proper way to use multiple GPUs #2562

Open
0xLienid opened this issue Jun 10, 2024 · 10 comments

Labels
question Question about the usage

Comments

@0xLienid

❓ General Questions

What is the proper way to actually utilize multiple GPUs? When I generate the config, compile, and load the MLCEngine with multiple tensor shards, it still errors out if the model size is larger than a single GPU's memory. Also, if I check nvidia-smi, only one GPU is really being utilized.

e.g. this was run with 4 tensor shards:

[screenshot: nvidia-smi output]

0xLienid added the question label on Jun 10, 2024
@MasterJH5574
Collaborator

Hi @0xLienid, thanks for the question. There are a couple of ways to get things right:

  1. Run mlc_llm gen_config with --tensor-parallel-shards 4 and then run mlc_llm compile directly.
  2. Run mlc_llm compile with --overrides "tensor_parallel_shards=4".

If you follow either of the two ways above, you don't need to specify tensor_parallel_shards again when constructing the MLCEngine.
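
For example, here is a minimal sketch of way 1 end to end. The paths, quantization, and conv template below are illustrative placeholders rather than values from this issue; only the --tensor-parallel-shards flag and the MLCEngine construction are the point.

# Shell steps for way 1 (illustrative paths and quantization):
#   mlc_llm gen_config <path-to-model> --quantization q4f16_1 --conv-template LM \
#       --tensor-parallel-shards 4 -o ./dist/model-MLC
#   mlc_llm compile ./dist/model-MLC/mlc-chat-config.json -o ./dist/model-MLC/model.so

from mlc_llm import MLCEngine

# No tensor_parallel_shards needed here; the shard count is baked into the
# generated config and the compiled model library.
engine = MLCEngine(
    model="./dist/model-MLC",
    model_lib="./dist/model-MLC/model.so",
)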

It might be more helpful for us to triage the issue you encountered if you don't mind sharing the log printed when running mlc_llm compile or your Python script.

@0xLienid
Author

Will rerun for the logs in a bit, but these are the config generation and compilation calls that led to this GPU usage. For both, parallel_shards is set to 4. When the model is loaded, it also says it's using the multi-GPU loader.

from mlc_llm.interface.gen_config import gen_config

...

# Generate the model config with tensor parallelism enabled
# (parallel_shards is set to 4).
gen_config(
    config=config,
    model=model,
    quantization=quantization_obj,
    conv_template="LM",
    context_window_size=None,
    sliding_window_size=None,
    prefill_chunk_size=None,
    attention_sink_size=None,
    tensor_parallel_shards=parallel_shards,
    max_batch_size=1,
    output=quantization_dir,
)

from mlc_llm.interface.compile import compile as compile_mlc

...

# Compile the model library, overriding tensor_parallel_shards
# to the same value.
compile_mlc(
    config=config_file_compile,
    quantization=quantization_obj,
    model_type=model,
    target=target,
    opt=OptimizationFlags.from_str("O2"),
    build_func=build_func,
    system_lib_prefix="auto",
    output=SAVE_DIR / model_name / quantization / "compilation.so",
    overrides=ModelConfigOverride(
        context_window_size=None,
        sliding_window_size=None,
        prefill_chunk_size=None,
        attention_sink_size=None,
        max_batch_size=1,
        tensor_parallel_shards=parallel_shards,
    ),
    debug_dump=None,
)

@MasterJH5574
Collaborator

Just want to share some more pointers that may be helpful: in the compile log, the model metadata is printed out:

...
[2024-06-10 11:27:27] INFO compile.py:145: Exporting the model to TVM Unity compiler
[2024-06-10 11:27:33] INFO compile.py:151: Running optimizations using TVM Unity
[2024-06-10 11:27:33] INFO compile.py:171: Registering metadata: {'model_type': 'qwen2',
'quantization': 'q4f16_1', 'context_window_size': 32768, 'sliding_window_size': -1,
'attention_sink_size': -1, 'prefill_chunk_size': 2048, 'tensor_parallel_shards': 8,     <<<<<<<
'kv_state_kind': 'kv_cache', 'max_batch_size': 80}
[2024-06-10 11:27:36] INFO pipeline.py:52: Running TVM Relax graph-level optimizations
...

The expectation is to see 4 here (the example log above happens to show 8).
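
As a quick sanity check before compiling, you can also read the shard count back from the mlc-chat-config.json that gen_config writes into its output directory; the path below is a placeholder for that directory.

import json
from pathlib import Path

# gen_config writes mlc-chat-config.json into the directory passed as output;
# replace the path below with your actual output directory.
cfg = json.loads(Path("./dist/model-MLC/mlc-chat-config.json").read_text())
print(cfg["tensor_parallel_shards"])  # expect 4 for a 4-way sharded build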

@MasterJH5574
Collaborator

Another thing: if your local MLC was installed before Jun 7, you may need to upgrade to the latest nightly, as we fixed some related logic in #2533.

@0xLienid
Author

I will double check, but this should have been built from source as of yesterday.

@MasterJH5574
Collaborator

MasterJH5574 commented Jun 10, 2024

> when the model is loaded it also says it's using the multi gpu loader.

If it says this and prints the following log:

[xx:xx:xx] /workspace/mlc-llm/cpp/loader/multi_gpu_loader.cc:140: [Worker #0] Loading model to device: cuda:0
[xx:xx:xx] /workspace/mlc-llm/cpp/loader/multi_gpu_loader.cc:140: [Worker #1] Loading model to device: cuda:1
[xx:xx:xx] /workspace/mlc-llm/cpp/loader/multi_gpu_loader.cc:140: [Worker #2] Loading model to device: cuda:2
[xx:xx:xx] /workspace/mlc-llm/cpp/loader/multi_gpu_loader.cc:140: [Worker #3] Loading model to device: cuda:3

then there is nothing wrong with gen_config and compile, and we might want to check the model size instead. Some logs from when the model is being loaded would be much appreciated.

@MasterJH5574
Collaborator

I will double check, but this should have been built from source as of yesterday

Got it, then it should be fine as #2533 is already included.

@0xLienid
Author

> when the model is loaded it also says it's using the multi gpu loader.
>
> If it says this and prints the following log
>
> [xx:xx:xx] /workspace/mlc-llm/cpp/loader/multi_gpu_loader.cc:140: [Worker #0] Loading model to device: cuda:0
> [xx:xx:xx] /workspace/mlc-llm/cpp/loader/multi_gpu_loader.cc:140: [Worker #1] Loading model to device: cuda:1

Yes, it says this. The model is ~70 GB and I have 4 A100s, so it should fit comfortably when sharded across them. For now I've worked around this by increasing the GPU memory share so that it all fits within one GPU, but obviously that's less than ideal.
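
For reference, the back-of-the-envelope math behind "should fit comfortably" (the per-GPU capacity is my assumption, since A100s come in 40 GB and 80 GB variants):

# Weights are split roughly evenly across tensor-parallel shards;
# KV cache and activations come on top of this.
total_weight_gb = 70.0   # approximate model size mentioned above
shards = 4               # one shard per A100
print(total_weight_gb / shards)  # ~17.5 GB of weights per GPU, well under 40/80 GB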

OK, once my evals are done running I'll rerun the config and compile to get logs.

@0xLienid
Author

@MasterJH5574 this is the log:

[screenshot: compile log]

@MasterJH5574
Collaborator

Thanks for sharing! It looks pretty normal actually. What does the log look like when loading parameters? How far does the progress bar get?
