[docs] torch.compile usage guide #11078
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
$ # running an 8B model on H100 with batch size 1, 36.39 seconds of compilation time, 7.7% improvement in latency

$ python3 benchmarks/benchmark_latency.py --model meta-llama/Meta-Llama-3-8B --batch-size 1 --load-format dummy
init engine (profile, create kv cache, warmup model) took 11.79 seconds
Avg latency: 0.9704469823899369 seconds

$ python3 benchmarks/benchmark_latency.py --model meta-llama/Meta-Llama-3-8B --batch-size 1 --load-format dummy -O "{'level': 3, 'candidate_compile_sizes': [1]}"
init engine (profile, create kv cache, warmup model) took 48.18 seconds
Avg latency: 0.8950413154981409 seconds

$ # running an 8B model on L4 with batch size 1, 66.54 seconds of compilation time, 4.1% improvement in latency

$ python3 benchmarks/benchmark_latency.py --model meta-llama/Meta-Llama-3-8B --batch-size 1 --load-format dummy
init engine (profile, create kv cache, warmup model) took 20.63 seconds
Avg latency: 7.81603614680001 seconds

$ python3 benchmarks/benchmark_latency.py --model meta-llama/Meta-Llama-3-8B --batch-size 1 --load-format dummy -O "{'level': 3, 'candidate_compile_sizes': [1]}"
init engine (profile, create kv cache, warmup model) took 87.17 seconds
Avg latency: 7.495755991366673 seconds
I also ran it with the Llama 3.2 1B model:
$ # running a 1B model on H100 with batch size 1, 21.29 seconds of compilation time, 13.7% improvement in latency
$ python3 benchmarks/benchmark_latency.py --model meta-llama/Llama-3.2-1B --batch-size 1 --load-format dummy --num-scheduler-steps 16
init engine (profile, create kv cache, warmup model) took 11.79 seconds
Avg latency: 0.2771991847005362 seconds
$ python3 benchmarks/benchmark_latency.py --model meta-llama/Llama-3.2-1B --batch-size 1 --load-format dummy --num-scheduler-steps 16 -O "{'level': 3, 'candidate_compile_sizes': [1]}"
init engine (profile, create kv cache, warmup model) took 33.08 seconds
Avg latency: 0.23920089063079406 seconds
$ # running a 1B model on L4 with batch size 1, 42.0 seconds of compilation time, 4.0% improvement in latency
$ python3 benchmarks/benchmark_latency.py --model meta-llama/Llama-3.2-1B --batch-size 1 --load-format dummy --num-scheduler-steps 16
init engine (profile, create kv cache, warmup model) took 20.32 seconds
Avg latency: 1.526933370166671 seconds
$ python3 benchmarks/benchmark_latency.py --model meta-llama/Llama-3.2-1B --batch-size 1 --load-format dummy --num-scheduler-steps 16 -O "{'level': 3, 'candidate_compile_sizes': [1]}"
init engine (profile, create kv cache, warmup model) took 62.32 seconds
Avg latency: 1.4660025673666648 seconds
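As a side note (not part of this PR's docs), the same compilation config should also be usable when serving online. The sketch below assumes the vllm serve entrypoint accepts the same -O / compilation-config engine flag as the benchmark scripts above; verify the flag against your vLLM version.

$ # Sketch only: flag support assumed to match the benchmark scripts above.
$ vllm serve meta-llama/Meta-Llama-3-8B -O "{'level': 3, 'candidate_compile_sizes': [1]}"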
- **Inductor graph compilation**: Time taken for the inductor to compile the computation graph into Triton kernels. It includes compilation for a general shape and specific shapes. Check the logs for ``Compiling a graph for general shape takes 14.77 s`` and ``Compiling a graph for shape 1 takes 13.52 s``.
- **Triton kernel compilation**: Time taken for Triton to compile the Triton kernels into GPU kernels. No specific logs are available for this part.
The inductor graph compilation time is inclusive of the triton kernel compilation time, right? Could you clarify that if so?
I don't think so. Some triton kernels might have JIT compilation that is not included in the inductor compilation time.
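(As a quick aside, not from the PR: once the engine's output has been redirected to a file, the timing lines quoted above can be pulled out with grep; the log file name below is just an example.)

$ # Example only: "vllm.log" is a hypothetical file capturing the engine's output.
$ grep "Compiling a graph" vllm.log
Compiling a graph for general shape takes 14.77 s
Compiling a graph for shape 1 takes 13.52 s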
Should we document how to enable/disable custom inductor passes?
docs/source/usage/torch_compile.rst
$ # running an 8B model on H100 with various batch sizes, 72.76 seconds of compilation time, 3.9% improvement in throughput
$
$ # 1. Run the baseline setting
$ python3 benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B --load-format dummy --num-scheduler-steps 64
init engine (profile, create kv cache, warmup model) took 14.42 seconds
Throughput: 44.39 requests/s, 22728.17 total tokens/s, 11364.08 output tokens/s

$ # 2. Run the same setting with profiling
$ VLLM_LOG_BATCHSIZE_INTERVAL=1.0 python3 benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 64
INFO 12-10 15:42:47 forward_context.py:58] Batchsize distribution (batchsize, count): [(256, 769), (232, 215), ...]

$ # 3. The most common batch sizes are 256 and 232, so we can compile the model for these two batch sizes
$ python3 benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 64 -O "{'level': 3, 'candidate_compile_sizes': [232, 256]}"
init engine (profile, create kv cache, warmup model) took 87.18 seconds
Throughput: 46.11 requests/s, 23606.51 total tokens/s, 11803.26 output tokens/s
Repeated for Llama 3.2 1B on H100: almost no improvement.
$ # running a 1B model on H100 with various batch sizes, 39.79 seconds of compilation time, 0.5% improvement in throughput
$
$ # 1. Run the baseline setting
$ python3 benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Llama-3.2-1B --load-format dummy --num-scheduler-steps 64
init engine (profile, create kv cache, warmup model) took 13.14 seconds
Throughput: 116.83 requests/s, 59814.48 total tokens/s, 29907.24 output tokens/s
$ # 2. Run the same setting with profiling
$ VLLM_LOG_BATCHSIZE_INTERVAL=1.0 python3 benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Llama-3.2-1B --num-scheduler-steps 64
INFO 12-10 15:42:47 forward_context.py:58] Batchsize distribution (batchsize, count): [(256, 769), (232, 215), ...]
$ # 3. The most common batch sizes are 256 and 232, so we can compile the model for these two batch sizes
$ python3 benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Llama-3.2-1B --num-scheduler-steps 64 -O "{'level': 3, 'candidate_compile_sizes': [232, 256]}"
init engine (profile, create kv cache, warmup model) took 52.93 seconds
Throughput: 117.38 requests/s, 60100.50 total tokens/s, 30050.25 output tokens/s
Repeated for Llama 3.2 1B on L4 (it's even slower):
$ # running a 1B model on L4 with various batch sizes, 58.5 seconds of compilation time, -6.0% improvement in throughput
$
$ # 1. Run the baseline setting
$ python3 benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Llama-3.2-1B --load-format dummy --num-scheduler-steps 64
init engine (profile, create kv cache, warmup model) took 21.77 seconds
Throughput: 16.36 requests/s, 8376.21 total tokens/s, 4188.10 output tokens/s
$ # 2. Run the same setting with profiling
$ VLLM_LOG_BATCHSIZE_INTERVAL=1.0 python3 benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Llama-3.2-1B --load-format dummy --num-scheduler-steps 64
INFO 12-10 15:42:47 forward_context.py:58] Batchsize distribution (batchsize, count): [(256, 769), (232, 215), ...]
$ # 3. The most common batch sizes are 256 and 232, so we can compile the model for these two batch sizes
$ python3 benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Llama-3.2-1B --load-format dummy --num-scheduler-steps 64 -O "{'level': 3, 'candidate_compile_sizes': [232, 256]}"
init engine (profile, create kv cache, warmup model) took 80.27 seconds
Throughput: 15.38 requests/s, 7873.07 total tokens/s, 3936.54 output tokens/s
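For an online deployment, the same profile-then-compile loop can in principle be applied. This is a sketch, not from the PR; it assumes vllm serve honors VLLM_LOG_BATCHSIZE_INTERVAL and the -O engine flag the same way the benchmark scripts do, which may vary by vLLM version.

$ # 1. Serve with batch-size logging enabled and send representative traffic
$ VLLM_LOG_BATCHSIZE_INTERVAL=1.0 vllm serve meta-llama/Meta-Llama-3-8B
$ # 2. Read the most common sizes from the "Batchsize distribution" log line, then
$ #    restart, compiling for those sizes (256 and 232 in the example above)
$ vllm serve meta-llama/Meta-Llama-3-8B -O "{'level': 3, 'candidate_compile_sizes': [232, 256]}"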
Not right now; not until we have some passes with significant perf gain.
Can you update the Compatibility Matrix with this feature as well?
It's quite complicated, especially for vision language models. Maybe I can just list all the models not supporting torch.compile.
I think mainly cross-attention models do not support torch.compile.
The following models are currently not supported by ``torch.compile``, because their computation graphs are too dynamic to compile:

- ``InternLM2VEForCausalLM``, ``InternVLChatModel``
- cross-attention models like ``MllamaForConditionalGeneration`` and ``BartForConditionalGeneration``

The following models should be supported by ``torch.compile`` in the future, but are not supported yet due to bandwidth limitations:

- ``Mamba``-related models
- ``ChameleonModel``, ``ChatGLMModel``, ``DbrxModel``, ``DeepseekModel``, ``MixtralModel``, ``Olmo2Model``, ``Phi3SmallModel``, ``StableLMEpochModel``
@DarkLight1337 I checked all the models, and this should be the list of unsupported models. I will also update the add models page to show how to add support for new models shortly.
To effectively use ``torch.compile``, the TL;DR is:

- Ensure GPUs are busy executing the model before enabling ``torch.compile``.
Can you clarify this statement? I think I know what it is trying to say but I think it is a bit ambiguous.
do you have better ideas?
Maybe something like "torch.compile works best for models that are GPU bound"?
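One rough way to check whether a workload is actually GPU bound before turning on torch.compile is to watch GPU utilization while a representative benchmark runs; this is a generic sketch, not part of the PR.

$ # Sample GPU utilization once per second; consistently high utilization while the
$ # model runs suggests the workload is GPU bound and may benefit from compilation.
$ nvidia-smi --query-gpu=utilization.gpu --format=csv -l 1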