
[ExecuTorch][BE] Split kv cache and SDPA for better code sharing #7413

Open
wants to merge 3 commits into base: gh/kimishpatel/149/base
Conversation

kimishpatel (Contributor) commented Dec 20, 2024

Stack from ghstack (oldest at bottom):

Summary:

Why?
We have coupled SDPA with the kv cache for a while. Initially this was done
because we implemented the sdpa_with_kv_cache custom op to reduce the multiple
copy overheads of the kv cache update. (This could have been done with
separate custom kv-cache-update and custom SDPA ops; recent changes have
enabled that.)
Because the SDPA module owns the kv cache, we get (a) a non-composable
implementation and (b) model definitions and components that are hard to reuse
from repos like torchtune. The upshot is that we have multiple definitions of
the same model, llama, lying around in ExecuTorch, TorchChat, and torchtune.
This diff and subsequent ones move toward decoupling the custom kv cache and
custom SDPA so that they are composable and more module-swap friendly with
torchtune's model definition.

How?
Earlier PRs decoupled the kv cache update from SDPA. So now:

  1. Decouple the SDPA nn.Module from the KV cache.
  2. Standardize the KVCache and SDPA interface: both operate on q, k, v
     tensors in [B, n_heads, seq_len, head_dim] format (see the interface
     sketch below).
  3. Step 2 introduces extra transposes when KVCache and SDPA are replaced by
     custom modules, but we will write a graph pass to undo those (see the
     pass sketch below).
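To make the intended interface concrete, here is a minimal, hypothetical sketch (module and argument names are illustrative, not the exact ExecuTorch code): KVCache and SDPA are separate nn.Modules that both consume q/k/v in [B, n_heads, seq_len, head_dim], and the attention block composes them explicitly so either one can be swapped out independently.

```python
# Hypothetical sketch of the decoupled interface (names are illustrative).
import torch
from torch import nn


class KVCache(nn.Module):
    """Owns the k/v buffers; knows nothing about attention."""

    def __init__(self, max_batch_size, n_heads, max_seq_len, head_dim, dtype=torch.float32):
        super().__init__()
        cache_shape = (max_batch_size, n_heads, max_seq_len, head_dim)
        self.register_buffer("k_cache", torch.zeros(cache_shape, dtype=dtype))
        self.register_buffer("v_cache", torch.zeros(cache_shape, dtype=dtype))

    def update(self, input_pos, k_val, v_val):
        # input_pos: 1-D tensor of positions; k_val/v_val: [B, n_heads, seq_len, head_dim]
        self.k_cache.index_copy_(2, input_pos, k_val)
        self.v_cache.index_copy_(2, input_pos, v_val)
        return self.k_cache, self.v_cache


class SDPA(nn.Module):
    """Pure attention; owns no state, so it composes with any KVCache."""

    def forward(self, q, k, v, mask=None):
        # q/k/v: [B, n_heads, seq_len, head_dim]
        return torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)


# Inside the attention block the two are composed explicitly:
#   k, v = self.kv_cache.update(input_pos, k, v)
#   out = self.sdpa(q, k, v, mask)
# so either module can be swapped (e.g. for a custom-op or quantized variant)
# without touching the other.
```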
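And a toy sketch of the kind of graph pass mentioned in step 3. It is written against torch.fx method-call nodes purely for illustration (the real pass would operate on the exported ExecuTorch graph): it erases pairs of back-to-back transpose(1, 2) calls that cancel each other out.

```python
# Toy illustration only: cancel adjacent transpose(1, 2) pairs in an fx graph.
import torch.fx as fx


def remove_redundant_transposes(gm: fx.GraphModule) -> fx.GraphModule:
    for node in list(gm.graph.nodes):
        if node.op != "call_method" or node.target != "transpose" or node.args[1:] != (1, 2):
            continue
        parent = node.args[0]
        if (
            isinstance(parent, fx.Node)
            and parent.op == "call_method"
            and parent.target == "transpose"
            and parent.args[1:] == (1, 2)
            and len(parent.users) == 1
        ):
            # transpose(1, 2) applied twice is a no-op: rewire users of the
            # second transpose to the original tensor and erase both nodes.
            node.replace_all_uses_with(parent.args[0])
            gm.graph.erase_node(node)
            gm.graph.erase_node(parent)
    gm.graph.lint()
    gm.recompile()
    return gm
```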

Test Plan:
Existing tests.
Make sure perf doesn't regress.

pytorch-bot bot commented Dec 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/7413

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures

As of commit 275144b with merge base 49cc399:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kimishpatel added a commit that referenced this pull request Dec 20, 2024
ghstack-source-id: 6356acba83a82cb7d19747187a254a735fa77d28
Pull Request resolved: #7413
@facebook-github-bot facebook-github-bot added the CLA Signed label Dec 20, 2024
This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

kimishpatel added a commit that referenced this pull request Dec 20, 2024
Summary:

+ Make all the backend-specific KVCache and SDPA implementations abide by
  the new API


ghstack-source-id: 369434c4d64e6d4500ecfea03b0fd99945b30461
Pull Request resolved: #7413
@kimishpatel kimishpatel changed the title Changes to split kv cache and sdpa [ExecuTorch][BE] Split kv cache and SDPA for better code sharing Dec 21, 2024
kimishpatel added a commit that referenced this pull request Dec 21, 2024

ghstack-source-id: 6289ce22a2c190da7e38e098ba8a5d0254d6bf9d
Pull Request resolved: #7413
@kimishpatel kimishpatel requested a review from cccclai December 21, 2024 00:21
@@ -212,6 +215,13 @@ def export(self) -> "LLMEdgeManager":

return self

def run_canonical_optimizations(self):
Comment on this function

@@ -47,20 +37,21 @@ def forward(
seqlen,
mask,
):
q = q.transpose(1, 2) # (bs, seqlen, n_local_heads, head_dim)
I just thought about it again: adding this transpose here, and also earlier in llama_transformer.py, so that we can share code for kv_cache.py (that is the reason, right?) doesn't really make sense, since we are already using a custom export-friendly KV cache anyway: https://github.com/pytorch/executorch/blob/main/extension/llm/modules/kv_cache.py#L13

Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
3 participants