[ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation) by JulianCloudNTH · Pull Request #20201 · pytorch/executorch

JulianCloudNTH · 2026-06-10T20:19:24Z

Stack from ghstack (oldest at bottom):

-> [ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation) #20201

Backend-agnostic GPU-timestamp infrastructure, split out so the general implementation is foundational (below SDPA) while the SDPA-specific dispatch labeling stays above the SDPA op. Composed of: WebGPUQueryPool, a faithful re-port of Vulkan's vkapi::QueryPool (backends/vulkan/runtime/vk_api/QueryPool.{h,cpp}) — same ShaderDuration data model and ticks->ns conversion; three deviations are forced by the WebGPU API (per-dispatch bracketing via a compute-pass timestampWrites descriptor since there is no mid-encoder writeTimestamp; readback via resolveQuerySet + buffer map rather than host-side vkGetQueryPoolResults; the TimestampQuery capability requested as an explicit device feature, fail-open if the adapter lacks it). WebGPUDevice gains timestamp-feature detection, and WebGPUGraph gains a per-dispatch kernel_name label plus execute() bracketing of each compute pass when the pool is active. Opt-in via the WEBGPU_TIMESTAMP_QUERY env var; off by default, so the production execute() path is byte-identical. The SDPA per-kernel labeling lives in the companion "for SDPA" diff above the SDPA op.

Co-authored with Claude.
@exported-using-ghexport

Differential Revision: D108188287

[ghstack-poisoned]

pytorch-bot · 2026-06-10T20:19:28Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20201

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

[ROCm] MI350 CI jobs will have longer queue times due to CI migration

❌ 3 New Failures, 2 Pending

As of commit cea2305 with merge base 5526971 ():

NEW FAILURES - The following jobs have failed:

pull / android / build-android (gh)
pull / test-arm-backend-no-driver (test_pytest_ops_tosa) / linux-job (gh)
RuntimeError: Command docker exec -t c68f29de397acb68d20354290bab18ca65a4fdcf8060f6d42dbe30c5257ec025 /exec failed with exit code 1
pull / test-multimodal-linux (gemma3-4b) / linux-job (gh)
RuntimeError: Command docker exec -t 74af7b720c186814bc15bcf8a3b9e3676be0f0a876e9acde03548b1c06087dec /exec failed with exit code 139

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-10T20:20:24Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

JulianCloudNTH · 2026-06-10T20:47:09Z

@claude review

claude · 2026-06-10T20:47:54Z

Claude finished @JulianCloudNTH's task in 2m 2s —— View job

Code Review

Read changed files
Analyze WebGPUQueryPool implementation
Analyze WebGPUDevice / WebGPUGraph wiring
Post review feedback

Overall this is a clean, well-scoped re-port of the Vulkan QueryPool. The env-gating (WEBGPU_TIMESTAMP_QUERY), feature detection, and lazy pool creation keep the non-bench path zero-cost, and the lifetime comment on WGPUPassTimestampWrites tw (must outlive BeginComputePass) is exactly the kind of subtlety worth calling out. Nothing blocking. A few notes below.

Correctness / robustness

1. print_results() mean uses integer division before scaling — WebGPUQueryPool.cpp:188

const double mean_us = kv.second.first / kv.second.second / 1000.0;

first is uint64_t (sum of ns) and second is uint32_t (count), so first / second is integer ns division, then /1000.0. The truncation is sub-ns and harmless, but it reads as if it were float math. Consider kv.second.first / 1000.0 / kv.second.second for clarity/consistency with total_us.

2. Single global query pool is not concurrency-safe — WebGPUDevice.h:29
The pool lives on the shared WebGPUContext and is reset()/recorded/extracted within execute(). If two WebGPUGraphs on the same context ever execute concurrently they'd clobber durations_/the readback buffer. Almost certainly fine given the current single-threaded execution model, but worth a one-line comment noting the assumption since the pool now outlives any single graph.

3. extract_results() prints zeros on map failure but still proceeds to print_results() — WebGPUGraph.cpp:577-578
If the readback map fails, durations_ keeps execution_duration_ns = 0 and print_results() will report 0.000 rows as if they were real measurements. Since this is bench-only that's acceptable, but a clearer signal (e.g. skipping the print, or marking the row) would avoid silently-misleading output.

Design / minor

4. Heavy include pulled into a widely-used header — WebGPUDevice.h:13
WebGPUDevice.h now #includes WebGPUQueryPool.h (which drags in <webgpu/webgpu.h>, <array>, <vector>, <string>). Since the member is std::unique_ptr<WebGPUQueryPool>, a forward declaration would suffice here — but WebGPUContext is an aggregate with no out-of-line destructor, so the complete type is needed wherever it's destroyed. If you want to slim the header, move the destructor (or just leave as-is; the coupling is minor).

5. Redundant context lookup — WebGPUGraph.cpp:516
build() already caches device_/instance_ from get_default_webgpu_context(); execute() calls it again to reach ctx->querypool. Fine, just noting the pool could alternatively hang off the graph. Current placement (shared across graphs, grows-only capacity) is a reasonable trade-off.

6. ns_per_tick_ is effectively a constant — WebGPUQueryPool.h:81
Per the WebGPU spec, timestamp values are already in nanoseconds, so ns_per_tick_ stays 1.0 and the * ns_per_tick_ multiplications are no-ops. Kept presumably to mirror the Vulkan port's timestampPeriod — reasonable, just flagging it's dead-but-intentional.

Questions

The PR says unit tests for the query pool live in the companion SDPA diff above this one. Worth confirming the test for the capacity-overrun throw in reset() (WebGPUQueryPool.cpp:74) lands there, since that's the main guard with branching logic introduced here.

[ghstack-poisoned]

SS-JIA

Review automatically exported from Phabricator review in Meta.

[ghstack-poisoned]

…ation) Pull Request resolved: #20201 Backend-agnostic GPU-timestamp infrastructure, split out so the general implementation is foundational (below SDPA) while the SDPA-specific dispatch labeling stays above the SDPA op. Composed of: `WebGPUQueryPool`, a faithful re-port of Vulkan's `vkapi::QueryPool` (`backends/vulkan/runtime/vk_api/QueryPool.{h,cpp}`) — same `ShaderDuration` data model and ticks->ns conversion; three deviations are forced by the WebGPU API (per-dispatch bracketing via a compute-pass `timestampWrites` descriptor since there is no mid-encoder `writeTimestamp`; readback via `resolveQuerySet` + buffer map rather than host-side `vkGetQueryPoolResults`; the `TimestampQuery` capability requested as an explicit device feature, fail-open if the adapter lacks it). `WebGPUDevice` gains timestamp-feature detection, and `WebGPUGraph` gains a per-dispatch `kernel_name` label plus `execute()` bracketing of each compute pass when the pool is active. Opt-in via the `WEBGPU_TIMESTAMP_QUERY` env var; off by default, so the production `execute()` path is byte-identical. The SDPA per-kernel labeling lives in the companion "for SDPA" diff above the SDPA op. Co-authored with Claude. ghstack-source-id: 392956205 @exported-using-ghexport Differential Revision: [D108188287](https://our.internmc.facebook.com/intern/diff/D108188287/)

[ghstack-poisoned]

…ation) Pull Request resolved: #20201 Backend-agnostic GPU-timestamp infrastructure, split out so the general implementation is foundational (below SDPA) while the SDPA-specific dispatch labeling stays above the SDPA op. Composed of: `WebGPUQueryPool`, a faithful re-port of Vulkan's `vkapi::QueryPool` (`backends/vulkan/runtime/vk_api/QueryPool.{h,cpp}`) — same `ShaderDuration` data model and ticks->ns conversion; three deviations are forced by the WebGPU API (per-dispatch bracketing via a compute-pass `timestampWrites` descriptor since there is no mid-encoder `writeTimestamp`; readback via `resolveQuerySet` + buffer map rather than host-side `vkGetQueryPoolResults`; the `TimestampQuery` capability requested as an explicit device feature, fail-open if the adapter lacks it). `WebGPUDevice` gains timestamp-feature detection, and `WebGPUGraph` gains a per-dispatch `kernel_name` label plus `execute()` bracketing of each compute pass when the pool is active. Opt-in via the `WEBGPU_TIMESTAMP_QUERY` env var; off by default, so the production `execute()` path is byte-identical. The SDPA per-kernel labeling lives in the companion "for SDPA" diff above the SDPA op. Co-authored with Claude. ghstack-source-id: 392975889 @exported-using-ghexport Differential Revision: [D108188287](https://our.internmc.facebook.com/intern/diff/D108188287/)

@JulianCloudNTH

…ation) (#20258) This PR was created by the merge bot to help merge the original PR into the main branch. ghstack PR number: #20201 by @JulianCloudNTH ^ Please use this as the source of truth for the PR details, comments, and reviews ghstack PR base: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/22/base ghstack PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/22/head Merge bot PR base: https://github.com/pytorch/executorch/tree/main Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/22/orig @diff-train-skip-merge Co-authored-by: Julian Ng-Thow-Hing <juliannth@meta.com>

Update

772fde7

[ghstack-poisoned]

JulianCloudNTH requested review from kirklandsign and larryliu0820 as code owners June 10, 2026 20:19

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 10, 2026

Update

dbcc312

[ghstack-poisoned]

meta-codesync Bot added the meta-exported label Jun 10, 2026

This was referenced Jun 11, 2026

[ExecuTorch][WebGPU] Add 4-bit weight-only quantized linear (et_vk.linear_q4gsw) #20226

Merged

[ExecuTorch][WebGPU] linear_q4gsw test suite: Llama-1B shapes + 4k/8k sweep #20227

Merged

psiddh approved these changes Jun 12, 2026

View reviewed changes

SS-JIA approved these changes Jun 12, 2026

View reviewed changes

Update

66e8fa4

[ghstack-poisoned]

Update

cea2305

[ghstack-poisoned]

meta-codesync Bot merged commit d595e01 into gh/JulianCloudNTH/22/base Jun 12, 2026
176 of 179 checks passed

meta-codesync Bot deleted the gh/JulianCloudNTH/22/head branch June 12, 2026 22:10

meta-codesync Bot temporarily deployed to cherry-pick-bot June 12, 2026 22:10 Inactive

pytorchbot mentioned this pull request Jun 12, 2026

[ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation) #20258

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation)#20201

[ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation)#20201
meta-codesync[bot] merged 4 commits into
gh/JulianCloudNTH/22/basefrom
gh/JulianCloudNTH/22/head

JulianCloudNTH commented Jun 10, 2026 •

edited

Loading

Uh oh!

pytorch-bot Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 10, 2026

Uh oh!

JulianCloudNTH commented Jun 10, 2026

Uh oh!

claude Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

SS-JIA left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

JulianCloudNTH commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20201

❗ 1 Active SEVs

❌ 3 New Failures, 2 Pending

Uh oh!

github-actions Bot commented Jun 10, 2026

This PR needs a release notes: label

Uh oh!

JulianCloudNTH commented Jun 10, 2026

Uh oh!

claude Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

Correctness / robustness

Design / minor

Questions

Uh oh!

SS-JIA left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

JulianCloudNTH commented Jun 10, 2026 •

edited

Loading

pytorch-bot Bot commented Jun 10, 2026 •

edited

Loading

This PR needs a `release notes:` label

claude Bot commented Jun 10, 2026 •

edited

Loading