Skip to content

[ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation)#20201

Merged
meta-codesync[bot] merged 4 commits into
gh/JulianCloudNTH/22/basefrom
gh/JulianCloudNTH/22/head
Jun 12, 2026
Merged

[ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation)#20201
meta-codesync[bot] merged 4 commits into
gh/JulianCloudNTH/22/basefrom
gh/JulianCloudNTH/22/head

Conversation

@JulianCloudNTH

@JulianCloudNTH JulianCloudNTH commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Stack from ghstack (oldest at bottom):

Backend-agnostic GPU-timestamp infrastructure, split out so the general implementation is foundational (below SDPA) while the SDPA-specific dispatch labeling stays above the SDPA op. Composed of: WebGPUQueryPool, a faithful re-port of Vulkan's vkapi::QueryPool (backends/vulkan/runtime/vk_api/QueryPool.{h,cpp}) — same ShaderDuration data model and ticks->ns conversion; three deviations are forced by the WebGPU API (per-dispatch bracketing via a compute-pass timestampWrites descriptor since there is no mid-encoder writeTimestamp; readback via resolveQuerySet + buffer map rather than host-side vkGetQueryPoolResults; the TimestampQuery capability requested as an explicit device feature, fail-open if the adapter lacks it). WebGPUDevice gains timestamp-feature detection, and WebGPUGraph gains a per-dispatch kernel_name label plus execute() bracketing of each compute pass when the pool is active. Opt-in via the WEBGPU_TIMESTAMP_QUERY env var; off by default, so the production execute() path is byte-identical. The SDPA per-kernel labeling lives in the companion "for SDPA" diff above the SDPA op.

Co-authored with Claude.
@exported-using-ghexport

Differential Revision: D108188287

Differential Revision: D108188287

[ghstack-poisoned]
@pytorch-bot

pytorch-bot Bot commented Jun 10, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20201

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

❌ 3 New Failures, 2 Pending

As of commit cea2305 with merge base 5526971 (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@JulianCloudNTH

Copy link
Copy Markdown
Contributor Author

@claude review

@claude

claude Bot commented Jun 10, 2026

Copy link
Copy Markdown

Claude finished @JulianCloudNTH's task in 2m 2s —— View job


Code Review

  • Read changed files
  • Analyze WebGPUQueryPool implementation
  • Analyze WebGPUDevice / WebGPUGraph wiring
  • Post review feedback

Overall this is a clean, well-scoped re-port of the Vulkan QueryPool. The env-gating (WEBGPU_TIMESTAMP_QUERY), feature detection, and lazy pool creation keep the non-bench path zero-cost, and the lifetime comment on WGPUPassTimestampWrites tw (must outlive BeginComputePass) is exactly the kind of subtlety worth calling out. Nothing blocking. A few notes below.


Correctness / robustness

1. print_results() mean uses integer division before scalingWebGPUQueryPool.cpp:188

const double mean_us = kv.second.first / kv.second.second / 1000.0;

first is uint64_t (sum of ns) and second is uint32_t (count), so first / second is integer ns division, then /1000.0. The truncation is sub-ns and harmless, but it reads as if it were float math. Consider kv.second.first / 1000.0 / kv.second.second for clarity/consistency with total_us.

2. Single global query pool is not concurrency-safeWebGPUDevice.h:29
The pool lives on the shared WebGPUContext and is reset()/recorded/extracted within execute(). If two WebGPUGraphs on the same context ever execute concurrently they'd clobber durations_/the readback buffer. Almost certainly fine given the current single-threaded execution model, but worth a one-line comment noting the assumption since the pool now outlives any single graph.

3. extract_results() prints zeros on map failure but still proceeds to print_results()WebGPUGraph.cpp:577-578
If the readback map fails, durations_ keeps execution_duration_ns = 0 and print_results() will report 0.000 rows as if they were real measurements. Since this is bench-only that's acceptable, but a clearer signal (e.g. skipping the print, or marking the row) would avoid silently-misleading output.

Design / minor

4. Heavy include pulled into a widely-used headerWebGPUDevice.h:13
WebGPUDevice.h now #includes WebGPUQueryPool.h (which drags in <webgpu/webgpu.h>, <array>, <vector>, <string>). Since the member is std::unique_ptr<WebGPUQueryPool>, a forward declaration would suffice here — but WebGPUContext is an aggregate with no out-of-line destructor, so the complete type is needed wherever it's destroyed. If you want to slim the header, move the destructor (or just leave as-is; the coupling is minor).

5. Redundant context lookupWebGPUGraph.cpp:516
build() already caches device_/instance_ from get_default_webgpu_context(); execute() calls it again to reach ctx->querypool. Fine, just noting the pool could alternatively hang off the graph. Current placement (shared across graphs, grows-only capacity) is a reasonable trade-off.

6. ns_per_tick_ is effectively a constantWebGPUQueryPool.h:81
Per the WebGPU spec, timestamp values are already in nanoseconds, so ns_per_tick_ stays 1.0 and the * ns_per_tick_ multiplications are no-ops. Kept presumably to mirror the Vulkan port's timestampPeriod — reasonable, just flagging it's dead-but-intentional.

Questions

  • The PR says unit tests for the query pool live in the companion SDPA diff above this one. Worth confirming the test for the capacity-overrun throw in reset() (WebGPUQueryPool.cpp:74) lands there, since that's the main guard with branching logic introduced here.

[ghstack-poisoned]

@SS-JIA SS-JIA left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review automatically exported from Phabricator review in Meta.

[ghstack-poisoned]
JulianCloudNTH added a commit that referenced this pull request Jun 12, 2026
…ation)

Pull Request resolved: #20201

Backend-agnostic GPU-timestamp infrastructure, split out so the general implementation is foundational (below SDPA) while the SDPA-specific dispatch labeling stays above the SDPA op. Composed of: `WebGPUQueryPool`, a faithful re-port of Vulkan's `vkapi::QueryPool` (`backends/vulkan/runtime/vk_api/QueryPool.{h,cpp}`) — same `ShaderDuration` data model and ticks->ns conversion; three deviations are forced by the WebGPU API (per-dispatch bracketing via a compute-pass `timestampWrites` descriptor since there is no mid-encoder `writeTimestamp`; readback via `resolveQuerySet` + buffer map rather than host-side `vkGetQueryPoolResults`; the `TimestampQuery` capability requested as an explicit device feature, fail-open if the adapter lacks it). `WebGPUDevice` gains timestamp-feature detection, and `WebGPUGraph` gains a per-dispatch `kernel_name` label plus `execute()` bracketing of each compute pass when the pool is active. Opt-in via the `WEBGPU_TIMESTAMP_QUERY` env var; off by default, so the production `execute()` path is byte-identical. The SDPA per-kernel labeling lives in the companion "for SDPA" diff above the SDPA op.

Co-authored with Claude.
ghstack-source-id: 392956205
@exported-using-ghexport

Differential Revision: [D108188287](https://our.internmc.facebook.com/intern/diff/D108188287/)
[ghstack-poisoned]
JulianCloudNTH added a commit that referenced this pull request Jun 12, 2026
…ation)

Pull Request resolved: #20201

Backend-agnostic GPU-timestamp infrastructure, split out so the general implementation is foundational (below SDPA) while the SDPA-specific dispatch labeling stays above the SDPA op. Composed of: `WebGPUQueryPool`, a faithful re-port of Vulkan's `vkapi::QueryPool` (`backends/vulkan/runtime/vk_api/QueryPool.{h,cpp}`) — same `ShaderDuration` data model and ticks->ns conversion; three deviations are forced by the WebGPU API (per-dispatch bracketing via a compute-pass `timestampWrites` descriptor since there is no mid-encoder `writeTimestamp`; readback via `resolveQuerySet` + buffer map rather than host-side `vkGetQueryPoolResults`; the `TimestampQuery` capability requested as an explicit device feature, fail-open if the adapter lacks it). `WebGPUDevice` gains timestamp-feature detection, and `WebGPUGraph` gains a per-dispatch `kernel_name` label plus `execute()` bracketing of each compute pass when the pool is active. Opt-in via the `WEBGPU_TIMESTAMP_QUERY` env var; off by default, so the production `execute()` path is byte-identical. The SDPA per-kernel labeling lives in the companion "for SDPA" diff above the SDPA op.

Co-authored with Claude.
ghstack-source-id: 392975889
@exported-using-ghexport

Differential Revision: [D108188287](https://our.internmc.facebook.com/intern/diff/D108188287/)
@meta-codesync meta-codesync Bot merged commit d595e01 into gh/JulianCloudNTH/22/base Jun 12, 2026
176 of 179 checks passed
@meta-codesync meta-codesync Bot deleted the gh/JulianCloudNTH/22/head branch June 12, 2026 22:10
@meta-codesync meta-codesync Bot temporarily deployed to cherry-pick-bot June 12, 2026 22:10 Inactive
JulianCloudNTH added a commit that referenced this pull request Jun 13, 2026
…ation)

Pull Request resolved: #20201

Backend-agnostic GPU-timestamp infrastructure, split out so the general implementation is foundational (below SDPA) while the SDPA-specific dispatch labeling stays above the SDPA op. Composed of: `WebGPUQueryPool`, a faithful re-port of Vulkan's `vkapi::QueryPool` (`backends/vulkan/runtime/vk_api/QueryPool.{h,cpp}`) — same `ShaderDuration` data model and ticks->ns conversion; three deviations are forced by the WebGPU API (per-dispatch bracketing via a compute-pass `timestampWrites` descriptor since there is no mid-encoder `writeTimestamp`; readback via `resolveQuerySet` + buffer map rather than host-side `vkGetQueryPoolResults`; the `TimestampQuery` capability requested as an explicit device feature, fail-open if the adapter lacks it). `WebGPUDevice` gains timestamp-feature detection, and `WebGPUGraph` gains a per-dispatch `kernel_name` label plus `execute()` bracketing of each compute pass when the pool is active. Opt-in via the `WEBGPU_TIMESTAMP_QUERY` env var; off by default, so the production `execute()` path is byte-identical. The SDPA per-kernel labeling lives in the companion "for SDPA" diff above the SDPA op.

Co-authored with Claude.
ghstack-source-id: 392975889
@exported-using-ghexport

Differential Revision: [D108188287](https://our.internmc.facebook.com/intern/diff/D108188287/)
JulianCloudNTH added a commit that referenced this pull request Jun 13, 2026
…ation) (#20258)

This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: #20201 by
@JulianCloudNTH
^ Please use this as the source of truth for the PR details, comments,
and reviews
ghstack PR base:
https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/22/base
ghstack PR head:
https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/22/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head:
https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/22/orig

@diff-train-skip-merge

Co-authored-by: Julian Ng-Thow-Hing <juliannth@meta.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants