[ExecuTorch][WebGPU] GPU timestamp query profiling for SDPA by JulianCloudNTH · Pull Request #20167 · pytorch/executorch

JulianCloudNTH · 2026-06-09T21:16:24Z

Stack from ghstack (oldest at bottom):

-> [ExecuTorch][WebGPU] GPU timestamp query profiling for SDPA #20167
[ExecuTorch][WebGPU] SDPA test suite: replay + dynamic input_pos + in-graph KV cache #20087
[ExecuTorch][WebGPU] Add fused SDPA (sdpa_with_kv_cache) with dynamic input_pos #20086
[ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation) #20201

SDPA-specific instrumentation layered on the general GPU-timestamp infrastructure (companion diff below): tag each fused SDPA dispatch with its `kernel_name` so the `WebGPUQueryPool` can attribute on-GPU time to the attention stage that produced it.

`sdpa_with_kv_cache` runs four chained dispatches: `update_cache` -> QK (`attn_weights`) -> softmax -> AV (`compute_out`). `WebGPUGraph::execute()` brackets each compute pass with a timestamp when the pool is active, and this diff labels each dispatch so the per-pass durations map back to the right stage.

Opt-in via the `WEBGPU_TIMESTAMP_QUERY` env var; off by default, so the production `execute()` path is byte-identical. This is the per-kernel hook a forthcoming SDPA kernel benchmark will read; the benchmark itself (and any comparative numbers) is a separate follow-up.
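For readers who want a concrete picture of the attribution step, here is a minimal sketch, not the code in this diff. It assumes the resolved timestamps arrive as one begin/end pair per compute pass, in submission order and already in nanoseconds. `PassDuration` and `attribute_passes` are hypothetical names introduced only for illustration; the stage labels are the four named above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical record: one labeled compute pass and its on-GPU duration.
struct PassDuration {
  std::string kernel_name;
  std::uint64_t duration_ns;
};

// Pairs each dispatch label with its (begin, end) timestamps.
// Assumed layout: timestamps_ns[2*i] is the begin-of-pass value and
// timestamps_ns[2*i + 1] the end-of-pass value for dispatch i.
std::vector<PassDuration> attribute_passes(
    const std::vector<std::string>& kernel_names,
    const std::vector<std::uint64_t>& timestamps_ns) {
  if (timestamps_ns.size() != 2 * kernel_names.size()) {
    throw std::invalid_argument("expected two timestamps per dispatch");
  }
  std::vector<PassDuration> out;
  out.reserve(kernel_names.size());
  for (std::size_t i = 0; i < kernel_names.size(); ++i) {
    const std::uint64_t begin = timestamps_ns[2 * i];
    const std::uint64_t end = timestamps_ns[2 * i + 1];
    // Report 0 for a non-monotonic pair instead of letting the
    // unsigned subtraction wrap around.
    out.push_back({kernel_names[i], end >= begin ? end - begin : 0});
  }
  assert(out.size() == kernel_names.size());
  return out;
}
```

Called with the labels `update_cache`, `attn_weights`, `softmax`, `compute_out` and eight timestamps, this yields one duration per attention stage, which is the shape of data a per-kernel benchmark would consume.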

Co-authored with Claude.
@exported-using-ghexport

Differential Revision: D107678235

[ghstack-poisoned]

pytorch-bot · 2026-06-09T21:16:27Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20167

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit ebf063e with merge base 5526971:

NEW FAILURE - The following job has failed:

pull / unittest-nxp-neutron / linux-job (gh)
RuntimeError: Command docker exec -t 667adb4a75b6c7cd4dd0a01a121b3a6c40992dac715c8b2a6cbb3c7daeeb81e2 /exec failed with exit code 1

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / test-qnn-testsuite-linux / test-backend-linux (qnn, models) / linux-job (gh) (matched linux rule in flaky-rules.json)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / android / build-android (gh) (trunk failure)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-09T21:17:12Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with `release notes:`. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Pull Request resolved: #20167

Add a faithful re-port of Vulkan's `vkapi::QueryPool` (`backends/vulkan/runtime/vk_api/QueryPool.{h,cpp}`) so a bench can read true on-GPU per-kernel time, isolated from submit/readback latency. This is the basis for comparing the WGSL SDPA kernels against the Vulkan reference. Opt-in via the `WEBGPU_TIMESTAMP_QUERY` env var; off by default, so the production `execute()` path is byte-identical.

`WebGPUQueryPool` mirrors the Vulkan `ShaderDuration` data model and the ticks->ns conversion exactly. Three deviations are forced by the WebGPU API rather than chosen:

- Per-dispatch bracketing uses a compute-pass `timestampWrites` descriptor (begin/end-of-pass), since WebGPU has no mid-encoder `writeTimestamp`.
- Results are read via `resolveQuerySet` + buffer map (no host-side `vkGetQueryPoolResults`).
- The `TimestampQuery` capability is requested as an explicit device feature (fail-open if the adapter lacks it).

`WebGPUGraph::execute()` brackets each compute pass when the pool is active; chained `update_cache`/QK/softmax/AV dispatches carry a `kernel_name` label for attribution.

Co-authored-with Claude.
ghstack-source-id: 391669549
@exported-using-ghexport
Differential Revision: [D107678235](https://our.internmc.facebook.com/intern/diff/D107678235/)
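To illustrate the opt-in gate and the ticks->ns conversion mentioned above, here is a small standalone sketch. It is not the `WebGPUQueryPool` implementation. The helper names are hypothetical, the set of env-var values treated as "enabled" is an assumption (the commit message only says the variable is opt-in), and `ns_per_tick` stands in for a device-reported period such as Vulkan's `VkPhysicalDeviceLimits::timestampPeriod`.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical gate: treat an unset, empty, or "0" value as disabled.
// Which values the real code accepts is an assumption here.
bool timestamp_query_enabled(const char* value) {
  if (value == nullptr) {
    return false;
  }
  const std::string v(value);
  return !v.empty() && v != "0";
}

// Reads the env var named in the commit message.
bool timestamp_query_enabled_from_env() {
  return timestamp_query_enabled(std::getenv("WEBGPU_TIMESTAMP_QUERY"));
}

// Converts a begin/end tick pair into a duration in nanoseconds.
// ns_per_tick is the number of nanoseconds per timestamp increment;
// pass 1.0 when the values are already expressed in nanoseconds.
std::uint64_t ticks_to_ns(
    std::uint64_t begin_ticks,
    std::uint64_t end_ticks,
    double ns_per_tick) {
  if (end_ticks < begin_ticks) {
    return 0;  // non-monotonic pair: report zero rather than wrap
  }
  const double ns =
      static_cast<double>(end_ticks - begin_ticks) * ns_per_tick;
  return static_cast<std::uint64_t>(std::llround(ns));
}
```

Keeping the gate a pure function of the env-var string makes the "off by default" behavior easy to check in isolation: with the variable unset, nothing downstream needs to allocate a query set or touch the encoder.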

[ghstack-poisoned]

Pull Request resolved: #20167

SDPA-specific instrumentation layered on the general GPU-timestamp infrastructure (companion diff below): tag each fused SDPA dispatch with its `kernel_name` so the `WebGPUQueryPool` can attribute on-GPU time to the attention stage that produced it.

`sdpa_with_kv_cache` runs four chained dispatches: `update_cache` -> QK (`attn_weights`) -> softmax -> AV (`compute_out`). `WebGPUGraph::execute()` brackets each compute pass with a timestamp when the pool is active, and this diff labels each dispatch so the per-pass durations map back to the right stage.

Opt-in via the `WEBGPU_TIMESTAMP_QUERY` env var; off by default, so the production `execute()` path is byte-identical. This is the per-kernel hook a forthcoming SDPA kernel benchmark will read; the benchmark itself (and any comparative numbers) is a separate follow-up.

Co-authored with Claude.
ghstack-source-id: 392093463
@exported-using-ghexport
Differential Revision: [D107678235](https://our.internmc.facebook.com/intern/diff/D107678235/)

[ghstack-poisoned]

Update

25c045f

[ghstack-poisoned]

JulianCloudNTH requested review from kirklandsign and larryliu0820 as code owners June 9, 2026 21:16

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 9, 2026

Update

8a99e7a

[ghstack-poisoned]

meta-codesync Bot added the meta-exported label Jun 10, 2026

Update

5beb63e

[ghstack-poisoned]

JulianCloudNTH mentioned this pull request Jun 10, 2026

[ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation) #20201

Open

JulianCloudNTH changed the title ~~[ExecuTorch][WebGPU] Add GPU timestamp-query profiling (WebGPUQueryPool)~~ [ExecuTorch][WebGPU] GPU timestamp query profiling for SDPA Jun 10, 2026

Update

efb6b7f

[ghstack-poisoned]

Update

0103656

[ghstack-poisoned]

Update

ebf063e

[ghstack-poisoned]

JulianCloudNTH wants to merge 6 commits into gh/JulianCloudNTH/21/base from gh/JulianCloudNTH/21/head
