[ExecuTorch][WebGPU] linear_q4gsw test suite: Llama-1B shapes + 4k/8k sweep by JulianCloudNTH · Pull Request #20227 · pytorch/executorch

JulianCloudNTH · 2026-06-11T23:41:48Z

Stack from ghstack (oldest at bottom):

[ExecuTorch][WebGPU] GPU timestamp query profiling for SDPA #20167
-> [ExecuTorch][WebGPU] linear_q4gsw test suite: Llama-1B shapes + 4k/8k sweep #20227
[ExecuTorch][WebGPU] Add 4-bit weight-only quantized linear (et_vk.linear_q4gsw) #20226
[ExecuTorch][WebGPU] SDPA test suite: replay + dynamic input_pos + in-graph KV cache #20087
[ExecuTorch][WebGPU] Add fused SDPA (sdpa_with_kv_cache) with dynamic input_pos #20086
[ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation) #20201

Adds the numerical test suite for et_vk.linear_q4gsw (stacked on the op diff), mirroring the SDPA test suite. A named CONFIGS sweep covers real Llama-3.2-1B linear shapes — q/o-proj (2048->2048), k/v-proj (2048->512), gate/up-proj (2048->8192), down-proj (8192->2048), lm_head (2048->128256) — plus 4k/8k large-token prefill (M=4096/8192 on the 2048->2048 and 2048->512 projections). test/ops/quantized_linear/test_quantized_linear.py exports each config's .pte + an fp64 dequant-matmul "truth" golden; test/test_webgpu_native.cpp reconstructs the deterministic ramp input bit-for-bit, runs the op on the GPU, and compares per element; scripts/test_webgpu_native_ci.sh wires the fixtures into the Dawn(Tint)+SwiftShader CI.
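As a rough illustration of what an fp64 dequant-matmul golden computes, here is a minimal NumPy sketch. It is not the code in test_quantized_linear.py: the unpacked (N, K) layout of 4-bit codes, the implicit zero point of 8, and the names `dequant_q4_groupwise` and `golden_linear` are all assumptions made for the example.

```python
import numpy as np


def dequant_q4_groupwise(q, scales, group_size=32):
    """Dequantize 4-bit codes to fp64 (hypothetical helper).

    q:      (N, K) integer array of codes in [0, 15] (assumed already unpacked)
    scales: (N, K // group_size) per-group scales
    Assumes symmetric quantization with an implicit zero point of 8.
    """
    n, k = q.shape
    assert k % group_size == 0, "K must be a multiple of group_size"
    centered = q.astype(np.float64) - 8.0
    # Broadcast each group's scale across its group_size columns.
    expanded = np.repeat(scales.astype(np.float64), group_size, axis=1)
    return centered * expanded


def golden_linear(x, q, scales, group_size=32):
    """fp64 reference for y = x @ W^T, with W dequantized from 4-bit codes."""
    w = dequant_q4_groupwise(q, scales, group_size)
    return x.astype(np.float64) @ w.T
```

Doing both the dequantization and the accumulation in fp64 keeps the reference's own rounding error far below the fp32 tolerances being tested.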

Differential Revision: D108314849

[ghstack-poisoned]

pytorch-bot · 2026-06-11T23:41:52Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20227

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 8fba37c with merge base 5526971:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / android / build-android (gh) (trunk failure)
Process completed with exit code 1.
pull / test-qnn-testsuite-linux / test-backend-linux (qnn, models) / linux-job (gh) (trunk failure)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-11T23:42:52Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

JulianCloudNTH · 2026-06-11T23:43:11Z

@claude review

claude · 2026-06-11T23:43:58Z

Claude finished @JulianCloudNTH's task in 1m 32s

Code Review: PR #20227 — `linear_q4gsw` test suite

Read PR diff
Analyze test suite (Python export + C++ native test + CI wiring)
Post review

Reviewed the four files. This is a clean, well-documented suite that closely mirrors the existing SDPA harness. The design is sound: a single CONFIGS table is the source of truth, the C++ kQ4gswConfigs table mirrors it, the deterministic ramp is reconstructed bit-for-bit, and the fp64 dequant-matmul "truth" golden is the right oracle. The required/heavy/self-skip semantics, together with the rule required && !dir.empty() ⇒ FAIL, correctly turn a silent export failure into a hard CI failure. No correctness bugs found.
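The missing-fixture rule described above can be sketched as a small decision function. This is a Python paraphrase of the semantics as stated in this review, not the logic in test_webgpu_native.cpp; the function name and the string outcomes are hypothetical.

```python
def fixture_outcome(required: bool, fixture_dir: str, pte_exists: bool) -> str:
    """Decide what a native test does for one config (hypothetical sketch).

    - fixture present: run the numerical comparison
    - fixture absent, config required, and a fixture dir was supplied: FAIL
    - otherwise: self-skip
    """
    if pte_exists:
        return "RUN"
    if required and fixture_dir != "":
        return "FAIL"
    return "SKIP"
```

Under this rule, a required config whose export failed upstream shows up as a failure rather than a skip, while heavy, non-required configs that were never exported skip cleanly.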

A few observations, mostly minor:

1. Tolerance looser for 4k/8k variants — intentional? (worth confirming)

Per-element error in a dequant-matmul scales with the reduction dim K, not with M (number of token rows). Yet the large-M variants are 10× looser than their M=1 counterparts at the same K=2048:

{"q_proj",    1,    2048, 2048, 1e-4f, 1e-3f, ...},  // K=2048
{"q_proj_4k", 4096, 2048, 2048, 1e-3f, 1e-2f, ...},  // K=2048, 10x looser
{"q_proj_8k", 8192, 2048, 2048, 1e-3f, 1e-2f, ...},  // K=2048, 10x looser

down_proj's looser tol is well-justified (K=8192, big-K accumulation) and even has a // big-K accum comment. But for q_proj_4k/kv_proj_4k/q_proj_8k/kv_proj_8k the reduction dim is unchanged from the tight M=1 configs, so the looser bound just weakens the check unless the large-M kernel takes a different tiling/accumulation path that genuinely needs it. If it's the latter, a one-line comment (like the down_proj one) would make the intent explicit; if it's just conservatism, the M=1 tol (1e-4/1e-3) would be a stronger gate.
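To make the effect of the 10× looser bound concrete, here is a small sketch of a per-element tolerance check. It assumes the common combined form |out - ref| <= atol + rtol * |ref|; whether quant_within_tol uses exactly this form is not shown in this thread, and the name `within_tol` is made up for the example.

```python
import numpy as np


def within_tol(out, ref, atol, rtol):
    """True if every element satisfies |out - ref| <= atol + rtol * |ref|.

    Assumed form of the check; NaNs in either input make it return False.
    """
    out = np.asarray(out, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    err = np.abs(out - ref)
    bound = atol + rtol * np.abs(ref)
    return bool(np.all(err <= bound))


# An absolute error of about 5e-3 on a reference value of 1.0:
out, ref = [1.005], [1.0]
loose_ok = within_tol(out, ref, atol=1e-3, rtol=1e-2)  # 4k/8k tolerances
tight_ok = within_tol(out, ref, atol=1e-4, rtol=1e-3)  # M=1 tolerances
```

With these numbers the loose bound is 0.011 and the tight bound is 0.0011, so the same error passes the 4k/8k gate and fails the M=1 gate.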

2. `test_export_delegates` comment slightly misleading for `lm_head`

# Fusion is shape-independent, so M=1 keeps even the heavy configs cheap to check.

For lm_head the packed 4-bit weight occupies N×K/2 = 128256×2048/2 bytes ≈ 131 MB regardless of M, so M=1 keeps the input and golden cheap, but the exported .pte and the in-memory packed weight are still large. The check is correct; the comment implies M=1 makes lm_head cheap overall, which isn't quite true. Minor wording nit.
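The size arithmetic can be checked in a few lines. This sketch counts only the packed 4-bit weight (two codes per byte) and the fp32 input and output at M=1; scales and .pte container overhead are deliberately left out, and the helper name is hypothetical.

```python
def packed_q4_weight_bytes(n: int, k: int) -> int:
    """Bytes for an N x K weight stored as 4-bit codes, two codes per byte."""
    return n * k // 2


N, K, M = 128256, 2048, 1  # lm_head shape from the config sweep, single token

weight_bytes = packed_q4_weight_bytes(N, K)  # 131334144, about 131 MB
input_bytes = M * K * 4                      # fp32 input: 8192 bytes
output_bytes = M * N * 4                     # fp32 output: 513024 bytes
```

The weight term has no M in it, which is why shrinking M only helps the input and golden sizes.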

3. `sdpa_within_tol` is now a special case of `quant_within_tol`

quant_within_tol is the parameterized version (atol/rtol as args) of the hardcoded sdpa_within_tol (1e-4/1e-3). Not a problem for this PR, but sdpa_within_tol could later be replaced by quant_within_tol(..., 1e-4f, 1e-3f, ...) to drop the duplication. Optional cleanup.

Things I verified are correct

Ramp bit-identity: ((i % 17) - 8) / 16 produces small integers over a power-of-two denominator — exactly representable in fp32, and the q4gsw_ramp(0) == -0.5f guard runs unconditionally. Python (np.int64 → % 17 → fp32 /16) matches C++ bit-for-bit. ✅
Group-size constraints: group_size=32 satisfies K%32==0 (2048, 8192) and N%8==0 (2048, 512, 8192, 128256) for every config. ✅
Heavy/required consistency: export_all_..._models(include_heavy=False) skips exactly the configs the C++ table marks heavy && !required, so CI gets clean self-skips (no FAIL) for lm_head/q_proj_8k/kv_proj_8k. ✅
Path joining: Python os.path.join('/tmp', ...) and C++ dir + "q4gsw_..." with the trailing-slash normalization agree on /tmp/q4gsw_<name>.pte. ✅
CI failure propagation: export || echo WARN + .pte absent + required + non-empty dir ⇒ native test FAILs, so a broken export can't pass silently. ✅
int indexing: largest in_numel is q_proj_8k = 8192×2048 ≈ 16.8M, well within int range. ✅
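The ramp bit-identity argument can be reproduced with a short NumPy sketch. It follows the formula quoted above, ((i % 17) - 8) / 16, computed as int64, then cast to fp32 and divided by 16; the Python function here borrows the name q4gsw_ramp from the C++ guard but is an illustrative reconstruction, not the suite's code.

```python
import numpy as np


def q4gsw_ramp(numel: int) -> np.ndarray:
    """Deterministic ramp input: ((i % 17) - 8) / 16 as fp32."""
    i = np.arange(numel, dtype=np.int64)
    # (i % 17) - 8 is an integer in [-8, 8]; dividing by 16 (a power of two)
    # yields values that fp32 represents exactly, so no rounding occurs.
    return ((i % 17) - 8).astype(np.float32) / np.float32(16.0)


ramp = q4gsw_ramp(8192 * 2048)  # largest input in the sweep (q_proj_8k)

# The same values computed in fp64 match exactly, confirming no fp32 rounding.
i64 = np.arange(ramp.size, dtype=np.int64)
exact = ((i64 % 17) - 8) / 16.0
```

Because every value is a small integer over a power-of-two denominator, any implementation that evaluates the same integer expression gets the same fp32 bits.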

Nothing blocking — items 1–3 are nits/clarifications. Item 1 is the only one worth an explicit answer.
Branch: gh/JulianCloudNTH/24/head

Update

8fba37c

[ghstack-poisoned]

meta-cla Bot added the CLA Signed label Jun 11, 2026

JulianCloudNTH wants to merge 1 commit into gh/JulianCloudNTH/24/base from gh/JulianCloudNTH/24/head
