[ExecuTorch][WebGPU] linear_q4gsw test suite: Llama-1B shapes + 4k/8k sweep #20227

Open
JulianCloudNTH wants to merge 1 commit into gh/JulianCloudNTH/24/base from gh/JulianCloudNTH/24/head


@JulianCloudNTH JulianCloudNTH commented Jun 11, 2026

Stack from ghstack (oldest at bottom):

Adds the numerical test suite for et_vk.linear_q4gsw (stacked on the op diff), mirroring the SDPA test suite. A named CONFIGS sweep covers real Llama-3.2-1B linear shapes — q/o-proj (2048->2048), k/v-proj (2048->512), gate/up-proj (2048->8192), down-proj (8192->2048), lm_head (2048->128256) — plus 4k/8k large-token prefill (M=4096/8192 on the 2048->2048 and 2048->512 projections). test/ops/quantized_linear/test_quantized_linear.py exports each config's .pte + an fp64 dequant-matmul "truth" golden; test/test_webgpu_native.cpp reconstructs the deterministic ramp input bit-for-bit, runs the op on the GPU, and compares per element; scripts/test_webgpu_native_ci.sh wires the fixtures into the Dawn(Tint)+SwiftShader CI.
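
As a rough illustration of what an fp64 dequant-matmul golden involves, here is a minimal pure-Python sketch. The nibble order (low nibble first), the zero point of 8, the per-row group layout, and the function names `dequant_weight` and `golden_linear` are all assumptions made for this example; they are not taken from the PR and may not match the layout et_vk.linear_q4gsw actually uses.

```python
# Hypothetical sketch: nibble order, zero point, and names are assumptions
# for illustration, not the layout used by et_vk.linear_q4gsw.

def dequant_weight(packed, scales, n, k, group_size):
    """Unpack 4-bit codes (two per byte, low nibble first) into an N x K
    matrix of Python floats (IEEE doubles), one scale per group of K."""
    groups_per_row = k // group_size
    w = []
    for row in range(n):
        out_row = []
        for col in range(k):
            byte = packed[(row * k + col) // 2]
            code = (byte >> 4) if (col % 2) else (byte & 0x0F)
            # Symmetric quantization (assumed): codes 0..15 map to -8..7.
            q = code - 8
            scale = scales[row * groups_per_row + col // group_size]
            out_row.append(q * scale)
        w.append(out_row)
    return w


def golden_linear(x, w):
    """Reference y = x @ w^T with double-precision accumulation."""
    return [[sum(xi * wi for xi, wi in zip(x_row, w_row)) for w_row in w]
            for x_row in x]


# Tiny example: N=1, K=4, group_size=2.
w = dequant_weight(bytes([0x98, 0x07]), [0.5, 0.25], n=1, k=4, group_size=2)
y = golden_linear([[1.0, 2.0, 4.0, 0.5]], w)
```

Because the oracle accumulates in double precision, its own rounding error is negligible next to the fp32 GPU result, which is what makes it usable as a reference for a per-element comparison.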

Differential Revision: D108314849

[ghstack-poisoned]

pytorch-bot Bot commented Jun 11, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20227

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 8fba37c with merge base 5526971:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@JulianCloudNTH

@claude review

claude Bot commented Jun 11, 2026

Claude finished @JulianCloudNTH's task in 1m 32s


Code Review: PR #20227 linear_q4gsw test suite

  • Read PR diff
  • Analyze test suite (Python export + C++ native test + CI wiring)
  • Post review

Reviewed the four files. This is a clean, well-documented suite that closely mirrors the existing SDPA harness. The design is sound: a single CONFIGS table is the source of truth, the C++ kQ4gswConfigs table mirrors it, the deterministic ramp is reconstructed bit-for-bit, and the fp64 dequant-matmul "truth" golden is the right oracle. The required/heavy/self-skip + required && !dir.empty() ⇒ FAIL semantics correctly turn a silent export failure into a hard CI failure. No correctness bugs found.

A few observations, mostly minor:

1. Tolerance looser for 4k/8k variants — intentional? (worth confirming)

Per-element error in a dequant-matmul scales with the reduction dim K, not with M (number of token rows). Yet the large-M variants are 10× looser than their M=1 counterparts at the same K=2048:

{"q_proj",    1,    2048, 2048, 1e-4f, 1e-3f, ...},  // K=2048
{"q_proj_4k", 4096, 2048, 2048, 1e-3f, 1e-2f, ...},  // K=2048, 10x looser
{"q_proj_8k", 8192, 2048, 2048, 1e-3f, 1e-2f, ...},  // K=2048, 10x looser

down_proj's looser tol is well-justified (K=8192, big-K accumulation) and even has a // big-K accum comment. But for q_proj_4k/kv_proj_4k/q_proj_8k/kv_proj_8k the reduction dim is unchanged from the tight M=1 configs, so the looser bound just weakens the check unless the large-M kernel takes a different tiling/accumulation path that genuinely needs it. If it's the latter, a one-line comment (like the down_proj one) would make the intent explicit; if it's just conservatism, the M=1 tol (1e-4/1e-3) would be a stronger gate.
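
To make the size of that gap concrete, here is a small sketch of a per-element tolerance gate. It assumes the check has the common `|out - ref| <= atol + rtol * |ref|` form and that the two floats in each config row are atol followed by rtol; the actual `quant_within_tol` in the PR may combine them differently, and `within_tol` is a hypothetical name.

```python
def within_tol(out, ref, atol, rtol):
    """True when every element satisfies |out - ref| <= atol + rtol * |ref|.
    The combined form is an assumption, not read from the PR."""
    return all(abs(o - r) <= atol + rtol * abs(r) for o, r in zip(out, ref))


ref = [0.0, 1.0, -2.0]
out = [5e-4, 1.0, -2.0]  # one element is off by 5e-4 where ref == 0

tight = within_tol(out, ref, atol=1e-4, rtol=1e-3)  # M=1 bounds
loose = within_tol(out, ref, atol=1e-3, rtol=1e-2)  # 4k/8k bounds

print(tight, loose)
```

Under these assumptions, an absolute error of 5e-4 is rejected by the M=1 bounds but accepted by the 4k/8k bounds, so a regression of that size would go unnoticed on the large-M configs.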

2. test_export_delegates comment slightly misleading for lm_head

# Fusion is shape-independent, so M=1 keeps even the heavy configs cheap to check.

For lm_head the packed weight is N×K/2 bytes = 128256 × 2048 / 2 ≈ 131 MB regardless of M, so M=1 keeps the input and golden cheap, but the exported .pte and the in-memory packed weight are still large. The check is correct; the comment implies M=1 makes lm_head cheap overall, which isn't quite true. Minor wording nit.
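
The arithmetic behind that figure, as a quick sketch: it counts only the packed 4-bit codes (two per byte) and ignores scales and any .pte overhead. The (N, K) pairs come from the PR description; the dictionary keys are labels chosen for this example and `gate_up_proj` in particular is not a config name from the PR.

```python
# (N, K) per projection, from the shapes listed in the PR description.
SHAPES = {
    "q_proj": (2048, 2048),
    "kv_proj": (512, 2048),
    "gate_up_proj": (8192, 2048),  # hypothetical label
    "down_proj": (2048, 8192),
    "lm_head": (128256, 2048),
}


def packed_weight_bytes(n, k):
    """Two 4-bit codes per byte; scales and metadata are not counted."""
    return n * k // 2


sizes = {name: packed_weight_bytes(n, k) for name, (n, k) in SHAPES.items()}
for name, size in sizes.items():
    print(f"{name:>12}: {size / 1e6:8.1f} MB")
```

None of these sizes depends on M, which is why shrinking M does not shrink the lm_head fixture.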

3. sdpa_within_tol is now a special case of quant_within_tol

quant_within_tol is the parameterized version (atol/rtol as args) of the hardcoded sdpa_within_tol (1e-4/1e-3). Not a problem for this PR, but sdpa_within_tol could later be replaced by quant_within_tol(..., 1e-4f, 1e-3f, ...) to drop the duplication. Optional cleanup.

Things I verified are correct

  • Ramp bit-identity: ((i % 17) - 8) / 16 produces small integers over a power-of-two denominator, so every value is exactly representable in fp32, and the q4gsw_ramp(0) == -0.5f guard runs unconditionally. Python (np.int64 % 17 → fp32 / 16) matches C++ bit-for-bit. ✅
  • Group-size constraints: group_size=32 satisfies K%32==0 (2048, 8192) and N%8==0 (2048, 512, 8192, 128256) for every config. ✅
  • Heavy/required consistency: export_all_..._models(include_heavy=False) skips exactly the configs the C++ table marks heavy && !required, so CI gets clean self-skips (no FAIL) for lm_head/q_proj_8k/kv_proj_8k. ✅
  • Path joining: Python os.path.join('/tmp', ...) and C++ dir + "q4gsw_..." with the trailing-slash normalization agree on /tmp/q4gsw_<name>.pte. ✅
  • CI failure propagation: export || echo WARN + .pte absent + required + non-empty dir ⇒ native test FAILs, so a broken export can't pass silently. ✅
  • int indexing: largest in_numel is q_proj_8k = 8192×2048 ≈ 16.8M, well within int range. ✅
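
The ramp, group-size, and index-range items in the list above can be rechecked with a few lines of standard-library Python. This is a sketch, not the PR's code: `ramp` and `to_fp32` are names made up here, the fp32 round trip goes through `struct` rather than NumPy, and the int check assumes a 32-bit C++ `int`.

```python
import struct


def ramp(i):
    """((i % 17) - 8) / 16, the deterministic input ramp from the review."""
    return ((i % 17) - 8) / 16


def to_fp32(x):
    """Round a Python float (double) to fp32 and widen it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Each ramp value is a multiple of 1/16 in [-0.5, 0.5], so the fp32
# round trip should leave it unchanged.
ramp_is_exact = all(to_fp32(ramp(i)) == ramp(i) for i in range(17))

# (N, K) shapes from the PR description; group_size=32 as stated above.
shapes = [(2048, 2048), (512, 2048), (8192, 2048), (2048, 8192), (128256, 2048)]
shapes_ok = all(k % 32 == 0 and n % 8 == 0 for n, k in shapes)

# Largest input element count (M=8192, K=2048) vs. a 32-bit int (assumed).
max_in_numel = 8192 * 2048
fits_int32 = max_in_numel <= 2**31 - 1
```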

Nothing blocking — items 1–3 are nits/clarifications. Item 1 is the only one worth an explicit answer.
