
Attention sink support for LLM runner#18753

Open
kirklandsign wants to merge 1 commit into main from export-D99900289

Conversation

@kirklandsign
Contributor

Summary:
Rewrite the Attention Sink KV cache implementation from an eviction-based approach to a ring buffer, for torch.export compatibility.

Key changes:

  • Ring buffer KV cache: Replace dynamic eviction (torch.cat, narrow, shift) with fixed-size ring buffer using index_copy_. Cache layout: [sink slots | ring buffer slots]. Sink tokens (e.g., BOS) occupy fixed positions; window tokens wrap around in the ring buffer region.
  • Remove eviction_batch_size: No longer needed -- ring buffer overwrites old entries automatically. Removed from all interfaces (attention_sink.py, model.py, llm_config.py, yaml config).
  • Remove attention_sink_forward: No more monkey-patching AttentionMHA.forward. Instead, KVCacheWithAttentionSink sets is_ring_buffer=True, and AttentionMHA.forward handles ring buffer models natively (skip start_pos bounds check, compute mask after KV update).
  • Remove rerotate_k / position shifting: Ring buffer uses original positions for RoPE -- no re-rotation needed.
  • Fix C++ runner: Remove TEMPORARY max_new_tokens hack. Add max_seq_len prefill check. Make context length check conditional for sliding window models.
  • Rewrite tests: Replace 16 eviction-based tests with 18 ring buffer tests covering sink preservation, ring wrapping, causal masking, and degenerate (sink_size=0) cases.
  • Add llama_attention_sink.yaml: Example config for attention sink export.
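The "[sink slots | ring buffer slots]" layout described above can be sketched in plain Python. This is a hypothetical standalone helper, not the PR's code; the actual implementation expresses the same mapping with torch.where and index_copy_ so it stays torch.export-friendly.

```python
def ring_slot(pos: int, sink_size: int, ring_size: int) -> int:
    """Map an absolute token position to its cache slot.

    Positions < sink_size keep a fixed slot (sink tokens such as BOS are
    never overwritten); later positions wrap around inside the ring-buffer
    region [sink_size, sink_size + ring_size).
    """
    if pos < sink_size:
        return pos
    return sink_size + (pos - sink_size) % ring_size

# With 2 sink slots and a ring of 4, position 6 overwrites the slot that
# position 2 used earlier, while slots 0 and 1 stay pinned:
print([ring_slot(p, sink_size=2, ring_size=4) for p in range(10)])
# -> [0, 1, 2, 3, 4, 5, 2, 3, 4, 5]
```

Because old entries are simply overwritten in place, there is no eviction step to schedule, which is why eviction_batch_size disappears from the interfaces.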

Differential Revision: D99900289

Copilot AI review requested due to automatic review settings April 7, 2026 23:51
@pytorch-bot

pytorch-bot bot commented Apr 7, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18753

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 2 Unrelated Failures

As of commit 8e7877f with merge base e109ac8:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label on Apr 7, 2026
@meta-codesync
Contributor

meta-codesync bot commented Apr 7, 2026

@kirklandsign has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99900289.

@github-actions

github-actions bot commented Apr 7, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Contributor

Copilot AI left a comment


Pull request overview

Refactors the Attention Sink KV-cache implementation from eviction/shift logic to a fixed-size ring buffer to improve torch.export compatibility, and updates the runtime and tests accordingly.

Changes:

  • Implement ring-buffer-based Attention Sink KV cache + cache position management and integrate with AttentionMHA.forward via an is_ring_buffer mode.
  • Update runner/config parsing to remove eviction_batch_size and adjust generation/prefill constraints for sliding-window models.
  • Rewrite Attention Sink tests and add an example YAML config for export.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Summary per file:

  • extension/llm/runner/text_llm_runner.cpp: Adds max_seq_len prefill-chunk validation and adjusts max_new_tokens budgeting for sliding-window/ring-buffer models.
  • extension/llm/export/config/llm_config.py: Updates use_attention_sink validation to the new 2-field format.
  • examples/models/llama/source_transformation/attention_sink.py: Replaces eviction-based Attention Sink with ring-buffer KV cache, adds sink-aware causal mask and position manager.
  • examples/models/llama/attention.py: Updates AttentionMHA.forward to natively support ring-buffer caches by deferring mask creation until after KV update.
  • examples/models/llama/source_transformation/custom_kv_cache.py: Prevents converting KVCacheWithAttentionSink into CustomKVCache.
  • examples/models/llama/source_transformation/test_attention_sink.py: Rewrites tests to cover ring-buffer behavior, sink preservation, wrapping, and masking.
  • examples/models/llama/model.py: Updates parsing/validation of use_attention_sink and enforces incompatibility with use_sdpa_with_kv_cache.
  • examples/models/llama/config/test_llm_config.py: Updates config validation tests for the new Attention Sink format.
  • examples/models/llama/config/llama_attention_sink.yaml: Adds an example config for Attention Sink export.
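The "sink-aware causal mask" mentioned for attention_sink.py can be illustrated with a hypothetical helper: once the ring wraps, slot order no longer implies causal order, so visibility must be decided from the original token position each slot holds (the PR tracks these in cache_positions, with -1 marking empty slots). This sketch uses assumed names and plain Python, not the PR's actual mask construction.

```python
def visible_slots(cache_positions: list[int], q: int) -> list[bool]:
    """Causal visibility over a ring-buffer KV cache.

    cache_positions[i] is the original token position stored in slot i,
    or -1 if the slot is empty. A query at position q may attend to a
    slot only if it holds a real position no later than q -- the slot
    index itself carries no causal meaning once the ring has wrapped.
    """
    return [p != -1 and p <= q for p in cache_positions]

# Layout [sink | ring]: sink slot holds pos 0; ring slots hold positions
# 5, 2, 3, 4 after wrapping. A query at position 5 sees all of them:
print(visible_slots([0, 5, 2, 3, 4], q=5))
# -> [True, True, True, True, True]
# Earlier query at position 3 must not see position 5 or the empty slot:
print(visible_slots([0, 5, 2, 3, -1], q=3))
# -> [True, False, True, True, False]
```

This is also why the PR computes the mask after the KV update: the mask depends on which positions the update just wrote into the ring.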


Comment on lines 219 to 224
  if self.use_attention_sink:
      attention_sink_params = self.use_attention_sink.split(",")
-     if len(attention_sink_params) != 3:
+     if len(attention_sink_params) < 2:
          raise ValueError(
-             "The value of use_attention_sink must be structured like '<sink_size>,<window_size>,<batch_eviction_size>'"
+             "The value of use_attention_sink must be structured like '<sink_size>,<window_size>'"
          )

Copilot AI Apr 7, 2026


ModelConfig._validate_attention_sink currently allows 3+ comma-separated values (len < 2). Downstream code (e.g., examples/models/llama/model.py) now asserts exactly 2 values, so configs like "4,2048,1024" will pass validation here but fail later at runtime. Consider validating len(attention_sink_params) == 2 (and optionally validating both parse as ints) to keep configuration errors localized and consistent.
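A stricter validation along the lines Copilot suggests could look like the following. This is a hypothetical standalone helper (parse_attention_sink is an assumed name, not the PR's API), shown to make the "exactly 2 values, both ints" constraint concrete.

```python
def parse_attention_sink(value: str) -> tuple[int, int]:
    """Parse '<sink_size>,<window_size>', rejecting anything else.

    Validating the exact field count and int-parsing both fields here
    keeps configuration errors localized, instead of letting a stale
    3-field config like '4,2048,1024' fail later downstream.
    """
    parts = value.split(",")
    if len(parts) != 2:
        raise ValueError(
            "use_attention_sink must be structured like '<sink_size>,<window_size>'"
        )
    try:
        sink_size, window_size = (int(p) for p in parts)
    except ValueError:
        raise ValueError("use_attention_sink fields must be integers") from None
    return sink_size, window_size

print(parse_attention_sink("4,2048"))  # -> (4, 2048)
# parse_attention_sink("4,2048,1024") now raises ValueError at config
# time rather than tripping an assertion downstream in model.py.
```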

Comment on lines +137 to +156
    start_pos = input_pos[0].item()
    torch._check_is_size(start_pos)

    orig_indices = torch.arange(seq_len, dtype=torch.long) + start_pos

    # Sink tokens go to fixed slots; window tokens use ring buffer
    indices = torch.where(
        orig_indices < self.sink_size,
        orig_indices,
        self.sink_size + (orig_indices - self.sink_size) % self.ring_size,
    )

    # Update cache_positions exactly like original CachePositionsManager
    full_t = torch.full((self.max_context_length,), -1, dtype=torch.long)
    arange_tensor = torch.arange(self.max_context_length, dtype=torch.long)
    cache_positions = torch.where(
        arange_tensor < start_pos, self.cache_positions, full_t
    )
    self.cache_positions.copy_(cache_positions)
    self.cache_positions.index_copy_(0, indices, orig_indices)

Copilot AI Apr 7, 2026


CachePositionsManagerWithSink.calculate_positions_and_update_indices builds orig_indices/full_t/arange_tensor on the default device (CPU). If the module/buffers are moved to CUDA (e.g., model.to('cuda')), the subsequent torch.where/copy_/index_copy_ will error due to device mismatch. Consider creating these tensors on self.cache_positions.device (and matching dtype) so the manager works on any device.
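A device-safe variant of the index computation could look like this. It is a sketch with an assumed name (make_indices is hypothetical, extracted from the snippet above), showing the fix Copilot describes: allocate the helper tensors on the cache buffer's device instead of the default one.

```python
import torch

def make_indices(cache_positions: torch.Tensor, seq_len: int, start_pos: int,
                 sink_size: int, ring_size: int) -> torch.Tensor:
    """Compute ring-buffer write indices on the cache buffer's own device.

    Passing device=cache_positions.device when building orig_indices (and
    any full/arange helpers) avoids the CPU/CUDA mismatch that occurs when
    the module is moved with model.to('cuda').
    """
    dev = cache_positions.device
    orig_indices = torch.arange(seq_len, dtype=torch.long, device=dev) + start_pos
    return torch.where(
        orig_indices < sink_size,
        orig_indices,
        sink_size + (orig_indices - sink_size) % ring_size,
    )

cache = torch.full((8,), -1, dtype=torch.long)  # CPU here; works on CUDA too
print(make_indices(cache, seq_len=2, start_pos=5, sink_size=2, ring_size=4).tolist())
# -> [5, 2]  (position 6 wraps back onto ring slot 2)
```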

Comment on lines +234 to +236
    assert seq_len <= self.cache_positions_manager.ring_size, (
        f"Prefill sequence length ({seq_len}) exceeds ring buffer capacity "
        f"({self.cache_positions_manager.ring_size}), which would cause "

Copilot AI Apr 7, 2026


KVCacheWithAttentionSink.update asserts seq_len <= ring_size. This rejects valid prefills that include sink tokens (e.g., starting at pos=0 with seq_len == sink_size + ring_size), and it will fail with the current tests that prefill the entire cache. If the goal is to prevent duplicate indices within a single index_copy_ call, consider making the constraint depend on start_pos (e.g., only enforce the ring_size limit for the window portion) or otherwise ensure indices are unique before index_copy_.

Suggested change
-     assert seq_len <= self.cache_positions_manager.ring_size, (
-         f"Prefill sequence length ({seq_len}) exceeds ring buffer capacity "
-         f"({self.cache_positions_manager.ring_size}), which would cause "
+     start_pos = int(input_pos.reshape(-1)[0].item())
+     sink_tokens_remaining = max(0, self.sink_size - start_pos)
+     ring_write_len = max(0, seq_len - sink_tokens_remaining)
+     assert ring_write_len <= self.cache_positions_manager.ring_size, (
+         f"Update writes {ring_write_len} tokens into the ring buffer, "
+         f"which exceeds ring buffer capacity "
+         f"({self.cache_positions_manager.ring_size}) and would cause "

@meta-codesync bot force-pushed the export-D99900289 branch from 7cff409 to 8e7877f on April 8, 2026 06:58

Labels: CLA Signed, fb-exported, meta-exported
