[#15178][fix] Fix unified-memory Mamba KV estimation #15215

Open
peter941221 wants to merge 1 commit into NVIDIA:main from peter941221:fix/unified-mem-mamba-kv-est-clean

peter941221 commented Jun 10, 2026

Description

Refs #15178.

On integrated GPUs, the estimation dry run can start from a mem_get_info() budget that is already depressed by mmap-backed weights sharing the same physical memory pool.

When hybrid Mamba models use an affine CacheCost, _get_token_num_for_estimation() subtracts the recurrent-state intercept from that reduced budget and can clamp the provisional token cap to zero. That is enough to trip assert max_blocks_per_seq > 0 in the attention-window path even though the later affine sizing still succeeds.

This change keeps the final affine sizing unchanged. It only relaxes the provisional estimation cap on integrated GPUs by dropping the affine intercept from the dry-run budget calculation.
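
As a rough illustration of the arithmetic described above (not the actual code in _util.py; the function name and parameters are hypothetical), the provisional cap under an affine cost behaves roughly like this:

```python
def provisional_token_cap(free_bytes: int,
                          bytes_per_token: int,
                          intercept_bytes: int,
                          is_integrated: bool) -> int:
    """Hypothetical stand-in for the dry-run token cap under an affine cache cost.

    Assumed cost model: cost(tokens) = intercept_bytes + bytes_per_token * tokens
    """
    if is_integrated and intercept_bytes > 0:
        # Relaxation described in this PR: ignore the intercept for the dry run only.
        intercept_bytes = 0
    usable = max(free_bytes - intercept_bytes, 0)  # budget is clamped at zero
    return usable // bytes_per_token


# Illustrative numbers only: 1 GiB reported free, 2 GiB recurrent-state intercept.
GIB = 1 << 30
print(provisional_token_cap(GIB, 64 * 1024, 2 * GIB, is_integrated=False))  # 0
print(provisional_token_cap(GIB, 64 * 1024, 2 * GIB, is_integrated=True))   # 16384
```

In the first call the intercept exceeds the reported free memory, so the cap clamps to zero, which is the condition that would trip the assert. The second call shows the relaxed dry-run cap; the final affine sizing is a separate step and is not modeled here.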

The regression test covers the zero-clamp case by mocking an affine CacheCost, a small mem_get_info() budget, and an integrated device.
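
For readers without the repository at hand, the shape of that test can be sketched as follows. This is a self-contained approximation, not the test in this PR: the `device` namespace and `estimate_token_cap` are hypothetical stand-ins, and the real test patches TensorRT-LLM internals instead. The `(free, total)` tuple mirrors what `torch.cuda.mem_get_info()` returns.

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-ins for the memory query and the integrated-device check.
device = SimpleNamespace(
    mem_get_info=lambda: (8 << 30, 16 << 30),  # (free, total) in bytes
    is_device_integrated=lambda: False,
)


def estimate_token_cap(slope: int, intercept: int) -> int:
    """Hypothetical dry-run cap for an affine cost: intercept + slope * tokens."""
    free, _total = device.mem_get_info()
    if device.is_device_integrated() and intercept > 0:
        intercept = 0  # dry-run relaxation on integrated GPUs
    return max(free - intercept, 0) // slope


def test_integrated_gpu_estimation_ignores_affine_intercept():
    small_budget = (1 << 20, 1 << 30)  # 1 MiB free, well below the intercept
    with mock.patch.object(device, "mem_get_info", return_value=small_budget), \
         mock.patch.object(device, "is_device_integrated", return_value=True):
        # Without the relaxation this would clamp to zero.
        assert estimate_token_cap(slope=1024, intercept=1 << 22) == 1024


test_integrated_gpu_estimation_ignores_affine_intercept()
```

The three mocked ingredients match the ones listed above: an affine cost with a positive intercept, a small free-memory budget, and a device reported as integrated.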

Test Coverage

Validated in the matching TensorRT-LLM 1.3.0rc18 CUDA 13 / PyTorch 2.10 runtime with:

  • python -m pytest tests/unittest/_torch/executor/test_kv_cache_estimation.py -k integrated_gpu_estimation_ignores_affine_intercept -q
  • python -m pytest tests/unittest/_torch/executor/test_kv_cache_estimation.py -q

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

Bug Fixes

  • Fixed KV-cache token estimation on integrated (unified-memory) GPUs so that the provisional token cap no longer collapses to zero when the affine cache cost has a positive intercept.

Tests

  • Added regression test for KV-cache token estimation behavior on integrated GPU systems.

Signed-off-by: peter941221 <peter941221@gmail.com>
coderabbitai Bot commented Jun 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c0f2d7ee-abac-48f6-b139-6752436bfaf6

📥 Commits

Reviewing files that changed from the base of the PR and between 90cb7ff and 38864aa.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tests/unittest/_torch/executor/test_kv_cache_estimation.py

Walkthrough

The PR fixes KV-cache token estimation on integrated/unified-memory GPUs by importing a device detection utility and conditionally zeroing the affine cost intercept during dry-run sizing, preventing token-cap collapse. A regression test validates the fix.

Changes

KV-cache Estimation for Integrated GPUs

| Layer / File(s) | Summary |
| --- | --- |
| Intercept-zeroing logic and regression test: tensorrt_llm/_torch/pyexecutor/_util.py, tests/unittest/_torch/executor/test_kv_cache_estimation.py | Imports is_device_integrated and detects positive intercepts in the KV-size-per-token affine model; on integrated devices, the intercept is zeroed during token-budget calculation to prevent the provisional token cap from collapsing. A new regression test stubs GPU memory and verifies that token estimation still rounds correctly to the expected block-based total when is_device_integrated=True and an affine intercept is present. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: fixing unified-memory (integrated GPU) Mamba KV cache estimation, with proper GitHub issue reference and fix type. |
| Description check | ✅ Passed | The description includes all key required sections: a clear explanation of the issue and solution, test coverage details with specific pytest commands, and a completed PR checklist confirming guideline compliance. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Comment @coderabbitai help to get the list of available commands and usage tips.
