[#15178][fix] Fix unified-memory Mamba KV estimation by peter941221 · Pull Request #15215 · NVIDIA/TensorRT-LLM

peter941221 · 2026-06-10T09:09:50Z

Description

On integrated GPUs, the estimation dry run can start from a mem_get_info() budget that is already depressed by mmap-backed weights sharing the same physical memory pool.

When hybrid Mamba models use an affine CacheCost, _get_token_num_for_estimation() subtracts the recurrent-state intercept from that reduced budget and can clamp the provisional token cap to zero. That is enough to trip assert max_blocks_per_seq > 0 in the attention-window path even though the later affine sizing still succeeds.
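
To make the failure mode concrete, here is a minimal sketch of the arithmetic with made-up numbers. `AffineCost`, its field names, and `provisional_token_cap` are hypothetical stand-ins for illustration; they are not the actual `CacheCost` class or the body of `_get_token_num_for_estimation()`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffineCost:
    """Hypothetical affine cache cost: bytes = intercept + slope * tokens."""
    bytes_per_token: int   # slope: per-token attention KV bytes (assumed name)
    intercept_bytes: int   # fixed cost, e.g. recurrent-state memory (assumed name)


def provisional_token_cap(free_bytes: int, cost: AffineCost) -> int:
    """Subtract the intercept from the budget, divide by the slope, clamp at zero."""
    usable = free_bytes - cost.intercept_bytes
    return max(0, usable // cost.bytes_per_token)


GiB = 1 << 30
cost = AffineCost(bytes_per_token=64 * 1024, intercept_bytes=3 * GiB)

# A roomy budget still leaves space after the intercept is subtracted.
roomy = provisional_token_cap(free_bytes=40 * GiB, cost=cost)

# A budget smaller than the intercept goes negative and is clamped to zero.
tight = provisional_token_cap(free_bytes=2 * GiB, cost=cost)
```

With these illustrative values, a reported free budget below the intercept yields a cap of zero tokens, which is the condition that would later violate an assertion such as `max_blocks_per_seq > 0`.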

This change keeps the final affine sizing unchanged. It only relaxes the provisional estimation cap on integrated GPUs by dropping the affine intercept from the dry-run budget calculation.
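
The relaxation described above can be sketched roughly as follows. This is an assumption-laden illustration, not the patched code: `dry_run_token_cap` and its parameters are hypothetical, and the `integrated` flag stands in for whatever the real device check returns.

```python
def dry_run_token_cap(free_bytes: int,
                      bytes_per_token: int,
                      intercept_bytes: int,
                      integrated: bool) -> int:
    """Provisional token cap for the estimation dry run (hypothetical helper).

    On integrated devices a positive intercept is ignored for this provisional
    cap only; the final affine sizing is assumed to happen elsewhere, unchanged.
    """
    if integrated and intercept_bytes > 0:
        effective_intercept = 0
    else:
        effective_intercept = intercept_bytes
    return max(0, (free_bytes - effective_intercept) // bytes_per_token)


GiB = 1 << 30
KiB = 1 << 10

# Same small budget and intercept, with and without the integrated-device relaxation.
strict_cap = dry_run_token_cap(2 * GiB, 64 * KiB, 3 * GiB, integrated=False)
relaxed_cap = dry_run_token_cap(2 * GiB, 64 * KiB, 3 * GiB, integrated=True)
```

In this sketch the strict path clamps to zero while the relaxed path keeps a positive provisional cap, so the dry run can proceed and leave the real sizing to the later affine calculation.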

The regression test covers the zero-clamp case by mocking an affine CacheCost, a small mem_get_info() budget, and an integrated device.
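
A test in this style might look like the sketch below. It is self-contained and does not import TensorRT-LLM: the `util` namespace, `estimate_tokens`, and the round-down-to-whole-blocks step are all assumptions made for illustration, and only the test name mirrors the `-k` filter used in this PR.

```python
import types
from unittest import mock

# Hypothetical stand-in for the module under test; the real code lives elsewhere.
util = types.SimpleNamespace(
    mem_get_info=lambda: (0, 0),          # would return (free_bytes, total_bytes)
    is_device_integrated=lambda: False,   # would report integrated/unified memory
)


def estimate_tokens(bytes_per_token: int, intercept_bytes: int,
                    tokens_per_block: int) -> int:
    """Toy estimator: drop the intercept on integrated devices, then round
    down to a whole number of blocks (the rounding direction is an assumption)."""
    free_bytes, _total = util.mem_get_info()
    if util.is_device_integrated():
        intercept_bytes = 0
    tokens = max(0, (free_bytes - intercept_bytes) // bytes_per_token)
    return (tokens // tokens_per_block) * tokens_per_block


def test_integrated_gpu_estimation_ignores_affine_intercept():
    # Small free budget (1000 bytes) that is below the 5000-byte intercept.
    with mock.patch.object(util, "mem_get_info", return_value=(1000, 8000)), \
         mock.patch.object(util, "is_device_integrated", return_value=True):
        result = estimate_tokens(bytes_per_token=10, intercept_bytes=5000,
                                 tokens_per_block=32)
    # 1000 // 10 = 100 tokens, rounded down to 3 blocks of 32.
    assert result == 96


def test_discrete_gpu_still_clamps_to_zero():
    with mock.patch.object(util, "mem_get_info", return_value=(1000, 8000)), \
         mock.patch.object(util, "is_device_integrated", return_value=False):
        assert estimate_tokens(10, 5000, 32) == 0
```

Patching both the memory query and the device check keeps the test independent of the hardware it runs on, which is what allows the zero-clamp case to be reproduced on a discrete-GPU CI machine.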

Test Coverage

Validated in the matching TensorRT-LLM 1.3.0rc18 CUDA 13 / PyTorch 2.10 runtime with:

python -m pytest tests/unittest/_torch/executor/test_kv_cache_estimation.py -k integrated_gpu_estimation_ignores_affine_intercept -q
python -m pytest tests/unittest/_torch/executor/test_kv_cache_estimation.py -q

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

Bug Fixes

Fixed KV-cache token estimation on integrated and unified-memory devices so that the provisional token cap no longer collapses to zero when an affine intercept is present.

Tests

Added regression test for KV-cache token estimation behavior on integrated GPU systems.

Signed-off-by: peter941221 <peter941221@gmail.com>

coderabbitai · 2026-06-10T09:21:02Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c0f2d7ee-abac-48f6-b139-6752436bfaf6

📥 Commits

Reviewing files that changed from the base of the PR and between 90cb7ff and 38864aa.

📒 Files selected for processing (2)

tensorrt_llm/_torch/pyexecutor/_util.py
tests/unittest/_torch/executor/test_kv_cache_estimation.py

📝 Walkthrough


The PR fixes KV-cache token estimation on integrated/unified-memory GPUs by importing a device detection utility and conditionally zeroing the affine cost intercept during dry-run sizing, preventing token-cap collapse. A regression test validates the fix.

Changes

KV-cache Estimation for Integrated GPUs

Layer / File(s)	Summary
Intercept-zeroing logic and regression test `tensorrt_llm/_torch/pyexecutor/_util.py`, `tests/unittest/_torch/executor/test_kv_cache_estimation.py`	Imports `is_device_integrated` and detects positive intercepts in the KV-size-per-token affine model; on integrated devices, the intercept is zeroed during token-budget calculation to prevent the provisional token cap from collapsing. A new regression test stubs GPU memory and verifies that token estimation still rounds correctly to the expected block-based total when `is_device_integrated=True` and an affine intercept is present.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related issues

[Bug] Hybrid Mamba KV-cache estimation clamps max_tokens to 0 on integrated/unified-memory GPUs (regression in 1.3.0rc15, commit 091ad7b0) #15178: Directly addresses the integrated GPU KV-cache estimation bug by ignoring the affine intercept during dry-run sizing.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main change: fixing unified-memory (integrated GPU) Mamba KV cache estimation, with proper GitHub issue reference and fix type.
Description check	✅ Passed	The description includes all key required sections: a clear explanation of the issue and solution, test coverage details with specific pytest commands, and a completed PR checklist confirming guideline compliance.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


Fix unified-memory Mamba KV estimation

38864aa

Signed-off-by: peter941221 <peter941221@gmail.com>

github-actions Bot assigned peter941221 Jun 10, 2026

peter941221 mentioned this pull request Jun 10, 2026

[Bug] Hybrid Mamba KV-cache estimation clamps max_tokens to 0 on integrated/unified-memory GPUs (regression in 1.3.0rc15, commit 091ad7b0) #15178

Open

4 tasks

peter941221 marked this pull request as ready for review June 10, 2026 09:16

peter941221 requested a review from a team as a code owner June 10, 2026 09:16

peter941221 requested a review from joyang-nv June 10, 2026 09:16

peter941221 wants to merge 1 commit into NVIDIA:main from peter941221:fix/unified-mem-mamba-kv-est-clean
