[None][chore] Fix lock_infra_error#15213

Open
yufeiwu-nv wants to merge 9 commits into NVIDIA:main from yufeiwu-nv:bug

Conversation

@yufeiwu-nv yufeiwu-nv commented Jun 10, 2026

Collaborator

Introduced a new helper function to identify lock-infrastructure errors, improving the robustness of the config_file_lock context manager. This change allows for better handling of temporary directory fallbacks during lock acquisition failures, ensuring that exceptions are properly propagated and logged.

Signed-off-by: yufeiwu-nv 230315618+yufeiwu-nv@users.noreply.github.com

@coderabbitai summary

Description

Problem

config_file_lock() re-raises filelock.Timeout instead of using its tempdir fallback. The errno narrowing added in #11960 (isinstance(e, OSError) and e.errno not in {EACCES, EPERM, ENOLCK, ESTALE}) was meant to let non-lock OSErrors propagate. But filelock.Timeout is an OSError subclass with errno=None, so it satisfies that condition and gets re-raised, defeating the very lock-acquisition-timeout fallback the function is supposed to provide.
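
A minimal sketch of the condition described above, so the failure mode can be checked in isolation. StandInTimeout is a hypothetical stand-in for filelock.Timeout (assumed here, per the description, to be an OSError subclass whose errno is None), and old_check_reraises is an illustrative name, not a function in the repository.

```python
import errno

# errnos the narrowing treated as fallback-eligible (taken from the description).
FALLBACK_ERRNOS = {errno.EACCES, errno.EPERM, errno.ENOLCK, errno.ESTALE}


class StandInTimeout(TimeoutError):
    """Hypothetical stand-in for filelock.Timeout: an OSError subclass, errno=None."""


def old_check_reraises(e: BaseException) -> bool:
    """The narrowing condition quoted above; True means the error is re-raised."""
    return isinstance(e, OSError) and e.errno not in FALLBACK_ERRNOS


timeout = StandInTimeout()
print(isinstance(timeout, OSError), timeout.errno)  # True None
print(old_check_reraises(timeout))  # True: the timeout is re-raised
print(old_check_reraises(PermissionError(errno.EACCES, "denied")))  # False
```

Because None is never a member of the errno set, any OSError without an errno takes the re-raise branch, which is how the timeout ends up skipping the fallback.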

Impact

When multiple ranks load a trust_remote_code model concurrently (tp/ep > 1), they contend on the single global _remote_code.lock. The ranks that time out crash during executor init and trigger MPI_ABORT — observed on the deepseek_r1_0528_fp4 ... ep:4-tp:4 perf test.

Fix

Refactor config_file_lock into a single-yield context manager that guards only the acquire() call. The yield is moved into else + finally release, so exceptions raised by the caller body (e.g. HF RepositoryNotFoundError, also an OSError subclass) propagate cleanly without a second yield. Fallback-eligible failures are now selected via isinstance — matching filelock.Timeout and PermissionError explicitly, plus NFS errnos ENOLCK/ESTALE.
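
One possible shape for such a context manager, as a sketch only and not the code in this PR. FakeLock, StandInTimeout, is_fallback_eligible and guarded_lock are all hypothetical names; FakeLock merely mimics an acquire()/release() interface so the control flow can be run without the filelock package.

```python
import contextlib
import errno


class StandInTimeout(TimeoutError):
    """Hypothetical stand-in for filelock.Timeout."""


class FakeLock:
    """Hypothetical lock with acquire()/release(); optionally fails on acquire."""

    def __init__(self, fail_with=None):
        self.fail_with = fail_with
        self.held = False
        self.acquire_calls = 0

    def acquire(self):
        self.acquire_calls += 1
        if self.fail_with is not None:
            raise self.fail_with
        self.held = True

    def release(self):
        self.held = False


def is_fallback_eligible(exc):
    """Select by isinstance, plus the NFS errnos named in the description."""
    if isinstance(exc, (StandInTimeout, PermissionError)):
        return True
    return isinstance(exc, OSError) and exc.errno in (errno.ENOLCK, errno.ESTALE)


@contextlib.contextmanager
def guarded_lock(primary, fallback):
    """Guard only acquire(); the caller body runs under a single yield."""
    try:
        primary.acquire()
    except Exception as exc:
        if not is_fallback_eligible(exc):
            raise
        fallback.acquire()
        lock = fallback
    else:
        lock = primary
    try:
        yield lock
    finally:
        lock.release()


primary, fallback = FakeLock(fail_with=StandInTimeout()), FakeLock()
with guarded_lock(primary, fallback) as lock:
    print(lock is fallback)  # True: the timeout routed to the fallback lock
```

Since the except clause wraps only the acquire() call, an OSError raised inside the with body never reaches the fallback branch; it passes through the finally release and propagates to the caller.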

Verification

Reproduced with a real-filelock multi-process contention test (1 holder + 3 workers): the shipped logic makes all waiting workers raise Timeout, while the fix lets them all fall back and succeed.
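
The scenario can be approximated without the filelock package. The sketch below is a thread-based simulation, not the multi-process test used for this PR: threading.Lock stands in for the shared lock file, StandInTimeout for filelock.Timeout, and a private uncontended lock for the tempdir lock. run_workers is a hypothetical helper.

```python
import threading


class StandInTimeout(TimeoutError):
    """Hypothetical stand-in for filelock.Timeout."""


def run_workers(shared_lock, use_fallback, n_workers=3, timeout=0.05):
    """Start workers that contend on shared_lock; return one outcome per worker."""
    outcomes = [None] * n_workers

    def worker(i):
        try:
            if not shared_lock.acquire(timeout=timeout):
                raise StandInTimeout()
            shared_lock.release()
            outcomes[i] = "primary"
        except StandInTimeout:
            if not use_fallback:
                outcomes[i] = "timeout"
                return
            # Fallback: a private lock nobody else contends on.
            with threading.Lock():
                outcomes[i] = "fallback"

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes


holder = threading.Lock()
holder.acquire()  # the holder keeps the lock for the whole run
try:
    without_fallback = run_workers(holder, use_fallback=False)
    with_fallback = run_workers(holder, use_fallback=True)
finally:
    holder.release()

print(without_fallback)  # ['timeout', 'timeout', 'timeout']
print(with_fallback)  # ['fallback', 'fallback', 'fallback']
```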

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…formance tests and configurations

- Updated model paths to include nemotron_3_ultra_550b_nvfp4 in HF_MODEL_PATH.
- Added configuration settings for nemotron_3_ultra_550b_nvfp4 in pytorch_model_config.py.
- Included new performance test cases for nemotron_3_ultra_550b_nvfp4 in test_perf.py and updated llm_perf_core.yml.
- Cleaned up legacy model name handling in test_perf.py.

Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
…ts for nemotron and llama models

- Reintroduced performance tests for nemotron_nano_12b_v2 and qwen3.5_27b models with various configurations.
- Added performance tests for llama_v3.3_nemotron_super_49b with multiple input/output lengths and GPU configurations.
- Ensured comprehensive coverage of performance benchmarks in the llm_perf_core.yml file.

Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
- Removed redundant test cases for llama_v3.1_nemotron_ultra_253b and adjusted the configuration for qwen3.5_122b_a10b.
- Added back performance tests for llama_v3.1_nemotron_ultra_253b with various input/output lengths and GPU configurations.
- Updated comments for clarity on the test cases included.

Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
Addresses CodeRabbit review: --log_level=info is a static literal and
does not need an f-string prefix (ruff F541).

Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
Introduced a new helper function to identify lock-infrastructure errors, improving the robustness of the config_file_lock context manager. This change allows for better handling of temporary directory fallbacks during lock acquisition failures, ensuring that exceptions are properly propagated and logged.

Signed-off-by: [Your Name] <your.email@example.com>
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>

coderabbitai Bot commented Jun 10, 2026

Contributor


Walkthrough

Refactors the HuggingFace cache lock acquisition to classify infrastructure errors and implement fallback logic. Adds a helper function to identify timeout and permission-related lock failures, then updates the context manager to acquire locks explicitly and conditionally retry with a tempdir-based lock when primary acquisition fails for infrastructure reasons.

Changes

Lock infrastructure improvement

| Layer / File(s) | Summary |
| --- | --- |
| Lock error classification and fallback locking (tensorrt_llm/_torch/model_config.py) | Introduces _is_lock_infra_error() to classify timeout, permission, and specific OSError failures as fallback-eligible. Refactors config_file_lock to use explicit acquire()/release() calls, conditionally acquire a tempdir lock on infrastructure errors, and proceed without a lock if both primary and fallback acquisitions fail due to infrastructure issues. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title '[None][chore] Fix lock_infra_error' follows the template format and clearly describes the main change (adding infrastructure for handling lock errors), though 'fix' may be more descriptive than 'chore'. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description comprehensively explains the problem, impact, fix, and verification with clear sections following the template structure. |


@yufeiwu-nv yufeiwu-nv removed request for a team, dongxuy04 and yizhang-nv June 10, 2026 08:57

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/model_config.py`:
- Around line 72-73: The helper _is_lock_infra_error currently treats
filelock.Timeout as an infrastructure failure; update that function to stop
classifying filelock.Timeout as an infra error (it is an acquisition timeout,
not broken lock infra). Locate _is_lock_infra_error and remove filelock.Timeout
from the isinstance check so only true infrastructure errors (e.g.,
PermissionError or other genuine file-locking exceptions you want to keep)
trigger the tempdir/no-lock fallback paths; ensure PermissionError handling
remains unchanged.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7b0c3216-d3ec-4a6d-a146-9bb344730ee7

📥 Commits

Reviewing files that changed from the base of the PR and between 90cb7ff and 2784114.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/model_config.py

Comment thread tensorrt_llm/_torch/model_config.py Outdated
Comment on lines +72 to +73
    if isinstance(exc, (filelock.Timeout, PermissionError)):
        return True

Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In filelock (Python), under what conditions is filelock.Timeout raised by FileLock.acquire(), and is it intended to represent lock contention vs infrastructure failure?

💡 Result:

FileLock.acquire raises filelock.Timeout when it cannot acquire the lock within the configured timeout period, or when acquire is called with blocking=False (or an equivalent “non-blocking” mode) and the lock is already held (so acquisition fails immediately). The API reference and exception doc both describe it as an acquisition-time failure due to timeout, not as a signal of infrastructure/IO failure. [1][2]

More specifically:

  • Timeout is raised “if the lock could not be acquired within timeout seconds.” [2]

  • In BaseFileLock.acquire, the documentation states Timeout is raised “if the lock cannot be acquired within the specified timeout or if blocking is False and the lock is held.” [1]

  • The docs also describe timeout semantics: blocking until the timeout expires or the lock is acquired; blocking=False returns immediately if it cannot obtain the lock on the first attempt. [1]

Is it intended to represent lock contention vs infrastructure failure?

  • In intent and wording, filelock.Timeout represents lock acquisition failure (i.e., contention/availability of the lock), because it is explicitly tied to not being able to acquire the lock within the timeout window / on the first non-blocking attempt. [1][2]

  • The exception class docstring itself does not mention infrastructure failure; it frames the problem solely as inability to acquire the lock in the allotted time. [2]

Note: other errors (e.g., OS/filesystem problems while creating/locking the lockfile) would typically surface as different exceptions; filelock.Timeout is specifically the timeout/acquisition-failure exception described above. (This follows directly from the documented meaning of filelock.Timeout rather than a general catch-all.) [1][2]

Citations:


Do not treat filelock.Timeout as a lock-infrastructure failure.

In tensorrt_llm/_torch/model_config.py, _is_lock_infra_error() currently treats filelock.Timeout as infra failure, which can route execution into the tempdir (or no-lock) fallback paths around lines 100–114. filelock.Timeout is an acquisition-time failure (lock contention / couldn’t acquire within the timeout), not broken locking infrastructure—so it shouldn’t trigger fallback.

Suggested fix
 def _is_lock_infra_error(exc: BaseException) -> bool:
     """Whether exc is a lock-infrastructure failure eligible for tempdir fallback."""
-    # filelock.Timeout subclasses OSError with errno=None, so match it explicitly.
-    if isinstance(exc, (filelock.Timeout, PermissionError)):
+    if isinstance(exc, PermissionError):
         return True
+    # Timeout is typically contention, not broken lock infra.
+    if isinstance(exc, filelock.Timeout):
+        return False
     if isinstance(exc, OSError):
         return exc.errno in (errno.EACCES, errno.EPERM, errno.ENOLCK,
                              errno.ESTALE)
     return False
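
To make the effect of this suggestion concrete, here is the suggested classifier in runnable form, as a sketch. StandInTimeout is a hypothetical stand-in for filelock.Timeout so the snippet runs without the filelock package, and is_lock_infra_error_suggested is an illustrative name.

```python
import errno


class StandInTimeout(TimeoutError):
    """Hypothetical stand-in for filelock.Timeout."""


def is_lock_infra_error_suggested(exc: BaseException) -> bool:
    """Classifier as suggested in the diff above (sketch, not the merged code)."""
    if isinstance(exc, PermissionError):
        return True
    # Timeout is treated as contention, so it is checked before the errno test.
    if isinstance(exc, StandInTimeout):
        return False
    if isinstance(exc, OSError):
        return exc.errno in (errno.EACCES, errno.EPERM, errno.ENOLCK,
                             errno.ESTALE)
    return False


samples = [
    PermissionError(errno.EACCES, "denied"),
    StandInTimeout(),
    OSError(errno.ESTALE, "stale handle"),
    FileNotFoundError(errno.ENOENT, "missing"),
    ValueError("not an OSError"),
]
results = [is_lock_infra_error_suggested(exc) for exc in samples]
print(results)  # [True, False, True, False, False]
```

Under this variant a timeout is no longer fallback-eligible, which differs from the selection described in the PR's Fix section, where filelock.Timeout is matched explicitly as eligible.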
Updated the logic in the _is_lock_infra_error function to better differentiate between lock contention and broken infrastructure. Enhanced the config_file_lock context manager to log warnings appropriately when lock acquisition fails, ensuring clearer error handling and fallback behavior.

Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>