Skip to content

Conversation

@sohankunkerkar
Copy link
Member

@sohankunkerkar sohankunkerkar commented Dec 3, 2025

Include detailed node exclusion reasons (taints, nodeSelector, affinity, resources) in TAS scheduling failure messages to improve debuggability.

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #7854

Special notes for your reviewer:

Does this PR introduce a user-facing change?

TAS: extend the information in condition messages and events about nodes excluded from calculating the
assignment due to various recognized reasons like: taints, node affinity, node resource constraints.

Copilot AI review requested due to automatic review settings December 3, 2025 05:40
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Dec 3, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sohankunkerkar
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@netlify
Copy link

netlify bot commented Dec 3, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 54180f6
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/69334de11424dd0008730350

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 3, 2025
Copilot finished reviewing on behalf of sohankunkerkar December 3, 2025 05:43
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances TAS (Topology Aware Scheduling) failure messages by adding detailed node exclusion statistics. When scheduling fails, users will now see why nodes were excluded (e.g., taints, nodeSelector mismatches, affinity rules, insufficient resources) along with counts for each exclusion reason.

Key Changes:

  • Introduced ExclusionStats to track and report node exclusion reasons during TAS scheduling
  • Enhanced failure messages to include total node count and breakdown of exclusion reasons
  • Updated 30+ test cases to validate the new detailed error messages

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
pkg/cache/scheduler/tas_flavor_snapshot.go Core implementation: Added ExclusionStats struct, limitingResource function for identifying bottleneck resources, and updated fillInCounts/notFitMessage to track and format exclusion statistics
pkg/scheduler/scheduler_tas_test.go Updated 16 test cases to match new detailed failure message format with exclusion statistics
pkg/cache/scheduler/tas_cache_test.go Updated 14 test cases and added 2 new test cases to validate exclusion stats formatting and resource bottleneck detection

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@sohankunkerkar sohankunkerkar changed the title [WIP] pkg/cache/scheduler: add exclusion stats to TAS failure messages pkg/cache/scheduler: add exclusion stats to TAS failure messages Dec 5, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 5, 2025
Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, I love this PR, this will greatly improve debuggability.

It does not change the API other than messages so I would like to backport it.

cc @PBundyra @pajakd @mwysokin

@mimowo
Copy link
Contributor

mimowo commented Dec 5, 2025

/remove-kind cleanup
/kind feature
/release-note-edit

TAS: extend the information in condition messages and events about nodes excluded from calculating the
assignment due to various recognized reasons like: taints, node affinity, node resource constraints.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. release-note-none Denotes a PR that doesn't merit a release note. labels Dec 5, 2025
@sohankunkerkar sohankunkerkar force-pushed the tas-debug-information branch 2 times, most recently from e46a4e4 to 3c7ebd9 Compare December 5, 2025 20:56
Include detailed node exclusion reasons (taints, nodeSelector, affinity,
resources) in TAS scheduling failure messages to improve debuggability.

Fixes: kubernetes-sigs#7854

Signed-off-by: Sohan Kunkerkar <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TAS: Improve debuggability when scheduling fails

3 participants