Skip to content

Conversation

@sohankunkerkar
Copy link
Member

@sohankunkerkar sohankunkerkar commented Dec 2, 2025

When changing the queue name on a StatefulSet, the webhook propagates the change to the pod template, triggering a rolling update. Previously, the workload was only owned by pods, so when all pods were deleted during the rolling update, Kubernetes GC would delete the workload. This fix adds the StatefulSet as an additional owner of the workload,
preventing GC from deleting it when pods are replaced. This is similar to how LeaderWorkerSet sets itself as the owner of the workload.

What type of PR is this?

/kind flake

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #6883

Special notes for your reviewer:

I reproduced this locally while working on some other issue (not related to Statefulset at all)

Does this PR introduce a user-facing change?

None

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. labels Dec 2, 2025
@k8s-ci-robot k8s-ci-robot requested a review from PBundyra December 2, 2025 17:50
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sohankunkerkar
Once this PR has been reviewed and has the lgtm label, please assign gabesaba for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from tenzen-y December 2, 2025 17:50
@netlify
Copy link

netlify bot commented Dec 2, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit f1f48ba
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6930d910172c74000832e2f6

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 2, 2025
@sohankunkerkar
Copy link
Member Author

Without this change, you can reproduce this issue by running this following command:

$ make test-e2e GINKGO_ARGS='--focus=ReadyReplicas=0 --repeat=50'

I saw this in the logs: (I'm going to parse out the details)

  # Pod created and workload created with INVALID queue
  18:51:55.880  Workload create event  queue="sts-lq-invalid"
  18:51:55.880  localQueue doesn't exist or inactive

  # Test updates queue label on StatefulSet (at 18:51:56.821)

  # Pod gets suspended because workload not yet admitted
  18:51:56.875  Running job is not admitted by a cluster queue, suspending
  18:51:56.901  Not admitted by cluster queue  reason="Stopped"

  # Queue updated and workload admitted
  18:51:56.953  Workload update event  queue="sts-lq"  prevQueue="sts-lq-invalid"
  18:51:56.978  Workload update event  queue="sts-lq"  status="admitted"  clusterQueue="sts-cq"
  18:51:56.979  Job admitted, unsuspending

  # Workload gets DELETED right after being admitted
  18:51:57.000  Workload delete event  queue="sts-lq"  status="admitted"
  18:51:57.000  job with no matching workload, suspending
  18:51:57.042  Missing Workload; unable to restore pod templates  reason="Stopped"

  # New workload created, but pods already in bad state
  18:51:57.068  Workload create event  queue="sts-lq"  status="pending"
  18:51:57.111  Workload update event  queue="sts-lq"  status="admitted"
  18:51:59.055  Job admitted, unsuspending

  # Test times out waiting for ReadyReplicas=3 (stays at 0)

@sohankunkerkar sohankunkerkar changed the title test/e2e: fix race in StatefulSet queue name update test [WIP] test/e2e: fix race in StatefulSet queue name update test Dec 2, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 2, 2025
@sohankunkerkar
Copy link
Member Author

It needs more testing
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 2, 2025
@sohankunkerkar
Copy link
Member Author

After investigating, I suspect the flake is caused by a Workload being deleted and immediately recreated after admission. Tracing the deletion path points to FindMatchingWorkloads in pkg/controller/jobs/pod/pod_controller.go, which maps Workloads to Pods.
It seems to be a race during StatefulSet rolling updates: when the queue name changes, the controller updates the Workload’s expected RoleHash before the new Pods appear in the informer cache. In that gap, FindMatchingWorkloads sees the updated Workload but only the old Pods. Because the code checks len(keptPods) == 0, it treats the Workload as invalid and deletes it.

I tested locally and found that removing the len(keptPods) == 0 branch preserves the Workload during the transition, eliminating the delete–recreate loop. I want to understand why this check was added and what the impact of removing it might be.

Copilot AI review requested due to automatic review settings December 3, 2025 22:54
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 3, 2025
@sohankunkerkar sohankunkerkar changed the title [WIP] test/e2e: fix race in StatefulSet queue name update test pkg/controller: Fix workload deletion during StatefulSet queue name change Dec 3, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 3, 2025
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a race condition in StatefulSet reconciliation by ensuring the StatefulSet is added as an owner of its workload early in the reconciliation process. Workloads for StatefulSets are created by the Pod controller when pods are created, and initially only have pod owner references. Without the StatefulSet as an owner, workloads could be garbage collected during rolling updates when pods are replaced, particularly during queue label updates. The fix refactors the reconciler to fetch the StatefulSet once and call ensureWorkloadOwnership before finalizing pods.

Key changes:

  • Added ensureWorkloadOwnership function to add StatefulSet as workload owner early in reconciliation
  • Refactored fetchAndFinalizePods to accept StatefulSet parameter instead of fetching it internally
  • Removed duplicate StatefulSet fetch logic from fetchAndFinalizePods

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

When changing the queue name on a StatefulSet, the webhook propagates
the change to the pod template, triggering a rolling update. Previously,
the workload was only owned by pods, so when all pods were deleted
during the rolling update, Kubernetes GC would delete the workload.

This fix adds the StatefulSet as an additional owner of the workload,
preventing GC from deleting it when pods are replaced. This is similar
to how LeaderWorkerSet sets itself as the owner of the workload.

Signed-off-by: Sohan Kunkerkar <[email protected]>
@sohankunkerkar
Copy link
Member Author

/test pull-kueue-test-e2e-customconfigs-main

@mbobrovskyi
Copy link
Contributor

mbobrovskyi commented Dec 4, 2025

We already have this fix #4799. Still waiting for review from @mimowo.

@mimowo
Copy link
Contributor

mimowo commented Dec 4, 2025

@mbobrovskyi is this PR also fixing the issue? It looks simpler, and like the commentary and logging here helping me to understand. Part of the reason I haven't been able to quickly review the previous is complexity.

So maybe we can all work together on identifying the similarities and differences and picking up the best of both PRs.

I will try to find time to review today or tomorrow.

@mimowo
Copy link
Contributor

mimowo commented Dec 5, 2025

@sohankunkerkar @mbobrovskyi I'm going to review the PRs today and decide what next. But it seems they both address #6883 so please mark this as "Fixed" in the PR descriptions

@mimowo
Copy link
Contributor

mimowo commented Dec 5, 2025

/close
I decided to go with #4799, as it is more mature and with tests. It also takes a bit more care about parallel operations. In any case, thank you for the effort. It makes me confident in that way of fixing it.

@k8s-ci-robot
Copy link
Contributor

@mimowo: Closed this PR.

In response to this:

/close
I decided to go with #4799, as it is more mature and with tests. It also takes a bit more care about parallel operations. In any case, thank you for the effort. It makes me confident in that way of fixing it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/flake Categorizes issue or PR as related to a flaky test. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[flaky test] StatefulSet created should allow to change queue name if ReadyReplicas

4 participants