Add StatefulSet as owner of Workload. #4799
Conversation
Skipping CI for Draft Pull Request.
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
Force-pushed 8cffe93 to d8935ee (compare)
Force-pushed d8935ee to 8634d41 (compare)
Force-pushed 422353b to 28a1799 (compare)
/cc @mimowo
Force-pushed 7245480 to 4422733 (compare)
```go
shouldUpdate = true
err = controllerutil.RemoveOwnerReference(sts, wl, r.client.Scheme())
```
Can you describe why we remove the owner reference from the workload on scaling down the StatefulSet?
IIUC this will result in deleting the workload, right? It seems safer to just update the count in the workload object, similar to what we do for Jobs. For example, when we suspend a Job we don't delete the workload IIRC.
> For example, when we suspend a Job we don't delete the workload IIRC.

It is a very interesting behavior in Job. I tried to suspend the Job manually, but it automatically unsuspends itself. The only way to suspend the Job is to deactivate the workload.
> Can you describe why we remove the owner reference from the workload on scaling down the StatefulSet?

The problem is that after activation, the replicas can change (replicas=3 → replicas=0 → replicas=5). Our workload is admitted, and we can't change the PodSet count since this field is immutable. I think it's much easier to just delete it.
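For context, a minimal sketch of the delete-on-scale-down approach under discussion, assuming a controller-runtime reconciler. The helper name and standalone-function shape are illustrative, not the actual Kueue code; the caller would be responsible for persisting the Workload update.

```go
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/utils/ptr"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// detachWorkloadOnScaleDown drops the StatefulSet owner reference from the
// Workload once replicas reach zero. Per the discussion above, an ownerless
// Workload then gets deleted, which sidesteps the immutable PodSet count: a
// fresh Workload with the new count is created when the StatefulSet scales
// back up.
func detachWorkloadOnScaleDown(sts *appsv1.StatefulSet, wl *kueue.Workload, scheme *runtime.Scheme) (bool, error) {
	if ptr.Deref(sts.Spec.Replicas, 1) != 0 {
		return false, nil
	}
	if err := controllerutil.RemoveOwnerReference(sts, wl, scheme); err != nil {
		return false, err
	}
	// Signal the caller to update the Workload (the shouldUpdate flag above).
	return true, nil
}
```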
> IIUC this will result in deleting the workload, right?

Yeah, you're right. It should delete the workload.
Yeah, for Job, when the count is updated, we update the count in the workload as well.
So my natural preference is to also update the count in the workload for STS, though I'm not sure whether it is much more complex. I'm still considering the options.

> The problem is that after activation, the replicas can change (replicas=3 → replicas=0 → replicas=5). Our workload is admitted, and we can't change the PodSet count since this field is immutable. I think it's much easier to just delete it.

I see, but to me it suggests we should also evict the workload for STS in that case, to be consistent with Job.
Can you try to prototype it, maybe in a separate PR, so that we can compare? I don't think we need to rush the decision. I would prefer to do it well.
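A rough sketch of the Job-style alternative mentioned above (keep the Workload and sync its count), assuming the same imports as the earlier sketch. The helper is hypothetical; as noted in the thread, spec.podSets[*].count is immutable on an admitted Workload, so the Workload would first have to be evicted/unadmitted for this to apply.

```go
// syncPodSetCount mirrors what the Job integration does when the count
// changes: keep the Workload and update its PodSet count to match the
// StatefulSet replicas. Hypothetical helper, not actual Kueue code; it
// returns true when the Workload needs to be updated by the caller.
func syncPodSetCount(sts *appsv1.StatefulSet, wl *kueue.Workload) bool {
	replicas := ptr.Deref(sts.Spec.Replicas, 1)
	if len(wl.Spec.PodSets) == 0 || wl.Spec.PodSets[0].Count == replicas {
		return false
	}
	wl.Spec.PodSets[0].Count = replicas
	return true
}
```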
Thank you for the investigation and answers. The PR looks good assuming we take the approach of deleting the workload, but I would prefer to explore keeping and updating the workload, as we do for Jobs.
Force-pushed 356b381 to 5b20f33 (compare)
/hold
```go
// Finished means whether the job is completed/failed or not,
// condition represents the workload finished condition.
func (p *Pod) Finished() (message string, success, finished bool) {
	if p.isServing() {
```
I think this is a bugfix for an issue on its own. Let me open one so that we can fix the issues one-by-one rather than multiple in one PR, because overlapping issues make it hard to reason about the solution. Solving them one-by-one will also let us build good test coverage. Opened: #4805
I can move these changes to a separate PR, but I'm not sure how to test them yet. We still need the OwnerReference of the StatefulSet to prevent it from being deleted.
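For reference, adding the StatefulSet as an owner of the Workload, the core change this PR title describes, could look roughly like the sketch below. The function name is illustrative, the snippet assumes the same imports as the earlier sketch, and the claim about other owners is an assumption rather than something confirmed in this thread.

```go
// addStatefulSetOwner records the StatefulSet as an additional owner of the
// Workload so the Workload is not treated as ownerless and deleted. A plain
// owner reference (not a controller reference) is used on the assumption that
// the Workload may already have other owners, such as the Pods of the group.
func addStatefulSetOwner(sts *appsv1.StatefulSet, wl *kueue.Workload, scheme *runtime.Scheme) error {
	// SetOwnerReference adds the reference, updating it in place if one for
	// this owner already exists, so calling it repeatedly is safe.
	return controllerutil.SetOwnerReference(sts, wl, scheme)
}
```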
/remove-lifecycle rotten
Force-pushed cb269ed to 9568409 (compare)
Force-pushed 9568409 to 367ac58 (compare)
/release-note-edit
mimowo left a comment
Thank you, I have revisited this PR. I'll take it.
cc @sohankunkerkar I know you prepared the alternative, but I think we need to go with one of the two. Since both PRs implement pretty much the same approach, that is a great signal it makes sense: both of you converged on the same solution. Thank you folks 👍
/lgtm
/approve
/cherrypick release-0.14
/cherrypick release-0.13
LGTM label has been added. Git tree hash: 0dfe16eb2f5ecd05a0be0160bf60fd404f68d48d
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: mbobrovskyi, mimowo. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/unhold
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/4799/pull-kueue-populator-test-e2e-main/1996989187164737536
/cherrypick release-0.14
@mimowo: new pull request created: #8102 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@mimowo: new pull request created: #8103 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
…#8104) * Add StatefulSet as owner of Workload. * Fix after cherry-pick.
/cherrypick release-0.15
@mimowo: new pull request created: #8105 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Add StatefulSet as owner of Workload.
Which issue(s) this PR fixes:
Fixes #4342
Fixes #4805
Fixes #6883
Fixes #7366
Special notes for your reviewer:
Does this PR introduce a user-facing change?