Add support for JobSet DependsOn with PodsReady #7889
Conversation
Skipping CI for Draft Pull Request.
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
kannon92 left a comment
cc @mimowo
Is this aligned with the idea you have for this feature?
TBH, I don't think this is a feature, it is more of a bugfix after DependsOn was introduced into JobSet. I will try to check that in more detail tomorrow, but possibly only after release 0.15.0.
It would be great to have an e2e test demonstrating this works for waitForPodsReady.
e2e might be possible but tricky because the nature of the feature is timing-based, so I'm worried about flakes. I would start with unit and integration tests.
/hold

/test pull-kueue-test-integration-baseline-main
This should have a release note.
/hold cancel
When dependsOn is defined on a JobSet and waitForPodsReady is configured in Kueue, the PodsReady condition should be updated only for the Jobs that have started. That is not the case currently: all the Jobs, including the ones with dependsOn that have not started, are considered when checking whether the respective Pods are ready. This commit fixes the issue by ensuring that whenever there is some update in a Job's status, the Job has started to be managed and should be taken into account when determining whether all the Pods are ready.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: MaysaMacedo, olekzabl. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing
status, ok := jobsStatus[replicatedJob.Name]
// Register the amount of expected replicas from jobs that already have
// some status update or don't have a dependsOn.
if replicatedJob.DependsOn == nil || (ok && status.Active+status.Failed+status.Ready+status.Succeeded+status.Suspended > 0) {
Here I think these changes are required:
- replicatedJob.DependsOn == nil
- replicatedJob.DependsOn == Ready && status.Ready+status.Succeeded >= spec.ReplicatedJobs[idx of the Job]
- replicatedJob.DependsOn == Completed && status.Succeeded >= spec.ReplicatedJobs[idx of the Job]
So:
- the first case is easy, as before
- in the second, we delay the check until the upstream job is Ready, which in practice should be quick
- in the third, we delay the check until the upstream job is Complete, which can take arbitrarily long
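For illustration, the three cases above could be sketched roughly like this. This is a minimal Go sketch: the types, constants, and function name are simplified stand-ins for illustration, not the real jobset.x-k8s.io API.

```go
package main

import "fmt"

// DependsOnStatus is a hypothetical stand-in for the dependsOn status
// values discussed above ("Ready" / "Complete").
type DependsOnStatus string

const (
	DependsOnReady    DependsOnStatus = "Ready"
	DependsOnComplete DependsOnStatus = "Complete"
)

// ReplicatedJobStatus is a simplified stand-in for a replicated job's
// status counters.
type ReplicatedJobStatus struct {
	Ready, Succeeded int32
}

// countsTowardPodsReady sketches the three cases: no dependsOn counts
// immediately; dependsOn=Ready counts once the upstream job has enough
// ready-or-succeeded replicas; dependsOn=Complete counts once the
// upstream job has enough succeeded replicas.
func countsTowardPodsReady(dependsOn *DependsOnStatus, upstream ReplicatedJobStatus, upstreamReplicas int32) bool {
	if dependsOn == nil {
		return true // first case: easy, as before
	}
	switch *dependsOn {
	case DependsOnReady:
		return upstream.Ready+upstream.Succeeded >= upstreamReplicas
	case DependsOnComplete:
		return upstream.Succeeded >= upstreamReplicas
	}
	return false
}

func main() {
	ready := DependsOnReady
	fmt.Println(countsTowardPodsReady(nil, ReplicatedJobStatus{}, 3))                            // true
	fmt.Println(countsTowardPodsReady(&ready, ReplicatedJobStatus{Ready: 2}, 3))                 // false
	fmt.Println(countsTowardPodsReady(&ready, ReplicatedJobStatus{Ready: 2, Succeeded: 1}, 3))   // true
}
```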
I'm not sure I follow your recommendation.
Why do we need to differentiate between a job having DependsOn Ready or Complete?
Isn't it enough to verify whether any status has changed for that job? That would mean the dependsOn dependency was satisfied and the job has started, so that job should be taken into account when defining PodsReady.
Ok I might be missing something as well, but this is my understanding.
Consider this example:
- JobA dependsOn=Complete JobB
- JobB is running for 1h, so JobA hasn't started yet
Now the question: should we declare the JobSet as PodsReady=true, or PodsReady=false?
I was initially thinking that in this case PodsReady=true, because the JobSet is making progress. If we declared PodsReady=false in this case, then it would require the administrator to set waitForPodsReady.timeout > 1h.
OTOH, maybe this case, JobB taking 1h, is a bit artificial.
If we use my recommendation and declare PodsReady=true for the JobSet, then in this case the JobSet may transition from PodsReady=true to PodsReady=false once JobB is complete. The admin could compensate by increasing waitForPodsReady.recoveryTimeout, but that is not really the primary use case of that timeout.
The current version of this PR achieves your idea, I think:
- Job B is running
- Job A dependsOn Job B complete
As the only job running is Job B, the status is PodsReady=true. Once Job B is completed, Job A starts and the PodsReady condition might transition to False if Pod A takes a while to become ready, for example while downloading an image, and then it transitions back to PodsReady=true when Pod A is ready.
The rationale I tried to follow in my implementation is:
The jobs that should be taken into account to set podsReady are the ones that have started.
Hm, can you point me to a test for that?
I see, interesting idea. I will continue tomorrow
The jobs that should be taken into account to set podsReady are the ones that have started.
When I mentioned "have started", I meant that the job has any status updates across status.Active+status.Failed+status.Ready+status.Succeeded+status.Suspended.
From what I tested, if the job hasn't started to be handled, it has all of these as 0 in its status.
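The "has any status updates" heuristic described above could be sketched like this. This is a minimal Go sketch; the struct is a hypothetical stand-in for a replicated job's status counters, not the real JobSet API type.

```go
package main

import "fmt"

// ReplicatedJobStatus is a hypothetical, simplified stand-in for a
// replicated job's status counters.
type ReplicatedJobStatus struct {
	Active, Failed, Ready, Succeeded, Suspended int32
}

// hasStarted mirrors the heuristic above: a job is considered started
// (and therefore relevant to PodsReady) once any of its status counters
// is non-zero; a job still blocked by dependsOn has all counters at 0.
func hasStarted(s ReplicatedJobStatus) bool {
	return s.Active+s.Failed+s.Ready+s.Succeeded+s.Suspended > 0
}

func main() {
	fmt.Println(hasStarted(ReplicatedJobStatus{}))          // blocked by dependsOn: false
	fmt.Println(hasStarted(ReplicatedJobStatus{Active: 2})) // being managed: true
}
```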
What type of PR is this?
/kind feature
What this PR does / why we need it:
When dependsOn is defined on a JobSet and waitForPodsReady is configured in Kueue, the PodsReady condition should be updated only for the Jobs that have started. That is not the case currently: all the Jobs, including the ones with dependsOn that have not started, are considered when checking whether the respective Pods are ready. This commit fixes the issue by ensuring that whenever there is some update in a Job's status, the Job has started to be managed and should be taken into account when determining whether all the Pods are ready.
Which issue(s) this PR fixes:
Fixes #6796
Special notes for your reviewer:
Does this PR introduce a user-facing change?