Skip to content

Conversation

@MaysaMacedo
Copy link
Contributor

@MaysaMacedo MaysaMacedo commented Nov 25, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

When a dependsOn is defined on a JobSet and the waitForPodsReady
is configured in Kueue, the PodsReady condition should be udpated
only for the Jobs that are started, which is not the case currently,
all the jobs, including the ones with dependsOn that are not started
are being considered when checking if the respective Pods are ready.
This commit fixes the issue by ensuring that whenever there is some
updated in the job, it means the job has started to be managed and
it should be taken into account when defining if all the Pods are
ready.

Which issue(s) this PR fixes:

Fixes #6796

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Adds support for PodsReady when JobSet dependsOn is used.

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 25, 2025
@netlify
Copy link

netlify bot commented Nov 25, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit a82cad0
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6932eae50bb2070008942e17

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 25, 2025
Copy link
Contributor

@kannon92 kannon92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @mimowo

Is this aligned with the idea you have for this feature?

@mimowo
Copy link
Contributor

mimowo commented Nov 25, 2025

TBH, I don't think this is a feature, it is more of a bugfix after DependsOn was introduced into JobSet.

Is this aligned with the idea you have for this feature?

I will try to check that in more detail tomorrow, but possibly only after release 0.15.0

@kannon92
Copy link
Contributor

It would be great to have a e2e test demonstrating this works for waitForPodsReady.

@mimowo
Copy link
Contributor

mimowo commented Nov 25, 2025

It would be great to have a e2e test demonstrating this works for waitForPodsReady.

e2e might be possible but tricky because the nature of the feature is timing-based, so I'm worried about flakes. I would start with unit and integration

@MaysaMacedo MaysaMacedo force-pushed the support-jobset-waitforpodready branch from 26585c7 to 2ccaac0 Compare December 1, 2025 15:29
@MaysaMacedo MaysaMacedo marked this pull request as ready for review December 1, 2025 15:29
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 1, 2025
@k8s-ci-robot k8s-ci-robot requested a review from kannon92 December 1, 2025 15:30
@k8s-ci-robot k8s-ci-robot removed the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Dec 1, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 1, 2025
@MaysaMacedo
Copy link
Contributor Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 1, 2025
@MaysaMacedo
Copy link
Contributor Author

/test pull-kueue-test-integration-baseline-main

@kannon92
Copy link
Contributor

kannon92 commented Dec 1, 2025

This should have a release note.

@MaysaMacedo MaysaMacedo force-pushed the support-jobset-waitforpodready branch from 2ccaac0 to 1a70b19 Compare December 1, 2025 23:21
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Dec 2, 2025
@MaysaMacedo MaysaMacedo force-pushed the support-jobset-waitforpodready branch 2 times, most recently from 2bb5d5e to 6457364 Compare December 3, 2025 22:09
@MaysaMacedo MaysaMacedo force-pushed the support-jobset-waitforpodready branch 6 times, most recently from 0963962 to 56f8699 Compare December 4, 2025 12:12
@MaysaMacedo
Copy link
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 4, 2025
@MaysaMacedo MaysaMacedo force-pushed the support-jobset-waitforpodready branch from 56f8699 to 5f64777 Compare December 4, 2025 12:38
When a dependsOn is defined on a JobSet and the waitForPodsReady
is configured in Kueue, the PodsReady condition should be udpated
only for the Jobs that are started, which is not the case currently,
all the jobs, including the ones with dependsOn that are not started
are being considered when checking if the respective Pods are ready.
This commit fixes the issue by ensuring that whenever there is some
updated in the job, it means the job has started to be managed and
it should be taken into account when defining if all the Pods are
ready.
@MaysaMacedo MaysaMacedo force-pushed the support-jobset-waitforpodready branch from 5f64777 to a82cad0 Compare December 5, 2025 14:23
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: MaysaMacedo, olekzabl
Once this PR has been reviewed and has the lgtm label, please assign mimowo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

status, ok := jobsStatus[replicatedJob.Name]
// Register the amount of expected replicas from jobs that already has
// some status updated or doesn't have a dependsOn.
if replicatedJob.DependsOn == nil || (ok && status.Active+status.Failed+status.Ready+status.Succeeded+status.Suspended > 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I think we here changes are required:

  1. replicatedJob.DependsOn == nil
  2. replicatedJob.DependsOn == Ready && status.Ready+status.Succeeded >= spec.ReplicatedJobs[idx of the Job]
  3. replicatedJob.DependsOn=Completed && status.Succeeded >= spec.ReplicatedJobs[idx of the Job]

So:

  1. is easy, as before
  2. we delay the check until the downstream job is Ready, which in practice should be quick
  3. we delay the check until the downstream job is Complete, this can take arbitrarily long

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow your recommendation.
Why do we need to differentiate between a job having DependsOn ready or complete?
isn't it enough to verify if any status has changed for that job and that would mean the dependsOn dependency was satisfied and job is started, consequently that job should be taken into account to define the podsReady?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I might be missing something as well, but this is my understanding.

Consider this example:

  1. JobA dependsOn=Complete JobB
  2. JobB is running for 1h, so JobA hasn't started yet

Now question: should we declare the JobSet as PodsReady=true, or PodsReady=false?

I was initially thinking that in this case PodsReady=true, because the JobSet if making progress. If we declared PodsReady=False in this case, then it would require the administrator to set waitForPodsReady.timeout > 1h.

OTOH, maybe this case, JobB taking 1h is a bit artificial.

If we use my recommendation and declare PodsReady=true for the JobSet, then in this case the JobSet may transition PodsReady=true to PodsReady=false once the JobB is complete. This admin could compensate by increasing the waitForPodsReady.recoveryTimeout, but it is also not indeed the primary use case of the timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current version of this PR is achieving your idea I think:

  • Job B is running
  • Job A dependsOn JobB complete

as the only job running is job B the status is podReady=true, once that is completed jobA starts and podReady condition might transition to False if the PodA taking a bit to be ready, for example downloading image, and then it transitions to podReady when the podA is ready.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rationale I tried to follow in my implementation is:
The jobs that should be taken into account to set podsReady are the ones that have started.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, can you point me to a test for that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, interesting idea. I will continue tomorrow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@MaysaMacedo MaysaMacedo Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The jobs that should be taken into account to set podsReady are the ones that have started.

when I mentioned have started, I meant that the job has any status updates for status.Active+status.Failed+status.Ready+status.Succeeded+status.Suspended.
From what I tested, If the job hasn't started to be handled it would have all as 0 in the status.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

JobSet integration: Support DependsOn for PodsReady

5 participants