Skip to content

Conversation

@vincentportella
Copy link
Member

@vincentportella vincentportella commented Aug 27, 2025

  • Added a new configurable test to determine if a nodegroup is unhealthy
  • Added a new metric to report if a nodegroup is unhealthy
  • When a nodegroup is unhealthy, pause all scaling activity
  • Taint and remove the unhealthy nodes which will trigger the nodegroup to become healthy again
  • Rinse and repeat until healthy nodes come up again

@vincentportella vincentportella changed the title Freeze scaling when unhealthy found and remove them Freeze scaling when unhealthy nodes found and remove them Aug 27, 2025
Copy link
Collaborator

@mwhittington21 mwhittington21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic looks good, just a lot of minor changes I'd like to see. Nice work


This is an optional feature and by default is disabled.

### `unhealthy_node_grace_period`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: list the default values for all of these for quick reference. As I believe these default to being turned off, list some good starting points.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

health_check_newest_nodes_percent is the only one that is technically required if unhealthy_node_grace_period is set. I don't see the point to setting a default value for unhealthy_node_grace_period and health_check_newest_nodes_percent because those are required to use the feature. Setting a default for max_unhealthy_nodes_percentage makes sense because that one can still be used when not set

TaintEffect: "invalid",
MaxNodeAge: "bla",
UnhealthyNodeGracePeriod: "bla",
MaxUnhealthyNodesPercent: 101,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: max being 101 would imply that the value can be set to 101%. Is that true? If you name the variable max I'd prefer a <= comparison rather than < 101. It makes more logical sense.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is how it is set in node_group.go. This is the test here to make sure that 101 is higher. I don't follow the concern here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, it was a test value rather than the real comparison value. You wanted it to be invalid - I missed that.


The maximum percentage of unhealthy nodes in the test set from `health_check_newest_nodes_percent`. Beyond this threshold all scaling activity is paused and unhealthy nodes are flushed out.

This is an optional field. If not set, it will default to `0%`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, if set to 0% this means any unhealthy node will pause scaling?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

if err != nil {
log.Errorf("Failed to add nodes because of an error. Skipping cloud provider node group scaleup: %v", err)
return 0, err
if opts.nodesDelta > 0 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this if statement is redundant, it is the same as the outer if statement.

}
if opts.nodesDelta > 0 {
// check that untainting the nodes doesn't do bring us over max nodes
if opts.nodesDelta <= 0 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: this can never be hit, is this the correct logic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted it back to the original implementation, but yeah that can never be hit. I'll refactor to an implementation which is equivalent

Copy link
Collaborator

@mwhittington21 mwhittington21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just the minor nit around unreachable if logic that should be looked at

mwhittington21
mwhittington21 previously approved these changes Sep 4, 2025
Copy link
Collaborator

@mwhittington21 mwhittington21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Collaborator

@mwhittington21 mwhittington21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@vincentportella vincentportella merged commit 5c795cf into master Sep 4, 2025
6 checks passed
@vincentportella vincentportella deleted the vportella/add-nodegroup-health-checking branch September 4, 2025 00:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants