Freeze scaling when unhealthy nodes found and remove them #271

vincentportella · 2025-08-27T07:47:39Z

- Added a new configurable test to determine if a nodegroup is unhealthy
- Added a new metric to report if a nodegroup is unhealthy
- When a nodegroup is unhealthy, pause all scaling activity
- Taint and remove the unhealthy nodes, which will trigger the nodegroup to become healthy again
- Rinse and repeat until healthy nodes come up again

mwhittington21

Logic looks good, just a lot of minor changes I'd like to see. Nice work

mwhittington21 · 2025-08-28T06:21:00Z

docs/configuration/nodegroup.md


 This is an optional feature and by default is disabled.
+
+### `unhealthy_node_grace_period`


nit: list the default values for all of these for quick reference. As I believe these default to being turned off, list some good starting points.

`health_check_newest_nodes_percent` is the only one that is technically required if `unhealthy_node_grace_period` is set. I don't see the point of setting a default value for `unhealthy_node_grace_period` and `health_check_newest_nodes_percent`, because those are required to use the feature. Setting a default for `max_unhealthy_nodes_percentage` makes sense because that one can still be used when not set.
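The requirement described in this reply could be expressed as a small validation rule. The sketch below is illustrative only: the struct, its field types, and the `Enabled`/`Validate` methods are assumptions, not the PR's actual `node_group.go` code.

```go
package main

import (
	"errors"
	"fmt"
)

// HealthCheckOptions mirrors the three settings named in the thread.
// The Go field names and types are assumptions for this sketch.
type HealthCheckOptions struct {
	UnhealthyNodeGracePeriod      string // empty means the feature is off
	HealthCheckNewestNodesPercent int    // 0 means "not set"
	MaxUnhealthyNodesPercent      int    // zero value doubles as the 0% default
}

// Enabled assumes the feature is switched on by setting the grace period.
func (o HealthCheckOptions) Enabled() bool {
	return o.UnhealthyNodeGracePeriod != ""
}

// Validate only insists on the newest-nodes percentage once the feature
// is enabled; the max-unhealthy percentage may stay at its default.
func (o HealthCheckOptions) Validate() error {
	if !o.Enabled() {
		return nil
	}
	if o.HealthCheckNewestNodesPercent <= 0 {
		return errors.New("health_check_newest_nodes_percent is required when unhealthy_node_grace_period is set")
	}
	return nil
}

func main() {
	opts := HealthCheckOptions{UnhealthyNodeGracePeriod: "10m"}
	fmt.Println(opts.Validate())
}
```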

pkg/controller/scale_down.go

pkg/controller/controller.go

pkg/controller/util_test.go

pkg/controller/util.go

mwhittington21 · 2025-08-28T07:02:22Z

pkg/controller/node_group_test.go

 					TaintEffect:                        "invalid",
 					MaxNodeAge:                         "bla",
+					UnhealthyNodeGracePeriod:           "bla",
+					MaxUnhealthyNodesPercent:           101,


issue: max being 101 would imply that the value can be set to 101%. Is that true? If you name the variable max I'd prefer a <= comparison rather than < 101. It makes more logical sense.

That is how it is set in node_group.go. This is the test here to make sure that 101 is higher. I don't follow the concern here.

You're right, it was a test value rather than the real comparison value. You wanted it to be invalid - I missed that.
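To make the distinction in this exchange concrete, here is a hedged sketch of a bounds check written with an inclusive `<=` against a named maximum, with `101` used purely as an out-of-range test input. The constant `maxAllowedPercent` and the function `validPercent` are hypothetical. The bound of 100 is an assumption for illustration; the real limit lives in `node_group.go`, and a later commit in this PR mentions a maximum of 99%.

```go
package main

import "fmt"

// maxAllowedPercent is an assumed upper bound for this sketch.
const maxAllowedPercent = 100

// validPercent uses an inclusive comparison against the named maximum,
// so the constant states the largest accepted value directly.
func validPercent(p int) bool {
	return p >= 0 && p <= maxAllowedPercent
}

func main() {
	// 101 is a deliberately invalid test value, as in the test case above.
	for _, p := range []int{-1, 0, 100, 101} {
		fmt.Printf("%d valid=%t\n", p, validPercent(p))
	}
}
```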

pkg/controller/scale_up.go

mwhittington21 · 2025-09-02T06:45:47Z

docs/configuration/nodegroup.md

+
+The maximum percentage of unhealthy nodes in the test set from `health_check_newest_nodes_percent`. Beyond this threshold all scaling activity is paused and unhealthy nodes are flushed out.
+
+This is an optional field. If not set, it will default to `0%`.


Just to clarify, if set to 0% this means any unhealthy node will pause scaling?
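The thread as captured here does not include an answer, but the arithmetic can be worked through under one reading of the documentation. The sketch below assumes that "beyond this threshold" means a strict greater-than comparison and that the test set is the newest N percent of nodes, rounded up. `testSetSize` and `shouldPause` are hypothetical helpers, not functions from the PR.

```go
package main

import (
	"fmt"
	"math"
)

// testSetSize returns how many of the newest nodes are inspected.
// Rounding up is an assumption made for this sketch.
func testSetSize(total, newestPercent int) int {
	return int(math.Ceil(float64(total) * float64(newestPercent) / 100))
}

// shouldPause reports whether the unhealthy share of the test set is
// strictly above maxPercent. Integer cross-multiplication avoids
// floating-point comparison issues.
func shouldPause(unhealthy, testSet, maxPercent int) bool {
	if testSet == 0 {
		return false
	}
	return unhealthy*100 > maxPercent*testSet
}

func main() {
	set := testSetSize(10, 50) // 5 newest nodes out of 10
	fmt.Println(shouldPause(0, set, 0))  // false: no unhealthy nodes
	fmt.Println(shouldPause(1, set, 0))  // true: 20% is beyond 0%
	fmt.Println(shouldPause(1, set, 20)) // false: 20% is not beyond 20%
}
```

Under these assumptions, a threshold of `0%` does mean that a single unhealthy node in the test set pauses scaling.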

mwhittington21 · 2025-09-02T06:59:48Z

pkg/controller/scale_up.go

-	if err != nil {
-		log.Errorf("Failed to add nodes because of an error. Skipping cloud provider node group scaleup: %v", err)
-		return 0, err
+		if opts.nodesDelta > 0 {


nit: this if statement is redundant, it is the same as the outer if statement.

mwhittington21 · 2025-09-02T07:00:19Z

pkg/controller/scale_up.go

-	}
+	if opts.nodesDelta > 0 {
+		// check that untainting the nodes doesn't do bring us over max nodes
+		if opts.nodesDelta <= 0 {


issue: this can never be hit, is this the correct logic?

I reverted it to the original implementation, but yes, that can never be hit. I'll refactor to an equivalent implementation.
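For readers following the two comments on `scale_up.go`: a check for `nodesDelta <= 0` nested inside a branch guarded by `nodesDelta > 0` is dead code, because both conditions cannot hold at once. One common way to avoid this is to flatten the checks into early returns. The sketch below shows that pattern with a hypothetical `clampScaleUp` helper that caps a scale-up at the maximum node count; it is not the refactor that landed in the PR.

```go
package main

import "fmt"

// clampScaleUp is a hypothetical helper: it caps a requested scale-up so
// the node count never exceeds maxNodes. Each condition is tested once,
// at the top level, so no branch is unreachable.
func clampScaleUp(current, delta, maxNodes int) int {
	if delta <= 0 {
		return 0 // nothing to add
	}
	room := maxNodes - current
	if room <= 0 {
		return 0 // already at or above the maximum
	}
	if delta > room {
		return room // only add what still fits
	}
	return delta
}

func main() {
	fmt.Println(clampScaleUp(8, 5, 10))  // 2
	fmt.Println(clampScaleUp(8, 1, 10))  // 1
	fmt.Println(clampScaleUp(10, 3, 10)) // 0
}
```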

mwhittington21

lgtm, just the minor nit around unreachable if logic that should be looked at

mwhittington21

lgtm

mwhittington21

lgtm

vincentportella added 3 commits August 27, 2025 16:11

Taint/terminate nodes which do not become ready in time

fa058da

Remove code for testing

109a2b0

Pause all scaling activitly when nodegroup is unhealthy and try termi…

7d084d1

…nating tainted unhealthy nodes

vincentportella requested review from MinyiZ, awprice and mwhittington21 August 27, 2025 07:47

vincentportella self-assigned this Aug 27, 2025

fix linting

31388ea

vincentportella changed the title ~~Freeze scaling when unhealthy found and remove them~~ Freeze scaling when unhealthy nodes found and remove them Aug 27, 2025

Remove stale comment

cd09b32

mwhittington21 requested changes Aug 28, 2025

vincentportella added 3 commits September 2, 2025 10:49

Fixes based on comments

01befbe

Revert ScaleUp since no longer needs to be changed

b7609ea

Updated documentation for clarify + fix more nits

d471bb6

mwhittington21 reviewed Sep 2, 2025

mwhittington21 requested changes Sep 2, 2025

Fix unreachable logic with equivalent implementation

49eb5c7

mwhittington21 previously approved these changes Sep 4, 2025

Tweak max unhealthy nodes to make max 99% + add more tests

3255b01

vincentportella dismissed mwhittington21’s stale review via 3255b01 September 4, 2025 00:30

fix test name

3042c3a

mwhittington21 approved these changes Sep 4, 2025

MinyiZ approved these changes Sep 4, 2025

vincentportella merged commit 5c795cf into master Sep 4, 2025
6 checks passed

vincentportella deleted the vportella/add-nodegroup-health-checking branch September 4, 2025 00:45

