-
Notifications
You must be signed in to change notification settings - Fork 61
Freeze scaling when unhealthy nodes found and remove them #271
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
vincentportella
merged 11 commits into
master
from
vportella/add-nodegroup-health-checking
Sep 4, 2025
+1,005
−93
Merged
Changes from 8 commits
Commits
Show all changes
11 commits
Select commit
Hold shift + click to select a range
fa058da
Taint/terminate nodes which do not become ready in time
vincentportella 109a2b0
Remove code for testing
vincentportella 7d084d1
Pause all scaling activitly when nodegroup is unhealthy and try termi…
vincentportella 31388ea
fix linting
vincentportella cd09b32
Remove stale comment
vincentportella 01befbe
Fixes based on comments
vincentportella b7609ea
Revert ScaleUp since no longer needs to be changed
vincentportella d471bb6
Updated documentation for clarify + fix more nits
vincentportella 49eb5c7
Fix unreachable logic with equivalent implementation
vincentportella 3255b01
Tweak max unhealthy nodes to make max 99% + add more tests
vincentportella 3042c3a
fix test name
vincentportella File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -29,12 +29,12 @@ node_groups: | |
| taint_effect: NoExecute | ||
| max_node_age: 24h | ||
| aws: | ||
| fleet_instance_ready_timeout: 1m | ||
| launch_template_id: lt-1a2b3c4d | ||
| launch_template_version: "1" | ||
| lifecycle: on-demand | ||
| instance_type_overrides: ["t2.large", "t3.large"] | ||
| resource_tagging: false | ||
| fleet_instance_ready_timeout: 1m | ||
| launch_template_id: lt-1a2b3c4d | ||
| launch_template_version: "1" | ||
| lifecycle: on-demand | ||
| instance_type_overrides: ["t2.large", "t3.large"] | ||
| resource_tagging: false | ||
| ``` | ||
|
|
||
| ## Options | ||
|
|
@@ -273,3 +273,29 @@ When not at the minimum, the natural scaling up and down of the node group will | |
| node group. | ||
|
|
||
| This is an optional feature and by default is disabled. | ||
|
|
||
| ### `unhealthy_node_grace_period` | ||
|
|
||
| Defines the minimum age of a node before it can be tested to check if it is unhealthy. | ||
|
|
||
| When enabled, instances can be tested periodically to determine if they are healthy. Escalator will pause all scaling activity and flush out unhealthy instances if they go above a configured threshold for the nodegroup. It will continuously do this until enough instances in the nodegroup are healthy and normal scaling activity can resume. | ||
|
|
||
| Cordoned nodes are skipped and can never be considered unhealthy. | ||
|
|
||
| This is an optional field. The default value is empty, which disables the feature. | ||
|
|
||
| ### `health_check_newest_nodes_percent` | ||
|
|
||
| **[Only used if `unhealthy_node_grace_period` is set.]** | ||
|
|
||
| The percentage of nodes (ordered by age from newer to older) in the nodegroup that are considered when checking for the maximum allowed unhealthy nodes in the nodegroup. The nodes captured by this percentage form the "test set" to be checked. Only nodes which are older than `unhealthy_node_grace_period` will be included in the test set. | ||
|
|
||
| This field is required. | ||
|
|
||
| ### `max_unhealthy_nodes_percent` | ||
|
|
||
| **[Only used if `unhealthy_node_grace_period` is set.]** | ||
|
|
||
| The maximum percentage of unhealthy nodes in the test set from `health_check_newest_nodes_percent`. Beyond this threshold all scaling activity is paused and unhealthy nodes are flushed out. | ||
|
|
||
| This is an optional field. If not set, it will default to `0%`. | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just to clarify, if set to 0% this means any unhealthy node will pause scaling?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes |
||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: list the default values for all of these for quick reference. As I believe these default to being turned off, list some good starting points.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
health_check_newest_nodes_percentis the only one that is technically required ifunhealthy_node_grace_periodis set. I don't see the point to setting a default value forunhealthy_node_grace_periodandhealth_check_newest_nodes_percentbecause those are required to use the feature. Setting a default formax_unhealthy_nodes_percentagemakes sense because that one can still be used when not set