
Default node termination policy is inappropriate for StatefulSets #177

@akshayks

Description


We recently introduced escalator into our platform to autoscale worker nodes in EKS. It has worked great for stateless workloads running as Kubernetes Deployments. However, when we attempted to use escalator for ASGs backing StatefulSets, we observed behavior that was, at least initially, unexpected.

When scaling down a StatefulSet, Kubernetes terminates pods in reverse order of the pod ordinal / index: for a StatefulSet with n replicas, pod n-1 is terminated first, followed by pod n-2, and so on. Given that we run a single StatefulSet pod per EKS worker node, what we see is that Kubernetes scales down the newest pod (the one with the highest ordinal) while escalator terminates the oldest node first. This results in the following sequence of events (a sketch illustrating the mismatch follows the list):

  1. Kubernetes terminates pod n-1 running on node n-1 (one of the newer nodes).
  2. escalator determines a scale down action is required.
  3. escalator terminates node 0 (the oldest node). This results in pod 0 being evicted and ending up in the Pending state.
  4. Kubernetes reschedules pod 0 on node n-1.
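
To make the mismatch concrete, here is a minimal, self-contained Go sketch. The `node` type and `oldestFirst` function are hypothetical stand-ins for illustration only, not escalator's actual types or selection code; the only behavior taken from above is that the oldest node is chosen for termination.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// node is a simplified stand-in for a worker node; escalator's real
// types differ, so treat this as an illustration only.
type node struct {
	name    string
	created time.Time
}

// oldestFirst mirrors the default termination policy described above:
// candidates are ordered by age and the oldest is picked, regardless
// of which node the StatefulSet just vacated.
func oldestFirst(nodes []node) node {
	sort.Slice(nodes, func(i, j int) bool {
		return nodes[i].created.Before(nodes[j].created)
	})
	return nodes[0]
}

func main() {
	now := time.Now()
	nodes := []node{
		{"node-0", now.Add(-3 * time.Hour)}, // oldest; still runs pod 0
		{"node-1", now.Add(-2 * time.Hour)},
		{"node-2", now.Add(-1 * time.Hour)}, // newest; just vacated by the StatefulSet
	}
	// Prints "terminate: node-0": the occupied node, not the empty one.
	fmt.Println("terminate:", oldestFirst(nodes).name)
}
```

Under this policy the still-occupied node-0 is always picked, even though node-2 was just vacated by the StatefulSet scale-down.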

As you can imagine, this sequence is disruptive: a stateful workload is forcefully relocated when terminating the newly vacated node would have been sufficient. Reading through the escalator documentation suggested this was indeed the expected behavior. However, these two statements in particular seemed to contradict each other:

https://github.com/atlassian/escalator/blob/master/docs/scale-process.md

If hard_delete_grace_period is reached, the node will be terminated regardless, even if there are sacred pods running on it.

https://github.com/atlassian/escalator/blob/master/docs/configuration/nodegroup.md

Remove any nodes that have already been tainted and have exceeded the grace period and are considered empty.

In summary, I have the following comments / questions:

  • Do you believe the docs need updating to modify the second statement above? It appears escalator does not exclude non-empty nodes when scaling down, while the statement suggests it terminates only empty nodes.
  • To work around the issue, we modified escalator to only consider empty nodes during scale down (a rough sketch of the filter follows this list). Obviously, this violates assumptions in the rest of the code base and causes unit test failures. What would be the most idiomatic fix for our problem? I suppose Different node selection methods for termination #105 would help, but I don't see any progress on that ticket at this time.
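
For concreteness, the shape of our workaround is roughly the following. The `pod`, `node`, and `emptyNodesOnly` names are hypothetical; the real patch operates on escalator's internal structures rather than these simplified types.

```go
package main

import "fmt"

// pod and node are hypothetical stand-ins for the cached Kubernetes
// objects escalator actually works with.
type pod struct{ nodeName string }
type node struct{ name string }

// emptyNodesOnly restricts scale-down candidates to nodes that have no
// pods scheduled on them. This avoids evicting StatefulSet pods but, as
// noted above, violates assumptions elsewhere in the code base.
func emptyNodesOnly(nodes []node, pods []pod) []node {
	occupied := make(map[string]bool, len(pods))
	for _, p := range pods {
		occupied[p.nodeName] = true
	}
	var empty []node
	for _, n := range nodes {
		if !occupied[n.name] {
			empty = append(empty, n)
		}
	}
	return empty
}

func main() {
	nodes := []node{{"node-0"}, {"node-1"}, {"node-2"}}
	pods := []pod{{"node-0"}, {"node-1"}} // node-2 was just vacated
	// Prints "[{node-2}]": only the empty node remains a candidate.
	fmt.Println(emptyNodesOnly(nodes, pods))
}
```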

Thanks in advance for any help on this topic and for an amazing product.


Labels: help wanted, question
