
Default node termination policy is inappropriate for StatefulSets #177

@akshayks

Description


We recently introduced escalator into our platform to autoscale worker nodes in EKS. It has worked great for stateless workloads running as Kubernetes Deployments. However, when we attempted to use escalator for ASGs backing StatefulSets, we observed behavior that was, at least initially, unexpected.

When scaling down a StatefulSet, Kubernetes terminates pods in reverse order of the pod ordinal / index: for a StatefulSet with n replicas, pod n-1 is terminated first, followed by pod n-2, and so on. Given that we run a single StatefulSet pod per EKS worker node, what we see is that Kubernetes scales down the newest pod (the one with the highest ordinal) while escalator terminates the oldest node first. This results in the following sequence of events (a sketch illustrating the mismatch follows the list):

  1. Kubernetes terminates pod n-1 running on node n-1 (one of the newer nodes).
  2. escalator determines a scale down action is required.
  3. escalator terminates node 0 (the oldest node). This results in pod 0 being evicted and ending up in the Pending state.
  4. Kubernetes reschedules pod 0 on node n-1.
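
To make the mismatch concrete, here is a minimal, self-contained Go sketch. The `node` type and `oldestFirst` function are hypothetical stand-ins for illustration only, not escalator's actual types or selection code; the only behavior taken from above is that the oldest node is chosen for termination.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// node is a simplified stand-in for a worker node; escalator's real
// types differ, so treat this as an illustration only.
type node struct {
	name    string
	created time.Time
}

// oldestFirst mirrors the default termination policy described above:
// candidates are ordered by age and the oldest is picked, regardless
// of which node the StatefulSet just vacated.
func oldestFirst(nodes []node) node {
	sort.Slice(nodes, func(i, j int) bool {
		return nodes[i].created.Before(nodes[j].created)
	})
	return nodes[0]
}

func main() {
	now := time.Now()
	nodes := []node{
		{"node-0", now.Add(-3 * time.Hour)}, // oldest; still runs pod 0
		{"node-1", now.Add(-2 * time.Hour)},
		{"node-2", now.Add(-1 * time.Hour)}, // newest; just vacated by the StatefulSet
	}
	// Prints "terminate: node-0": the occupied node, not the empty one.
	fmt.Println("terminate:", oldestFirst(nodes).name)
}
```

Under this policy the still-occupied node-0 is always picked, even though node-2 was just vacated by the StatefulSet scale-down.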

As you can imagine, this sequence is disruptive: a stateful workload is forcefully relocated when terminating the newly vacated node would have been sufficient. Reading through the escalator documentation suggested this was indeed the expected behavior. However, these two statements in particular seemed to contradict each other:

https://github.com/atlassian/escalator/blob/master/docs/scale-process.md

If hard_delete_grace_period is reached, the node will be terminated regardless, even if there are sacred pods running on it.

https://github.com/atlassian/escalator/blob/master/docs/configuration/nodegroup.md

Remove any nodes that have already been tainted and have exceeded the grace period and are considered empty.

In summary, I have the following comments / questions:

  • Do you believe the docs need updating to modify the second statement above? It appears escalator does not exclude non-empty nodes when scaling down, while the statement suggests it terminates only empty nodes.
  • To work around the issue, we modified escalator to only consider empty nodes during scale down (a rough sketch of the filter follows this list). Obviously, this violates assumptions in the rest of the code base and causes unit test failures. What would be the most idiomatic fix for our problem? I suppose Different node selection methods for termination #105 would help, but I don't see any progress on that ticket at this time.
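
For concreteness, the shape of our workaround is roughly the following. The `pod`, `node`, and `emptyNodesOnly` names are hypothetical; the real patch operates on escalator's internal structures rather than these simplified types.

```go
package main

import "fmt"

// pod and node are hypothetical stand-ins for the cached Kubernetes
// objects escalator actually works with.
type pod struct{ nodeName string }
type node struct{ name string }

// emptyNodesOnly restricts scale-down candidates to nodes that have no
// pods scheduled on them. This avoids evicting StatefulSet pods but, as
// noted above, violates assumptions elsewhere in the code base.
func emptyNodesOnly(nodes []node, pods []pod) []node {
	occupied := make(map[string]bool, len(pods))
	for _, p := range pods {
		occupied[p.nodeName] = true
	}
	var empty []node
	for _, n := range nodes {
		if !occupied[n.name] {
			empty = append(empty, n)
		}
	}
	return empty
}

func main() {
	nodes := []node{{"node-0"}, {"node-1"}, {"node-2"}}
	pods := []pod{{"node-0"}, {"node-1"}} // node-2 was just vacated
	// Prints "[{node-2}]": only the empty node remains a candidate.
	fmt.Println(emptyNodesOnly(nodes, pods))
}
```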

Thanks in advance for any help on this topic and for an amazing product.


Labels: help wanted, question
