Using Cluster Autoscaler

Introduction

The Cluster Autoscaler is a tool that automatically adjusts the size of a cluster, adding or removing nodes depending on demand. When there are pending pods that no existing node can accommodate, the autoscaler adds new nodes. Conversely, when nodes are underutilized and their workloads can run efficiently on fewer nodes, it removes them.

To maximize resource optimization and cost efficiency in an automated way, set resource requests on your workloads and create Horizontal Pod Autoscalers for them. When the HPA adds replicas and no node has enough free capacity, those replicas remain pending, which in turn triggers the Cluster Autoscaler to add nodes.
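As a minimal illustration (the deployment name and all values here are placeholders, not recommendations), you could set requests on an existing workload and attach an HPA to it:

shell
# Give the workload explicit resource requests so pending replicas translate
# into a clear capacity demand for the Cluster Autoscaler
kubectl -n default set resources deployment my-app --requests=cpu=500m,memory=512Mi

# Scale between 2 and 20 replicas based on CPU utilization
kubectl -n default autoscale deployment my-app --cpu-percent=70 --min=2 --max=20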

Syself Autopilot provides seamless integration with Cluster Autoscaler on Cluster API.

Supported Configuration

Syself Autopilot supports the Cluster Autoscaler deployment within the workload cluster using service account credentials. This is achieved with an independent management cluster configuration, as illustrated below:

[Diagram: Syself supported Cluster Autoscaler configuration]

This mode of operation is referred to as "incluster-kubeconfig". For additional details, refer to the Cluster Autoscaler Helm Chart documentation.

Installation Guide

Step 1: Secret Creation in the Workload Cluster

For the autoscaler running in the workload cluster to function correctly, it must authenticate with the management cluster. This is achieved by retrieving the corresponding secret from the Autopilot management cluster and templating a new secret in the workload cluster.

Command:

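The exact secret name in the management cluster is specific to your Autopilot setup; the following is a minimal sketch only, assuming a source secret containing a management-cluster kubeconfig under the key value exists in your org- namespace, and that your kubectl contexts are named management and workload:

shell
# Read the management-cluster kubeconfig from the (hypothetical) source secret;
# replace <source-secret-name>, namespace, and contexts with your own values
kubectl --context management -n org-demo get secret <source-secret-name> \
  -o jsonpath='{.data.value}' | base64 -d > mgmt-kubeconfig

# Create the secret the autoscaler chart expects in the workload cluster
kubectl --context workload -n kube-system create secret generic autopilot-kubeconfig \
  --from-file=value=mgmt-kubeconfig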

One-Step Command with Namespace and Workload Cluster Context:

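Equivalently, the two steps above can be combined into a single pipeline. Again a sketch; secret, namespace, and context names are placeholders:

shell
# Copy the kubeconfig from the management cluster straight into the workload cluster
kubectl --context management -n org-demo get secret <source-secret-name> -o jsonpath='{.data.value}' \
  | base64 -d \
  | kubectl --context workload -n kube-system create secret generic autopilot-kubeconfig \
      --from-file=value=/dev/stdin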

Step 2: Deploying the Cluster Autoscaler

We recommend using the cluster-autoscaler Helm chart to deploy the Cluster Autoscaler to the workload cluster.

Add Helm Repository

shell
helm repo add autoscaler https://kubernetes.github.io/autoscaler
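
After adding the repository, refresh the local chart index so the latest chart version is available:

shell
helm repo update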

Install Cluster Autoscaler

Please update the following configurations:

  1. Set the value of autoDiscovery.labels[0].namespace to match the namespace of your Cluster object. Note that it starts with the prefix org- followed by your organization's name.
  2. Modify autoDiscovery.clusterName to reflect the name of your cluster. Both values can be read from the Cluster object in the management cluster, as shown in the lookup below.
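
If you are unsure of these values, they can be read from the management cluster. A sketch, assuming your management-cluster kubectl context is named management:

shell
# The NAMESPACE column is your org- namespace, the NAME column is the cluster name
kubectl --context management get clusters -A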
shell
helm template -n kube-system cluster-autoscaler autoscaler/cluster-autoscaler \
  --set fullnameOverride="cluster-autoscaler" \
  --set cloudProvider="clusterapi" \
  --set clusterAPIMode="incluster-kubeconfig" \
  --set clusterAPIKubeconfigSecret="autopilot-kubeconfig" \
  --set clusterAPICloudConfigPath="/etc/kubernetes/value" \
  --set autoDiscovery.clusterName=my-cluster \
  --set "autoDiscovery.labels[0].namespace=org-demo" \
  --set extraArgs.scan-interval=30s \
  --set extraArgs.scale-down-unneeded-time=5m \
  --set extraArgs.scale-down-utilization-threshold=0.7 \
  --set extraArgs.skip-nodes-with-system-pods=false \
  --set extraArgs.skip-nodes-with-local-storage=false \
  --set extraArgs.expander=least-waste \
  --set extraArgs.v=4 \
  --set extraArgs.unremovable-node-recheck-timeout=5m \
  --set resources.requests.cpu=200m \
  --set resources.requests.memory=400Mi \
  --set resources.limits.memory=400Mi \
  | kubectl -n kube-system apply -f -

To optimize autoscaling behavior, specific flags can be adjusted. When deploying the Cluster Autoscaler with Helm, these flags can be passed using the extraArgs parameter, as in the example below; the table that follows lists the available flags and their defaults.

shell
helm template ... --set extraArgs.scale-down-unneeded-time=10m --set extraArgs.scan-interval=30s ...

| Flag | Description | Default |
| --- | --- | --- |
| enforce-node-group-min-size | Should CA scale up the node group to the configured min size if needed | false |
| scale-down-delay-after-add | How long after scale up that scale down evaluation resumes | 10 minutes |
| scale-down-delay-after-delete | How long after node deletion that scale down evaluation resumes, defaults to scan-interval | scan-interval |
| scale-down-delay-after-failure | How long after scale down failure that scale down evaluation resumes | 3 minutes |
| scale-down-unneeded-time | How long a node should be unneeded before it is eligible for scale down | 10 minutes |
| scale-down-unready-time | How long an unready node should be unneeded before it is eligible for scale down | 20 minutes |
| scale-down-utilization-threshold | Node utilization level, defined as sum of requested resources divided by capacity, below which a node can be considered for scale down | 0.5 |
| scale-down-non-empty-candidates-count | Maximum number of non-empty nodes considered in one iteration as candidates for scale down with drain. Lower value means better CA responsiveness but possibly slower scale down latency; higher value can affect CA performance with big clusters (hundreds of nodes). Set to a non-positive value to turn this heuristic off; CA will not limit the number of nodes it considers. | 30 |
| scale-down-candidates-pool-ratio | A ratio of nodes that are considered as additional non-empty candidates for scale down when some candidates from the previous iteration are no longer valid. Lower value means better CA responsiveness but possibly slower scale down latency; higher value can affect CA performance with big clusters (hundreds of nodes). Set to 1.0 to turn this heuristic off; CA will take all nodes as additional candidates. | 0.1 |
| scale-down-candidates-pool-min-count | Minimum number of nodes that are considered as additional non-empty candidates for scale down when some candidates from the previous iteration are no longer valid. When calculating the pool size for additional candidates we take max(#nodes * scale-down-candidates-pool-ratio, scale-down-candidates-pool-min-count). | 50 |
| scan-interval | How often the cluster is reevaluated for scale up or down | 10 seconds |
| max-nodes-total | Maximum number of nodes in all node groups. Cluster autoscaler will not grow the cluster beyond this number. | 0 |
| cores-total | Minimum and maximum number of cores in the cluster, in the format <min>:<max>. Cluster autoscaler will not scale the cluster beyond these numbers. | 0:320000 |
| memory-total | Minimum and maximum number of gigabytes of memory in the cluster, in the format <min>:<max>. Cluster autoscaler will not scale the cluster beyond these numbers. | 0:6400000 |
| max-node-provision-time | Maximum time CA waits for a node to be provisioned | 15 minutes |
| emit-per-nodegroup-metrics | If true, emit per node group metrics | false |
| estimator | Type of resource estimator to be used in scale up | binpacking |
| expander | Type of node group expander to be used in scale up | random |
| ignore-daemonsets-utilization | Whether DaemonSet pods will be ignored when calculating resource utilization for scaling down | false |
| ignore-mirror-pods-utilization | Whether mirror pods will be ignored when calculating resource utilization for scaling down | false |
| write-status-configmap | Should CA write status information to a configmap | true |
| status-config-map-name | The name of the status ConfigMap that CA writes | cluster-autoscaler-status |
| max-inactivity | Maximum time from last recorded autoscaler activity before automatic restart | 10 minutes |
| max-failing-time | Maximum time from last recorded successful autoscaler run before automatic restart | 15 minutes |
| balance-similar-node-groups | Detect similar node groups and balance the number of nodes between them | false |
| skip-nodes-with-system-pods | If true, cluster autoscaler will never delete nodes with pods from kube-system (except for DaemonSet or mirror pods) | true |
| skip-nodes-with-local-storage | If true, cluster autoscaler will never delete nodes with pods with local storage, e.g. EmptyDir or HostPath | true |
| skip-nodes-with-custom-controller-pods | If true, cluster autoscaler will never delete nodes with pods owned by custom controllers | true |
| daemonset-eviction-for-empty-nodes | Whether DaemonSet pods will be gracefully terminated from empty nodes | false |
| daemonset-eviction-for-occupied-nodes | Whether DaemonSet pods will be gracefully terminated from non-empty nodes | true |
| record-duplicated-events | Enable the autoscaler to print duplicated events within a 5 minute window | false |

note

The above flags focus on scaling behaviors. Adjust these based on the specific demands and characteristics of your workloads to achieve optimal scaling performance.

Step 3: Verification

To ensure that the deployment was successful, check if the pod is running using the following command:

shell
kubectl -n kube-system get pods -l "app.kubernetes.io/name=clusterapi-cluster-autoscaler"
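
If the pod is not running as expected, the autoscaler logs and its status ConfigMap are the quickest places to look. A sketch, assuming the chart was installed with fullnameOverride="cluster-autoscaler" and write-status-configmap left at its default, as in the command above:

shell
# Follow the autoscaler logs
kubectl -n kube-system logs deployment/cluster-autoscaler --tail=100

# Inspect the status ConfigMap written by the autoscaler
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml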

Configuration Guide

Syself Autopilot, in conjunction with Cluster Autoscaler, offers advanced autoscaling configurations. By adding specific annotations to your machine deployments, you can tailor autoscaling behavior to fit your needs.

Zero-Scale Capability for hcloud Machines

For those using hcloud machines, node groups can be scaled all the way down to zero. This is invaluable when particular node groups aren't actively running any pods, allowing them to scale down fully and save on infrastructure costs.

Advantages

  • Economic Efficiency: Only pay for active resources. Node groups can fully scale down during times of low demand, resulting in savings.
  • Optimal Resource Use: Reduces idle resources, leading to a more efficient infrastructure.
  • Adaptability: Ideal for fluctuating environments like development or testing.

To activate this zero-scale feature on hcloud machines, annotate your machineDeployments with cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "0".

Example Configuration

yaml
workers:
  machineDeployments:
    - class: workeramd64hcloud
      failureDomain: fsn1
      metadata:
        annotations:
          cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "5"
          cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "0"
      name: md-0
      variables:
        overrides:
          - name: workerMachineTypeHcloud
            value: cpx31
          - name: workerMachinePlacementGroupNameHcloud
            value: md-0
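
After updating the cluster, you can check on the management cluster that the annotations are in place and watch the replica count drop to zero when the node group is idle. A sketch, assuming your management-cluster kubectl context is named management and your namespace is org-demo:

shell
# List MachineDeployments and their replica counts in your org- namespace
kubectl --context management -n org-demo get machinedeployments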