Evaluate cluster hibernation cost savings

This document explains how to estimate the cost savings from cluster hibernation using the cluster-hibernate-saving-estimate tool.

Cluster Hibernation Overview

One of the major challenges when using cloud-native clusters is resource underutilization, which leads to unnecessary costs. For instance, non-production clusters such as development, testing, or demo environments often see low utilization outside working hours, yet organizations still pay for these idle resources.

Cluster Hibernation provides a mechanism to automatically or manually manage the suspension and resumption of clusters. By releasing and restoring nodes, it reduces resource consumption and optimizes resource utilization, ultimately leading to cost savings. For example, by configuring a hibernation strategy that shuts down clusters during off-hours (e.g., nights or weekends), companies can significantly reduce unnecessary resource usage and, consequently, operating costs. This makes Cluster Hibernation an effective cost management strategy, especially for non-production environments.

Cluster Hibernation Strategies

Cluster Hibernation works by gradually releasing nodes during the hibernation period while preserving the state of workloads (e.g., Deployments, Jobs) in the cluster. Upon resumption, the cluster nodes and workloads are restored to their pre-hibernation state. Clusters are typically managed through node groups (also known as node pools), which can consist of different node types such as reserved nodes (e.g., subscription-based) and on-demand nodes (e.g., pay-as-you-go or spot instances). Depending on whether a node group supports autoscaling, different hibernation strategies can be applied.

Non-Autoscaling Node Groups

For non-autoscaling node groups, the number of nodes is fixed and does not adjust to workload demand, so the node count must be changed explicitly. During hibernation, the node count is set to zero, prompting the cloud provider to gradually release the nodes in the group. Upon resumption, the node count is restored to its original value. Note that setting the node count to zero does not necessarily release every node: reserved nodes, for instance, are typically retained.
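The hibernate and resume steps for a fixed-size node group can be sketched as follows. This is a minimal illustration, not the platform's implementation: the NodeGroup class, its fields, and both functions are hypothetical, and the sketch assumes that reserved nodes are always retained while on-demand nodes are always released.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeGroup:
    """Hypothetical in-memory model of a fixed-size node group."""
    name: str
    reserved_nodes: int   # subscription-based; assumed to be retained at zero
    on_demand_nodes: int  # pay-as-you-go or spot; assumed to be released
    desired_count: int
    saved_count: Optional[int] = None  # count remembered for resumption


def hibernate(group: NodeGroup) -> int:
    """Set the desired count to zero; return the number of nodes released."""
    if group.saved_count is None:  # do not overwrite on a repeated call
        group.saved_count = group.desired_count
    group.desired_count = 0
    released = group.on_demand_nodes
    group.on_demand_nodes = 0
    return released


def resume(group: NodeGroup) -> None:
    """Restore the node count recorded before hibernation."""
    if group.saved_count is None:
        raise ValueError(f"{group.name} is not hibernated")
    group.desired_count = group.saved_count
    # Reserved nodes never left, so only the difference is re-provisioned.
    group.on_demand_nodes = max(group.saved_count - group.reserved_nodes, 0)
    group.saved_count = None
```

For a group of four reserved nodes and no on-demand nodes, hibernate reports zero released nodes, which is why such a group yields no savings until some nodes are converted.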

Autoscaling Node Groups

Autoscaling node groups dynamically adjust the number of nodes based on actual workloads, optimizing resource usage. For such groups, the hibernation strategy involves modifying adjustable workloads and relying on the autoscaling strategy to downscale the nodes. For example, setting Deployment replicas to zero or pausing CronJobs triggers the autoscaling mechanism to reduce the number of active nodes.
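As a rough sketch of the workload adjustments described above, the functions below build the patch bodies that scale a Deployment to zero or suspend a CronJob, and the patches that undo them. The fields spec.replicas and spec.suspend are standard Kubernetes fields; the annotation key used to remember the original replica count is a hypothetical name chosen for this example, and the functions only build dictionaries rather than calling a cluster API.

```python
# Hypothetical annotation key used to remember the pre-hibernation replicas.
REPLICAS_ANNOTATION = "example.com/pre-hibernation-replicas"


def hibernation_patch(workload: dict) -> dict:
    """Return a patch body that stops the workload from requesting nodes."""
    kind = workload.get("kind")
    if kind == "Deployment":
        # Kubernetes defaults spec.replicas to 1 when it is unset.
        replicas = workload.get("spec", {}).get("replicas", 1)
        return {
            "metadata": {"annotations": {REPLICAS_ANNOTATION: str(replicas)}},
            "spec": {"replicas": 0},
        }
    if kind == "CronJob":
        return {"spec": {"suspend": True}}
    raise ValueError(f"unsupported workload kind: {kind!r}")


def resume_patch(workload: dict) -> dict:
    """Return a patch body that restores the pre-hibernation state."""
    kind = workload.get("kind")
    if kind == "Deployment":
        annotations = workload.get("metadata", {}).get("annotations") or {}
        return {"spec": {"replicas": int(annotations[REPLICAS_ANNOTATION])}}
    if kind == "CronJob":
        # Simplification: assumes the CronJob was not suspended beforehand.
        return {"spec": {"suspend": False}}
    raise ValueError(f"unsupported workload kind: {kind!r}")
```

Once the Pods are gone, the node group's autoscaler is expected to remove the now-empty nodes on its own; the sketch does not touch nodes directly.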

Cluster Hibernation Benefit Evaluation

Hibernation Savings Estimation Tool

We provide the cluster-hibernate-saving-estimate tool to help users assess potential resource savings within a cluster. The tool scans each node group in the cluster and reports an overview: the maximum potential savings, the savings attainable under the current configuration, a recommended action, and the sum of resource requests from Deployments in the node group. Key metrics include:

  • Max Potential Saving: The maximum savings achievable by optimizing the node distribution (e.g., converting reserved nodes to on-demand or spot nodes so that they can be released during hibernation).
  • Potential Saving: The savings attainable under the current configuration of the node group.
  • Sum of Deployment Resource Requests: The total resource requests (CPU, memory) of Pods in the group’s Deployments, serving as a reference for evaluating resource usage. Higher request totals typically indicate greater potential savings in autoscaling node groups.

Note: The Hibernation Savings Estimation Tool can be downloaded from https://github.com/wiseinf/cluster-hibernate-saving-estimate/tags. It currently supports platforms like Alibaba Cloud and AWS.

Let’s explore two common scenarios to evaluate the benefits of Cluster Hibernation.

Scenario 1: Non-Autoscaling Node Group with Reserved Nodes

In non-autoscaling node groups containing only reserved nodes, a typical output is as follows:

NodeGroup: cpu-ng(npa324882932487c9777eaa7f6854e4)  Total Nodes: 4  Autoscaling: false
  OnDemandNodes: 0  
  SpotNodes: 0  
  ReservedNodes: 4  cpu: 32 cores, memory: 128 gib
    Node: cn-beijing.171.19.105.70(i-bp19u4ufadv9niflo1o4) NoSpot, InstanceType: ecs.g7ne.2xlarge, ChargeType: PrePaid
    Node: cn-beijing.171.19.105.74(i-bp19u4ufadv9niflo1o3) NoSpot, InstanceType: ecs.g7ne.2xlarge, ChargeType: PrePaid
    Node: cn-beijing.171.19.105.71(i-bp19u4ufadv9niflo1o5) NoSpot, InstanceType: ecs.g7ne.2xlarge, ChargeType: PrePaid
    Node: cn-beijing.171.19.105.251(i-bp161b0ldoqt1k771t5e) NoSpot, InstanceType: ecs.g7ne.2xlarge, ChargeType: PrePaid
  Max Potential Saving: CPU: 14125.71 core hours; Memory: 56502.86 gib hours
  Potential Saving: No saving, no spot or on demand nodes
  Recommendation: adjust some reserved nodes to on-demand or spot nodes based on its usage
  Sum of Deployment Resource Requests: CPU 8.63 cores, Memory 18.68 gibs

This node group consists of four reserved nodes, with a total of 32 CPU cores and 128 GiB of memory. Assume the node group runs only from 8 AM to 9 PM, Monday to Friday (65 hours per week), and hibernates the rest of the time, including weekends (103 hours per week). With a 30-day month (720 hours), the maximum potential savings are calculated as follows:

  • Maximum CPU savings: 14,125.71 core hours
    Calculation: 32 (CPU cores) * 720 (hours/month) * 103 (hibernation hours/week) / 168 (hours/week)

The same formula applies to memory: 128 (GiB) * 720 (hours/month) * 103 (hibernation hours/week) / 168 (hours/week) = 56,502.86 GiB hours.
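The arithmetic above can be reproduced with a short script. This is a minimal sketch: the function names are hypothetical, the 720-hour month is taken from the calculation above, and the schedule is assumed to keep the cluster active for 13 hours (8 AM to 9 PM) on five weekdays.

```python
HOURS_PER_WEEK = 168
HOURS_PER_MONTH = 720  # 30-day month, as used in the calculation above


def weekly_hibernation_hours(active_hours_per_day: float,
                             active_days_per_week: int) -> float:
    """Hours per week during which the node group is hibernated."""
    return HOURS_PER_WEEK - active_hours_per_day * active_days_per_week


def monthly_saving(capacity: float, hibernation_hours_per_week: float) -> float:
    """Resource-hours saved per month for a capacity in cores or GiB."""
    return capacity * HOURS_PER_MONTH * hibernation_hours_per_week / HOURS_PER_WEEK


hibernated = weekly_hibernation_hours(13, 5)       # 103 hours per week
cpu_saving = round(monthly_saving(32, hibernated), 2)    # 14125.71 core hours
mem_saving = round(monthly_saving(128, hibernated), 2)   # 56502.86 GiB hours
```

Changing the two schedule arguments shows how sensitive the estimate is to the hibernation window; for example, hibernating only on weekends leaves 48 hibernation hours per week.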

In this case, there are no potential savings, as all nodes are reserved and setting the node count to zero does not release any of them. To achieve savings, some reserved nodes should be converted to on-demand nodes.

Note: When adjusting node types, consider the cost differences between reserved and on-demand nodes.

Scenario 2: Autoscaling Node Group with On-Demand Nodes

For an autoscaling node group that contains on-demand nodes, a typical output is as follows:

NodeGroup: as-cpu-ng(np151adb107448039712d3a24f0d50a)  Total Nodes: 12  Autoscaling: true
  OnDemandNodes: 6  cpu: 192 cores, memory: 384 gib
    Node: cn-beijing.171.18.106.158(i-bp6f7txtyecxuauzgk6m) NoSpot, InstanceType: ecs.hfc7.8xlarge, ChargeType: PrePaid
    Node: cn-beijing.171.18.106.76(i-bp6f7txtyecxuauzgk6o) NoSpot, InstanceType: ecs.hfc7.8xlarge, ChargeType: PrePaid
  ...
  Max Potential Saving: CPU: 98880 core hours; Memory: 197760 gib hours
  Potential Saving: CPU: 84754.29 core hours; Memory: 169508.57 gib hours
  Recommendation: no recommendation
  Sum of Deployment Resource Requests: CPU 224.00 cores, Memory 448.00 gibs

For this autoscaling node group, assuming the same schedule as in Scenario 1 (103 hibernation hours per week), the maximum potential CPU savings are calculated as:

  • Maximum CPU savings: 98,880 core hours
    Calculation: 224 (total CPU cores) * 720 (hours/month) * 103 (hibernation hours/week) / 168 (hours/week)

Similarly, the potential CPU savings amount to:

  • Potential CPU savings: 84,754.29 core hours
    Calculation: 192 (CPU cores on on-demand nodes) * 720 (hours/month) * 103 (hibernation hours/week) / 168 (hours/week)

The potential CPU savings are lower than the maximum because, under the current configuration, only the 192 cores on on-demand nodes are counted as releasable. Converting reserved nodes to on-demand nodes could close the gap.
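To compare the two figures programmatically, the saving lines can be parsed from the tool's text output. The sketch below is an assumption-laden example: the regular expression is inferred only from the sample output shown in this document, not from a documented format, and the function names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Pattern inferred from the sample output lines; the real format may differ.
SAVING_RE = re.compile(
    r"CPU:\s*(?P<cpu>[\d.]+)\s*core hours;\s*Memory:\s*(?P<mem>[\d.]+)\s*gib hours"
)


def parse_saving(line: str) -> Optional[Tuple[float, float]]:
    """Return (cpu_core_hours, memory_gib_hours), or None if no saving is reported."""
    match = SAVING_RE.search(line)
    if match is None:
        return None
    return float(match["cpu"]), float(match["mem"])


def realized_cpu_fraction(max_line: str, potential_line: str) -> float:
    """Share of the maximum CPU saving attainable under the current configuration."""
    maximum = parse_saving(max_line)
    potential = parse_saving(potential_line)
    if maximum is None or maximum[0] == 0:
        raise ValueError("maximum saving is missing or zero")
    if potential is None:  # e.g. "No saving, no spot or on demand nodes"
        return 0.0
    return potential[0] / maximum[0]


fraction = realized_cpu_fraction(
    "Max Potential Saving: CPU: 98880 core hours; Memory: 197760 gib hours",
    "Potential Saving: CPU: 84754.29 core hours; Memory: 169508.57 gib hours",
)
```

For the Scenario 2 output, the fraction is roughly 0.857, which matches the ratio 192 / 224 of the core counts used in the two calculations.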

Challenges & Solutions

Workload State Preservation and Restoration

Cluster Hibernation involves suspending and restoring workloads, and users often worry that incomplete restoration will leave workloads unavailable. Because Cluster Hibernation works mainly by releasing and restoring nodes, a workload that fails to recover after hibernation would most likely fail in the same way when a production node is lost or replaced. Cloud-native applications should be resilient to such events, so the best approach is to identify and fix the root cause of the unavailability until the application can withstand node failures. This improves both production stability and resilience during hibernation.

Maintaining Workload Availability During Hibernation

To keep workloads running during hibernation, the following conditions must be met:

  • The workload should be schedulable on an autoscaling node group.
  • The workload should include the label wiseinf.com/reserved with a value of true. The system will skip workloads with this label during hibernation adjustments. Currently, only three workload types are supported: Deployment, DaemonSet, and CronJob.
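The label check in the second condition can be sketched as a simple filter. This is an illustration only: it assumes the label is read from the workload object's own metadata.labels, and the function name is hypothetical. It does not check the first condition (schedulability on an autoscaling node group).

```python
RESERVED_LABEL = "wiseinf.com/reserved"
SUPPORTED_KINDS = {"Deployment", "DaemonSet", "CronJob"}


def is_kept_running(workload: dict) -> bool:
    """Return True if hibernation adjustments should skip this workload."""
    if workload.get("kind") not in SUPPORTED_KINDS:
        return False
    labels = workload.get("metadata", {}).get("labels") or {}
    # Kubernetes label values are strings, so compare against "true".
    return labels.get(RESERVED_LABEL) == "true"


workloads = [
    {"kind": "Deployment",
     "metadata": {"name": "gateway", "labels": {RESERVED_LABEL: "true"}}},
    {"kind": "Deployment", "metadata": {"name": "batch-api", "labels": {}}},
    {"kind": "StatefulSet",
     "metadata": {"name": "db", "labels": {RESERVED_LABEL: "true"}}},
]
kept = [w["metadata"]["name"] for w in workloads if is_kept_running(w)]
```

Here only the hypothetical gateway Deployment is kept: batch-api lacks the label, and the StatefulSet is not one of the three supported workload types.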

Conclusion

Cluster Hibernation is an effective cost management strategy, especially for non-production environments. The Hibernation Savings Estimation Tool lets users quickly evaluate the potential resource savings for each node group when applying hibernation strategies. By working through the two common scenarios (non-autoscaling and autoscaling node groups), users can better assess their cost-saving opportunities and adjust node types and schedules to reduce resource costs.

About the Cluster Optimizer Platform

The Cluster Optimizer platform, developed by WiseInf, is a comprehensive cloud-native optimization solution designed to help organizations reduce costs and enhance operational efficiency. By analyzing cloud resources, application performance, user behavior, and cloud vendor data, it identifies cost-saving opportunities and delivers tailored recommendations. The platform also automates the optimization process, minimizing manual errors and streamlining operations to ensure greater efficiency.