Introduction to Harness Cluster Orchestrator for Amazon EKS
Efficiently managing cluster infrastructure requires navigating a range of challenges, from balancing workload requirements and gaining visibility to allocating costs and optimizing clusters. The ability to intelligently scale cluster nodes based on real-world workload requirements and size them correctly is necessary to avoid unscheduled pods, as well as overprovisioned workloads and cluster nodes. This intelligence must also include savings from fully leveraging excess cloud capacity, the context of nodes covered by reservations, and more.
In this tutorial, we detail how the new Harness Cluster Orchestrator for Amazon Elastic Kubernetes Service (EKS) feature within the Harness Cloud Cost Management (CCM) module can empower engineering, DevOps, and CloudOps teams with all the intelligence needed to scale Amazon EKS cluster nodes driven by unique workload requirements. Additionally, by leveraging CCM’s distributed Spot orchestration capability, you can save up to 90% on cloud costs with Amazon EC2 Spot Instances.
CCM’s Cluster Orchestrator for EKS provides workload-driven intelligent autoscaling, which efficiently manages cluster infrastructure by automatically balancing workload requirements with cloud infrastructure and intelligently autoscaling AWS Spot instances to match those requirements. It also provides granular cost visibility, a convenient method of cost allocation for chargeback and showback, automated optimization of EKS clusters, and a simplified approach to workload and node resizing to avoid overprovisioning.
Harness is proud to be part of the AWS Service Ready Program for Amazon EC2 Spot instances. This designation recognizes that Harness CCM with Cluster Orchestrator implements best practices and API support to successfully manage Amazon EC2 Spot instances in customers' cloud environments. Joining the Amazon EC2 Spot Service Ready Program differentiates Harness as an AWS Partner Network (APN) member with a product that works with Amazon EC2 Spot instances and is generally available for and fully supports AWS customers, helping them benefit from Amazon EC2 Spot savings for their workloads.
Getting Started with Harness Cluster Orchestrator
Before we get started, if you’re not currently using Harness CCM, you can sign up for free. (Note: the Cluster Orchestrator is currently in beta behind a feature flag and available by request.)
Adding a New Cluster
The first step in enabling Cluster Orchestrator for your EKS cluster is to add the cluster to Harness CCM using a CCM Kubernetes Cluster Connector. Next, navigate to Cluster Orchestrator in the CCM module and click on the “Add New Cluster” button in the top right.
This will open up a wizard that allows you to select the CCM Kubernetes connector that you created previously. Next, you will be asked to download and apply a YAML file to the EKS cluster, which gives Harness the permissions required to provide you with granular visibility of your cluster.
All Clusters Overview
The overview screen of the Cluster Orchestrator gives you a high-level view of all the EKS clusters that you have added. This screen helps you manage all of these clusters with aggregate information on the overall cluster spend, the savings realized by leveraging AWS Spot instances, the total nodes across all clusters (including the split across On-Demand, Spot, and Fallback), and the cluster cost breakdown of idle, utilized, and unallocated costs. You will also see a table with each cluster listed below.
There you will see metadata associated with each cluster including the cluster name, cluster ID, region, node count, CPU/memory and total spend for that cluster. Each cluster row will also show the option to enable the orchestration that Cluster Orchestrator provides. Clusters that already have orchestration enabled will be shown as such. You can see this in the UI image below.
Clicking on each cluster will open up a breakdown with more details of that individual cluster. Here, you will see the total spend for that cluster for the current month, including details like cluster name, region, and identifier, along with the total nodes of that cluster, including the split of On-Demand, Spot, and Fallback. You will also see how many of the nodes are managed by EKS and how many are managed by Harness. For the pods, you will see the total pod count, including the split of On-Demand and Spot. You will also see how many of the pods are scheduled and how many are unscheduled. Finally, you will see the cost breakdown across idle, utilized, and unallocated costs for both CPU and memory for that cluster.
Clicking on the “Workloads” tab on the top right gives you a deeper view of the workloads in that cluster. You will see an aggregate of the total spend across all workloads, the total replica count across all workloads, and the Spot savings realized by running workloads on Spot instances. The table below will show you a list of all the workloads in the cluster, including each workload name, the namespace it belongs to, the total replica count, and the split of On-Demand and Spot for all the replicas of that workload. You will also see the distribution strategy, total cost of the workload, and recommendations to resize the workload for further cost optimization.
Spot instances are great for Spot-ready workloads, but critical workloads that you don't want to expose to any kind of Spot interruption risk need more advanced handling. Cluster Orchestrator’s Distribution Strategy for Spot orchestration provides just that. It is configured on the cluster with a custom resource called the workload distribution rule, which enables several cost-saving functionalities.
Split Workload Replicas
With the workload distribution rule, you can split replicas of a single workload to run on both Spot and On-Demand nodes to maximize savings while minimizing interruption risk. That means if you have a workload with four replicas for example, you can run one of those replicas on a Spot node and the other three on On-Demand nodes. This lets you leverage Spot savings even in critical or production workloads that you would otherwise only run on On-Demand nodes.
You can set a cost-optimized distribution strategy if you want to run all Spot replicas on the smallest number of Spot nodes. This maximizes Spot savings by running the fewest Spot nodes required, but is slightly more prone to Spot interruptions. Alternatively, you can set a least-interrupted distribution strategy that will, as far as possible, ensure all Spot replicas run on different Spot nodes. This significantly reduces the impact of a Spot interruption, making it much safer with respect to availability: if one Spot node is interrupted, the other Spot replicas keep running on the remaining nodes.
Base On-Demand Capacity
Similar to auto-scaling groups, you can configure a base on-demand capacity to ensure that a minimum number of On-Demand replicas are running; beyond those, only the Spot/On-Demand distribution ratio from the first option above will take effect.
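To make these options concrete, imagine them expressed together in a single workload distribution rule applied to the cluster. The sketch below is purely illustrative: the apiVersion, kind, and field names are assumptions for explanation, not the actual Harness custom resource schema.

```yaml
# Illustrative sketch only -- these field names are assumptions,
# not the actual Harness workload distribution rule schema.
apiVersion: example.harness.io/v1      # hypothetical API group
kind: WorkloadDistributionRule
metadata:
  name: checkout-service-distribution
spec:
  workloadSelector:
    namespace: production
    name: checkout-service
  baseOnDemandReplicas: 2              # minimum On-Demand replicas, as above
  spotPercentage: 25                   # e.g. 1 of 4 replicas on Spot
  strategy: least-interrupted          # or: cost-optimized
```

The point of a declarative rule like this is that the split, the base capacity, and the strategy live with the cluster configuration rather than in ad hoc scheduling decisions.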
Clicking on the “Nodes” tab provides you with an overview of all the nodes in that cluster. You will see a list of nodes running in that cluster including the node name, how many workloads are running on each node, the node instance type, the fulfillment (either Spot or On-Demand), CPU, memory, age of the node and the node’s current status.
Setting Up Orchestration
Setting up orchestration for your EKS cluster entails the following steps.
The first step to setting up orchestration for your EKS cluster is to download and apply a YAML file to the cluster. This provides Cluster Orchestrator the necessary permissions to perform the EKS cluster orchestration.
Once the YAML file is applied to the cluster and the permissions are verified, the next step is to set up Spot orchestration. Here, you can choose if you want to run all EKS workloads on Spot instances or only the Spot-ready workloads. In this case, Spot-ready refers to workloads with more than one replica.
You also have the ability to cherry-pick individual workloads that you would like to run on Spot instances. Once you have made this selection, you will see an estimate of how much you can expect to save using Spot instances against the On-Demand equivalent.
Next, you can configure the following cluster preferences:
Cluster Buffer
The cluster buffer adds headroom to the total capacity of the cluster, and can be configured separately for Spot and On-Demand nodes.
Reverse Fallback Retry
When Spot capacity is unavailable during a Spot interruption event, the Cluster Orchestrator will fall back to an On-Demand node temporarily to protect availability requirements. However, the solution then periodically checks for Spot capacity, so it can do a reverse fallback to Spot once capacity is available again. With the Reverse Fallback Retry option, you can configure the interval that the retry should be performed at.
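The retry behavior can be sketched in a few lines. This is a generic illustration of the idea, not Harness's implementation: the capacity check and the switch operation are stand-ins injected by the caller.

```python
import time

def reverse_fallback_retry(check_spot_capacity, switch_to_spot,
                           retry_interval_s=300, max_attempts=None,
                           sleep=time.sleep):
    """Poll for Spot capacity while running on an On-Demand fallback node.

    check_spot_capacity: callable returning True when Spot capacity exists.
    switch_to_spot: callable performing the reverse fallback to Spot.
    Returns the number of attempts taken, or None if max_attempts ran out.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        if check_spot_capacity():
            switch_to_spot()        # capacity is back: reverse the fallback
            return attempt
        sleep(retry_interval_s)     # wait for the configured retry interval
    return None
```

The configurable `retry_interval_s` plays the role of the Reverse Fallback Retry interval described above.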
Node Deletion Delay
By default, nodes with no pods are deleted from the cluster. With the node deletion delay option, you can configure a delay before this delete operation is carried out.
Bin-packing of Single Replica Workloads
Typically, bin-packing excludes single replica workloads to avoid disruptions while moving them from one node to another; however, with this option, you may configure the cluster orchestrator to include bin-packing of single replica workloads for a more aggressive bin-packing strategy.
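Bin-packing itself is a classic algorithm. The following minimal sketch uses the first-fit-decreasing heuristic on CPU requests to show the idea; real orchestration also weighs memory, affinity rules, and disruption budgets.

```python
def first_fit_decreasing(pod_cpus, node_capacity):
    """Pack pod CPU requests onto the fewest nodes (first-fit decreasing).

    pod_cpus: list of per-pod CPU requests.
    node_capacity: CPU capacity of each (identical) node.
    Returns (node_count, placements) where placements is a list of
    (cpu_request, node_index) pairs.
    """
    nodes = []        # remaining free capacity on each open node
    placements = []
    for cpu in sorted(pod_cpus, reverse=True):   # largest pods first
        for i, free in enumerate(nodes):
            if cpu <= free:                      # first node it fits on
                nodes[i] -= cpu
                placements.append((cpu, i))
                break
        else:
            nodes.append(node_capacity - cpu)    # open a new node
            placements.append((cpu, len(nodes) - 1))
    return len(nodes), placements
```

For example, six pods requesting [2, 2, 1, 1, 1, 1] vCPUs pack onto two 4-vCPU nodes instead of spreading across more.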
Finally, you can configure a restriction (minimum and maximum) on the total CPU resources for the EKS cluster.
Next, we configure the node preferences.
Node selection – by node instance families
This option lets you restrict which instance families the EKS cluster nodes may come from: when deciding which nodes to bring up, the Cluster Orchestrator will only use the instance families selected here.
Node selection – by node constraints
If you are not particular about the exact instance families you would like to include or exclude from your EKS cluster nodes, then you can configure node constraints with this option. This will simply restrict the nodes to these minimum and maximum constraints for CPU/memory, irrespective of node instance family.
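As a rough illustration, such constraints might be expressed like the following. The structure and field names here are assumptions made for explanation only, not the actual configuration schema.

```yaml
# Illustrative sketch only -- field names are assumptions
nodePreferences:
  constraints:            # applied irrespective of instance family
    cpu:
      min: 2              # vCPUs per node
      max: 16
    memoryGiB:
      min: 4
      max: 64
```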
Specifically for non-production EKS clusters, you can set up Scheduling or AutoStopping.
Entire Cluster Scheduling
With this option, you can set a fixed uptime or downtime schedule for the entire EKS cluster.
For example, you can shut down an entire cluster during non-working hours on weekdays, or weekends and company holidays.
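A quick back-of-the-envelope calculation shows why schedules pay off. Assuming a hypothetical $1,000/month always-on cluster kept up only on weekdays from 9:00 to 18:00:

```python
# Hypothetical figures for illustration -- not from a real cluster.
hours_per_week = 7 * 24          # 168 hours in a week
uptime_hours = 5 * 9             # weekdays only, 9:00-18:00 -> 45 hours
always_on_cost = 1000            # assumed monthly On-Demand cost in $

scheduled_cost = always_on_cost * uptime_hours / hours_per_week
savings_pct = 100 * (1 - uptime_hours / hours_per_week)
print(round(scheduled_cost), round(savings_pct))  # ~268 and ~73(%)
```

Roughly 73% of the month is downtime under such a schedule, so the bill shrinks proportionally.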
Specific Workloads AutoStopping
One of the most powerful features of Harness Cloud Cost Management is AutoStopping. For non-production EKS clusters, you can configure AutoStopping for individual cluster workloads. This allows for far greater savings than a schedule, as it dynamically scales down workloads that are idle beyond a pre-configured time. The feature also scales up the workloads in real time to service new incoming requests whenever required. This helps you save on costs for the most granular windows of idleness, without any difficulty in accessing workloads that are scaled down.
Benefits of Cluster Orchestrator for EKS
There are several benefits to using the Cluster Orchestrator for your EKS clusters, from up to 90% cost savings by leveraging the Spot orchestration capabilities, even for critical production workloads, to improved cluster efficiency with context-aware autoscaling of cluster nodes.
Built-in Spot Orchestration
The Harness Cluster Orchestrator has full Spot orchestration built in. At its core, this is what makes the distribution strategy described above possible. With this orchestration, you can run your workloads on Spot instances and achieve up to 90% savings over On-Demand instances without worrying about availability challenges resulting from Spot interruptions.
Alternate Spot Instance
Upon receiving a Spot interruption notice from AWS, the Cluster Orchestrator will automatically provision an alternate Spot instance. This alternate Spot instance will have available Spot capacity, enough resources to run the pods, and the least likelihood of future interruptions.
Spot to On-Demand Fallback
The Cluster Orchestrator temporarily switches from Spot to On-Demand when Spot capacity is unavailable for that instance. This maintains service availability and avoids running under capacity, though at a temporarily higher cost of On-Demand.
On-Demand to Spot Reverse Fallback
During the fallback from Spot to On-Demand, the Cluster Orchestrator periodically monitors for available Spot capacity. When it finds Spot capacity available, it does a reverse fallback to Spot. This operation is fully automated and requires no manual intervention.
Built-in Bin-packing
The Cluster Orchestrator has bin-packing capabilities built in. This ensures that the optimal number and choice of pods are packed onto cluster nodes to potentially free up and optimize cluster node resources. As moving such applications from one node to another may cause disruptions, you can choose to enable bin-packing for single replica applications that are otherwise excluded from bin-packing by default.
Integration with the Commitment Orchestrator
The Cluster Orchestrator has a first-class integration with another Harness feature called the Commitment Orchestrator, which orchestrates the purchase and utilization of Reserved Instances (RIs) and Savings Plans to maximize savings and to maximize compute spend coverage. This integration enables the Cluster Orchestrator to prioritize nodes covered by commitments over On-Demand nodes when Spot is either not applicable or not available.
Automatic Scaling Down of Idle Workloads
There are tremendous savings in using Spot instances and commitments, but nothing beats scaling down idle workloads to zero. The Cluster Orchestrator has first-class integration with Harness CCM’s AutoStopping feature, which automatically scales down idle workloads all the way to zero. This includes scaling down other dependent services, VMs, RDS databases, etc. AutoStopping will also automatically scale up the workload to service new incoming requests as required based on network traffic.
Example: 84% savings on a non-production cluster
Let’s take an example of a non-production EKS cluster to illustrate how staggering the savings can be with Harness CCM and the Cluster Orchestrator.
This sample non-production EKS cluster has 10 m4.xlarge nodes running in the US East (Ohio) on Linux.
EKS EC2 Nodes: 10
- CPU - 4
- Memory - 16 GiB
- Price - $144/mo
- m4.xlarge in US East (Ohio) on Linux
Pods per Node: 4
- CPU - 1
- Memory - 4 GiB
- Price - $36/mo
Without the Cluster Orchestrator: On-Demand cost, running all month: $1,440
With the Cluster Orchestrator & AutoStopping: On-Demand cost after dynamic idle time scale down: $533 ($907 or 63% savings)
- Avg. 70% idle time per month across pods
- Min. 1 Node at all times
Final Spot cost after dynamic idle time scale down: $231 ($1,209 or 84% savings)
- 81% Spot savings over On-Demand for m4.xlarge running in US East (Ohio) on Linux
- 70% Spot nodes, 30% On-Demand nodes
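The arithmetic in this example can be checked directly:

```python
# Recreate the arithmetic from the 84%-savings example above.
nodes, node_price = 10, 144          # m4.xlarge, US East (Ohio), $/month
on_demand = nodes * node_price       # $1,440 running all month

# AutoStopping: pods idle ~70% of the month, minimum 1 node always on
idle_fraction, min_nodes = 0.70, 1
with_autostopping = (min_nodes * node_price
                     + (nodes - min_nodes) * node_price * (1 - idle_fraction))
# 1 * 144 + 9 * 144 * 0.30 = 532.8 -> ~$533, ~63% savings

# Spot: 81% savings over On-Demand, with a 70%/30% Spot/On-Demand node split
spot_discount, spot_share = 0.81, 0.70
blended = spot_share * (1 - spot_discount) + (1 - spot_share)   # 0.433
with_spot = with_autostopping * blended
# 532.8 * 0.433 = 230.7 -> ~$231, ~84% total savings
print(round(with_autostopping), round(with_spot))
```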
Cluster Cost Visibility
The Cost Perspectives feature of Harness CCM provides granular visibility into your EKS cluster costs. This shows you a cost breakdown to an hourly level of granularity for a period of 14 days, as shown below. This can be invaluable to pinpoint cost spikes to the exact hour that they occurred.
You can also drill down into each workload for an even more granular view on CPU/memory utilization over time, as shown below.
Leveraging this deep, granular cost visibility along with our Cost Categories makes for a seamless experience for chargeback and showback. Cost Categories lets you define and track costs based on Cost Buckets, which can be used for various constructs such as teams, departments, business units, etc., as shown below. You can maintain these same cost bucket and cost category definitions for use across various Cost Perspectives for easy management at scale.
Harness provides recommendations to resize workloads. The recommended requests and limits configuration ensures that you satisfy the historical CPU/memory workload requirements while still being able to save significantly on cost. This comes with additional tuning options to match your real-world requirements, as shown below.
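As a rough illustration of how usage history can drive a recommendation, the sketch below takes a high percentile of observed CPU usage plus a safety buffer. The percentile and buffer values are assumptions for illustration, not CCM's actual algorithm.

```python
# Hypothetical sketch: derive a CPU request recommendation from history.
# The p90 percentile and 10% buffer are assumptions, not CCM's algorithm.
def recommend_request(samples_mcpu, percentile=0.90, buffer=1.10):
    """Recommend a CPU request (millicores) from historical usage samples."""
    ordered = sorted(samples_mcpu)
    idx = int(percentile * (len(ordered) - 1))   # nearest-rank percentile
    return round(ordered[idx] * buffer)          # add headroom on top

# Ten hypothetical usage samples in millicores, with one spike
usage = [120, 150, 140, 400, 160, 155, 130, 145, 150, 135]
print(recommend_request(usage))  # 176 -- far below a naive max of 400
```

A percentile-based recommendation ignores the one-off spike while still covering typical demand, which is the intuition behind request rightsizing.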
One of the most popular features of Harness CCM is anomaly detection. With this feature, you can be notified of cost spikes as they occur on a daily basis, so you can proactively take corrective action instead of waiting for your cloud bill or report at the end of the month. What makes this even more powerful is that you can configure alerts via email or Slack, so you don’t have to constantly check on a dashboard or report.
Manage Your Cluster and Cloud Costs with Harness
In this tutorial, you have gone through all the steps required to successfully enable Cluster Orchestrator for your Amazon EKS cluster. You have also seen the benefits of doing so, from efficiently managing your cluster to saving significantly on costs.
The Cluster Orchestrator for EKS comes with a powerful new distribution strategy for Spot orchestration, along with many other features to help you fully leverage up to 90% savings from Spot instances for all your cluster workloads. Additionally, you can leverage all of the other capabilities of Harness CCM to effectively manage and control your cluster and cloud spend.
Whatever your goals and objectives are for properly managing your cluster and cloud costs, and realizing significant savings on your spend, Harness CCM has you covered.
Ready to see the Cluster Orchestrator in action? Sign up for a personalized demo today!