Cloud costs are a real problem, even for the experts. And if even the Chief Evangelist for AWS can make mistakes with his cloud resources, why wouldn’t we? This is actually a perfect example of the #1 compelling event I’ve repeatedly run across when it comes to cloud cost management: BILL SHOCK! 😱
The Reality of Cloud Spend
Spending money on cloud is easy. Finding ways not to spend money is hard. The easiest way to not spend money is to not build anything, but that defeats the purpose of being innovative or being in business at all. But sometimes, it’s the only way to get control of rampant spending, or more likely overspending that was not budgeted for. In fact, I heard from the CTO of a public company (that I shall not name) that while they were a startup, their cloud costs ballooned to the point where he had to put a hard stop to his engineering team’s use of cloud resources and institute a process in which he manually approved every cloud resource request. Talk about budget austerity.
As a group of concerned citizens, we’ve collectively figured out that to do cost management properly, we need to solve for three use cases:
Cost visibility, or understanding where costs are coming from
Cost savings, or figuring out how to spend less
Cost forecasting, or predicting where costs will be in the future
Of course, there is some more nuance to each of these, but my point is that we’ve broken down the monolithic problem of cloud cost management into its component parts and we’re getting pretty darn good at solving them. There is a plethora of tools out there, and many organizations are building homegrown solutions to meet their specific needs.
At the same time, cloud cost management is a relatively new discipline and we’re still learning the best ways to do it. In addition to there being multiple models and methods of cloud cost management, we’re trying to find ways to become ever more efficient in solving for our core use cases. As all of these progress, we’ll collectively end up moving towards models of cost management that require minimal human intervention, shifting instead to mostly hands-off functions with human guardrails. Maybe one day, we’ll even have AI powerful enough to take over the whole thing! That’s when we’ll truly have continuous cost efficiency.
What is Continuous Cost Efficiency?
For simplicity (and because "cost" is implied at this point), I'll be referring to "continuous cost efficiency" simply as "continuous efficiency" in this article. It also rolls off the tongue more easily.
Continuous efficiency is the gold standard for teams who want to balance costs without slowing down innovation.
At face value, continuous efficiency in cloud cost management is exactly what it sounds like: always being efficient about cost. Many organizations today take the approach of doing point-in-time optimizations, meaning they have a steady cadence at which they review costs and do something about them. Most often, this is monthly or quarterly. We can call this “discrete” efficiency rather than continuous, to use the mathematical term. Simply put, discrete means “point in time” whereas continuous means “all the time.”
I’d argue that to achieve truly continuous efficiency, we need complete machine takeover of our cloud infrastructure, because humans simply can’t keep up with the sheer amount of information that needs to be processed to make efficiency decisions. It’s the reason that so many organizations today only undertake these efficiency exercises every few months.
To do even one point-in-time efficiency exercise, you have to do the following:
Take a snapshot of the infrastructure costs at a point in time
Get an understanding of where costs are coming from
Identify which of those costs are going to stick around because of usage or growth patterns
Determine if there are costs that can be eliminated
Confirm with engineering that cost elimination targets won’t break anything
Implement infrastructure changes to reduce costs
Plan for future capacity requirements in addition to existing ones
Negotiate with the cloud provider to get volume discounts
Put governance in place to ensure those discounted resources are used
And I’m sure there’s stuff I missed…
Do you see how this model disrupts normal innovation processes? Not only is time taken to understand costs, but teams have to spend time redoing parts of their infrastructure to be more cost-efficient. In this way, we are balancing costs, but at the cost of innovation.
Skipping the Line: Actionable Insights
Cloud cost management solutions attack each of the problems that come with these cost management exercises, but as one problem is solved, it only gives rise to another. For example, the first cost visibility problem was that it was difficult to visualize costs. After that, costs needed to be broken down by service or product, with the ability to see costs at a per-resource granularity. Now, the big visibility problem is that resources need to be associated with budget owners, teams, or developers, so that chargeback and showback can be done accurately to help keep teams accountable.
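To make the ownership problem concrete, here is a minimal sketch of showback by owner tag. The line items, their field names, and the `team` tag are all made up for illustration; they are not any cloud provider's actual billing export format.

```python
from collections import defaultdict

# Hypothetical billing line items; the field names are assumptions, not any
# provider's real export schema.
LINE_ITEMS = [
    {"resource": "vm-1", "cost": 120.0, "tags": {"team": "checkout"}},
    {"resource": "vm-2", "cost": 80.0, "tags": {"team": "search"}},
    {"resource": "db-1", "cost": 200.0, "tags": {"team": "checkout"}},
    {"resource": "bucket-9", "cost": 50.0, "tags": {}},  # no owner tag
]


def showback(line_items, tag_key="team", unallocated="UNALLOCATED"):
    """Sum cost per owner; untagged spend gets its own bucket, not hidden."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item["tags"].get(tag_key, unallocated)
        totals[owner] += item["cost"]
    return dict(totals)


if __name__ == "__main__":
    for owner, cost in sorted(showback(LINE_ITEMS).items()):
        print(f"{owner}: ${cost:.2f}")
```

The untagged bucket is the design choice worth noticing: if spend without an owner silently disappears from the report, nobody is accountable for it.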
You might notice a trend here: the solutions start out fairly generic and get more specific over time. It’s not just the view of costs, either. It’s the number of people that get involved as these needs get more complex. At a high level, finance teams needed to be able to visualize costs. Then, engineering budget owners got pulled in. Now, even developers are being pulled in, creating an exponential increase in the number of stakeholders required to achieve financial visibility. As both the number of people and the amount of data increase, the time it takes to understand costs also scales up. This in turn makes finding good insights difficult, and finding ways to act on them even harder.
As these problems came to the fore, cloud cost management solutions needed to simplify. Enter the term “actionable insights.”
With the evolution of cloud cost management, we see an increasing focus on reducing the time actual humans have to spend discovering issues so that they can spend time fixing them. Sometimes this is referred to as providing “actionable insights,” which the first generation of cloud cost management tools couldn’t do. Often, you had to visualize all of the information and generate your own insights. Today’s tools strive to adopt the mentality of providing these actionable insights so that humans aren’t spending time deciphering the data. Rather, they’re acting on it, and if they need to pore over the data, of course they can, because that capability is already inherent!
The most obvious example is the cost savings recommendation feature that essentially every cloud cost management tool provides at this point. Instead of having teams pore over the data and find their own savings, modern tools do the heavy lifting to help teams see the opportunities to downsize, rightsize, and purchase discount capacity. In addition, these recommendations often - and rightly should - include the data behind the recommendation, to show historical usage patterns.
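As a rough illustration of a recommendation that carries its own evidence, here is a sketch of a rightsizing check. The 40% threshold, the "halve the vCPUs" rule, and the function names are my assumptions for the example, not how any particular tool computes its recommendations.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def rightsizing_recommendation(cpu_samples, vcpus, threshold_pct=40.0):
    """Suggest halving vCPUs when p95 CPU utilization stays under the threshold.

    The returned dict includes the usage evidence next to the suggestion so
    that whoever validates it can see why it was made.
    """
    p95 = percentile(cpu_samples, 95)
    downsize = p95 < threshold_pct and vcpus >= 2
    return {
        "action": "downsize" if downsize else "keep",
        "current_vcpus": vcpus,
        "suggested_vcpus": vcpus // 2 if downsize else vcpus,
        "p95_cpu_pct": p95,
        "samples": len(cpu_samples),
    }


if __name__ == "__main__":
    # Made-up CPU utilization percentages for one instance.
    quiet = [10, 12, 15, 11, 30, 14, 13, 9, 16, 12]
    print(rightsizing_recommendation(quiet, vcpus=8))
```

Using a high percentile rather than the average is deliberate here: an instance that idles most of the day but spikes hard once should not be flagged for downsizing.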
Here’s where it gets interesting. You’d think that now that these actionable insights are at your fingertips, you could go and take action on the savings shown right away. In an ideal world, you’d be right - but in reality, it’s more complicated than that. To implement the savings you’ve found, they first need to be validated, and, you guessed it, they have to be validated by the same large group of people who had to be involved in finding and understanding all of the cost data in the first place.
That’s because these people are the cloud budget owners, and more importantly, they and their teams are the ones who understand how everything works and what impact infrastructure changes, such as downsizing AWS EC2 compute instances to save money, will actually have on the applications. This is critically important, yet at the same time it slows us down in our quest for continuous efficiency.
Between Here and There: The Problems
Let’s catch our breath for a second. We know actionable insights have come out of the need for simpler cost management processes, and that we’re still running into the same problem as before: no cost optimization opportunities matter if they don't make sense to implement from an engineering perspective.
Looks like we need to revise our process flow from earlier to add a step. Then, let’s explore how we can simplify this, too.
Let’s see what slows down the process from actionable insight, or any cost savings action plan for that matter, to actual implementation:
There are a variety of teams involved with all costs, and they all have different priorities. They all need to: have a mutual understanding of costs; agree on the goals they want to achieve with cost management; validate the potential savings and agree they’re real; create a plan as to how savings will be acted upon.
The engineering teams responsible for incurring costs often aren’t prioritizing costs. Engineering (usually app dev) prioritizes performance for their end users, and cost doesn’t naturally play into the picture. When they see cost savings recommendations, they need to validate that the recommendations won’t have implications on performance.
Making infrastructure changes always carries risk. The same way individual app dev engineering teams need to validate for performance, the teams that own the cloud infrastructure have to ensure the infrastructure changes won’t break things. Since the infrastructure supports multiple applications and heavily impacts performance and user experience, it’s critical that this validation happens.
Cloud resources change so often that it’s hard to get a handle on them fast enough to optimize continuously. This is a big reason cloud cost management is often a point-in-time exercise, designed to optimize at a higher level than the individual resource, even though that may be where the biggest savings lie. You have to take a snapshot of your infrastructure and optimize based on that, leaving potentially significant opportunities on the table. Guesses often have to be made with regard to what that snapshot will look like in the future, or even in the days or weeks it may take to resolve the savings recommendations.
All of these problems take time to work through, resulting in cost recommendations that are often stale by the time they can be implemented - useless! We’re observing that the crucially important validation we need to do actually reduces the savings that we can achieve. To add insult to injury, many of the cost savings recommendations offered aren’t considered real because they fail some form of validation, which frustrates the teams that are tasked with cloud cost management, and leads to a great question for cloud cost management solutions of how to solve this problem.
We Don’t Just Bring Problems, We Bring Solutions
I realize I just dropped a few problem bombs on you, but don’t worry - every problem has a solution. So, what’s the right way to go about solving these problems? Surely, there are tools out there and I’m about to pitch you one? Maybe, but not in the way you think.
More importantly, what does a solution look like? Let’s do some ideation.
Problem 1: Lots of Stakeholders
Right off the bat, we can consider removing stakeholders. But the people involved are already the barebones stakeholders, and they need to be consulted. Knowing they are required, there are two other possibilities that come to mind:
Streamline the process and workflow across stakeholders to increase the speed of decision-making
Involve stakeholders early and often, empowering them to take control of costs on their own since they’re the experts anyway
If we can simplify the ways in which the organization can involve stakeholders and make decisions, we’re one step closer to optimizing costs continuously.
Problem 2: Cost Isn’t an Engineering Priority
The most prevalent narrative I hear in cloud cost management is that today, engineering teams don’t have cost as a priority, but they are tasked with finding and managing costs on an ad hoc basis nonetheless. This creates frustration for engineers because it can be a trudge; the work strays from their core, which is shipping quality and performant code. And, it creates frustration for cloud cost management stakeholders because they aren’t realizing the savings they know are there. The culture just isn’t there.
One solution to this is to shift down cloud cost management, though this means giving engineering teams one more thing to worry about. If we want to do this right, there are three things to consider:
Make it easy for engineers to track down costs so they don’t have to dedicate hours or days to finding costs
Streamline the process for them to find, validate, and implement cost savings such that it minimizes the toil and time investment
Provide them automated ways to keep costs down
With engineers themselves armed to the teeth to manage costs, suddenly we’re optimizing at yet another level of the organization. At this point, we’ve made it easy for: the people who need to manage the costs; the budget owners and team leads that need to validate savings opportunities; and the engineers who need to ultimately implement them. We’re well on our way to continuous efficiency, but there’s more to consider.
Problem 3: Cloud Infrastructure Risk
We don’t want to mess things up in the pursuit of cost savings. Ultimately, a business doesn’t grow from saving money, it grows from happy paying customers. While we want to optimize our spend, it can’t be at the expense of application performance and user experience, which is why we ensure that cost savings recommendations that suggest changes to the infrastructure are well-vetted.
This can take a long time and invalidate many of these large cost savings recommendations, so how do we solve for that? Here are some ideas:
Suggest small changes instead of massive ones. This is a classic method of reducing risk, and it’s easier to trust and try. Who knows, doing a bunch of these might even keep recommendations from going stale, resulting in more realized savings. For example, a recommendation to change requests and limits on a bunch of individual Kubernetes resources and then optimizing the cluster or node is less risky than downsizing an entire cluster or compute instance fleet right off the bat.
Create “one and done” solutions. Rather than asking teams to constantly revisit for optimization opportunities, we can look for simpler solutions that they need to do once and will have them “covered,” so to speak, for a longer period of time. Some simple examples of this are setting resource quotas and implementing resource scheduling.
Automate things, safely. In the same way that we can empower engineers to optimize their own costs, what if they, and infrastructure teams, just had less to worry about? For example, what if we could take resource scheduling to the next level and automatically stop idle resources so they don’t incur costs? In this way, costs are optimized at a micro level and minimize cause for alarm for budget owners and cloud cost managers in the first place. It also reduces toil for infrastructure and engineering teams.
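To show what "small changes instead of massive ones" might look like for Kubernetes CPU requests, here is a hedged sketch. The 20% headroom and the 25% cap per adjustment are numbers I picked for the example, and the function only computes a value; it doesn't talk to a cluster.

```python
def next_cpu_request(current_m, observed_peak_m, headroom=1.2, max_step=0.25):
    """Return the next CPU request in millicores.

    Moves toward observed peak usage plus headroom, but never changes the
    request by more than max_step (25% by default) in a single adjustment.
    """
    target = observed_peak_m * headroom
    lower_bound = current_m * (1 - max_step)
    upper_bound = current_m * (1 + max_step)
    return round(min(max(target, lower_bound), upper_bound))


if __name__ == "__main__":
    # A container requesting 1000m whose observed peak is only 300m.
    request = 1000
    while True:
        proposed = next_cpu_request(request, observed_peak_m=300)
        if proposed == request:
            break
        print(f"{request}m -> {proposed}m")
        request = proposed
```

Each step is small enough to validate and roll back on its own, which is the whole point: several cautious adjustments are easier to trust than one big cut.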
There’s a theme that’s emerging here - reduce toil while getting more microscopic in our optimization efforts. This is quite far removed at this point from where cloud cost management tools started, which is just visualizing costs. At this point, it’s starting to look like continuous efficiency demands integration into the day-to-day of the infrastructure. Let’s come back to this later.
Problem 4: Cloud Resources Change Quickly
Perhaps the biggest challenge with achieving continuous efficiency is the rate at which cloud resources are spun up and terminated. This incidentally also contributes to why waste accumulates and governance is hard to do, but that’s a different topic.
If cloud cost optimization requires point-in-time, or snapshot, views, then constantly changing cloud resources mean that continuous efficiency requires analyzing and acting on repeated snapshots taken over ever-smaller intervals of time. That’s how we go from discrete to continuous - I’m basically describing calculus!
The entire concept is to get more microscopic, both in calculus, and in continuous efficiency. Rather than look at costs at points in time and find point optimizations, we want to keep costs optimized at all times.
As it turns out, that’s exactly what we’ve been working towards this whole time! By empowering all levels of the organization to both understand and optimize costs, we’ve gotten more microscopic and thus closer to continuous efficiency.
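To put a rough number on the discrete-versus-continuous idea, here is a toy calculation of how much an idle resource costs before anyone notices it. It assumes reviews happen on a fixed schedule and remove the waste immediately, which is generous; the $2/hour rate and the timings are invented.

```python
def waste_until_next_review(idle_start_h, hourly_cost, review_interval_h):
    """Cost of an idle resource from when it goes idle until the next review.

    Reviews are assumed to happen at hours 0, T, 2T, ... and to remove the
    waste immediately; both are simplifying assumptions.
    """
    elapsed = idle_start_h % review_interval_h
    hours_unnoticed = review_interval_h - elapsed
    return hours_unnoticed * hourly_cost


if __name__ == "__main__":
    # A $2/hour resource that goes idle 100 hours into the timeline.
    for label, interval in [("quarterly", 2160), ("monthly", 720),
                            ("daily", 24), ("hourly", 1)]:
        waste = waste_until_next_review(100, 2.0, interval)
        print(f"{label:>9}: ${waste:,.2f} wasted before the next review")
```

As the review interval shrinks, the waste per idle resource shrinks with it, which is the discrete-to-continuous argument in miniature.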
The solution to quickly changing cloud costs is to stay on top of all of the resources, which is really only possible in the way we’ve done it, by giving everyone the power. We’ve successfully created the mechanism for continuous efficiency, but we’re still missing one piece.
Killing Bill Shock with Automation
Even though we’ve empowered everyone to find and manage costs, it’s still asking a lot to stay on top of it all the time. After all, cost isn’t the top priority for everyone. What ends up happening is that people across the organization will do things sporadically and there will be less bill shock, and indeed fewer large savings opportunities will come up during point-in-time exercises, because we’re more efficient now. However, we can do better.
If we could automate parts of the new workload so that we’re not entirely reliant on sporadic checks, then suddenly we’re doing a lot better. There is basic automation such as alerting integration into workflow tools like Slack, Teams, or email that we can start with, and it’s a huge help! Instead of having to check in all the time, we can make it so people only have to worry about things when there's a need for a human to intervene.
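Here is a minimal sketch of the "only bother a human when needed" idea. It projects month-end spend with a naive straight line and returns a message only when that projection is over budget. The function name and parameters are hypothetical, and actually delivering the message to Slack, Teams, or email is left out.

```python
def budget_alert(spend_to_date, budget, day_of_month, days_in_month):
    """Return an alert message if projected month-end spend exceeds budget.

    The projection is a naive straight line from spend so far. Returns None
    when no human attention is needed.
    """
    projected = spend_to_date / day_of_month * days_in_month
    if projected <= budget:
        return None
    return (f"Projected spend ${projected:,.0f} exceeds budget "
            f"${budget:,.0f} (day {day_of_month} of {days_in_month})")


if __name__ == "__main__":
    message = budget_alert(6000, 10000, day_of_month=10, days_in_month=30)
    if message:
        print(message)  # hand this to whatever notification channel you use
```

A real setup would likely want a smarter projection than a straight line, but the shape stays the same: silence by default, a message only when someone needs to look.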
In that human intervention lies another opportunity, too. Alerts by their nature tell us when things are headed in the wrong direction and we need to look, but that only creates one type of cost efficiency. We can also automate the things that are going right and just be more cost-efficient about those too. If we could run our same workloads on cheaper cloud resources, why wouldn’t we, especially if it could be automated? And, can we automatically detect wastage of resources, for example, when they’re sitting idle, and stop them so we don’t pay for them?
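To sketch what automatically stopping idle resources could look like, here is a small example with a dry-run switch. The resource records, the two-hour idle window, and the `stop_fn` callback are all assumptions for illustration; in practice `stop_fn` would wrap whatever your cloud provider's SDK offers.

```python
from datetime import datetime, timedelta


def find_idle(resources, now, idle_after=timedelta(hours=2)):
    """Return names of running resources with no activity for idle_after."""
    return [
        r["name"]
        for r in resources
        if r["state"] == "running" and now - r["last_activity"] >= idle_after
    ]


def stop_idle(resources, now, stop_fn, dry_run=True):
    """Stop idle resources via stop_fn; in dry-run mode only report them."""
    idle = find_idle(resources, now)
    if not dry_run:
        for name in idle:
            stop_fn(name)
    return idle


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    fleet = [
        {"name": "dev-box", "state": "running",
         "last_activity": datetime(2024, 1, 1, 8, 0)},
        {"name": "api-1", "state": "running",
         "last_activity": datetime(2024, 1, 1, 11, 30)},
    ]
    print("Would stop:", stop_idle(fleet, now, stop_fn=print))
```

Defaulting to dry-run is the "safely" part: teams can watch what the automation would do for a while before letting it act.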
This is the kind of micro-efficiency we want to create. We’ve gone successfully from the top-down view of how we can get better discounts all the way down to the nitty-gritty of how we stop spending so much in the first place, starting with wastage. This, I would argue, is as far as we’ve gotten with continuous efficiency in cloud costs today, and it comes down to three things:
Providing deep visibility and data-backed recommendations for high-level optimizations based on historical usage volume and projected future usage
Empowering teams at all levels to get the same visibility and savings ability, but at a context that is relevant to individual owners
Creating automation that’s integrated into the day-to-day of the infrastructure to lessen toil, both for things that need attention, and for things that can be more efficient without human intervention
Required Capabilities for Continuous Efficiency
Now that we know what continuous efficiency is in an organization, let’s take a look at what it should look like in terms of a solution. Product features, if you will.
| What is the feature? | What does it let you do? | What do you get out of it? |
| --- | --- | --- |
| Application Context & Visibility | See resource usage by application or other context | Manage cloud costs more like managing performance |
| Cloud Event Correlation | Correlate config changes and deployments to cost changes | Know exactly what happened that changed your costs |
| Anomaly Detection | Identify anomalous cost patterns before they snowball | Avoid cost surprises on your monthly bill |
| Root Cost Analysis | Find exact change or resource that changed costs | Save time hunting for cost-changing resources |
| Hourly Time Granularity | View cost changes almost as they happen | Early and often insight into the impact of engineering changes |
| Cluster & Non-Cluster Costs | See all costs, whether for cloud or container | Granular view into resource usage across all infrastructure |
| Multi-Cloud Support | See costs no matter which combination of cloud | Consolidated views of spend across all cloud providers |
| Recommendations | Find savings opportunities with data behind it | Get the biggest savings in the least amount of time |
| Perspectives | Create cost contexts and make it relevant for consumers | Immediately-relevant cost information instead of digging |
| Alerts & Budgets | Set budgets and track progress towards it | Make sure you don’t blow through your budget too fast |
| Data API | Pull data out for use in other locations or software | Slice and dice data however makes the most sense for you |
| Forecasting | Predict where costs will be based on historical patterns | Do better capacity planning and set budgets more easily |
| Organizational Mapping | Map costs to the exact owner of the resources | Drive accountability with accurate chargeback and showback |
| Automated Waste Minimization | Automate making resource usage more efficient | Save money while you sleep on resources that are idle |
| What-If Analysis | Compare cost vs. performance for resource planning | Make more informed tradeoffs between cost and performance |
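To give a feel for one of these capabilities, anomaly detection, here is a deliberately simple sketch that flags days whose cost sits far above the trailing week's average. The z-score approach, the seven-day window, and the threshold of 3 are my assumptions for the example; real tools may well use something more sophisticated.

```python
from statistics import mean, stdev


def cost_anomalies(daily_costs, window=7, z_threshold=3.0):
    """Flag days whose cost is far above the trailing-window average.

    Returns a list of (day_index, cost) pairs. A day is flagged when it is
    more than z_threshold standard deviations above the mean of the
    previous `window` days. Perfectly flat history (zero deviation) is
    skipped rather than divided by.
    """
    flagged = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (daily_costs[i] - mu) / sigma > z_threshold:
            flagged.append((i, daily_costs[i]))
    return flagged


if __name__ == "__main__":
    # Made-up daily spend with one obvious spike.
    costs = [100, 102, 98, 101, 99, 103, 97, 100, 180, 101]
    print(cost_anomalies(costs))
```

Run against hourly data instead of daily data, the same check would catch a spike much earlier, which is where the hourly time granularity row comes in.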
Cloud Cost Savings with Cloud Cost Management Solutions
Remember when I said it’s looking like cloud cost management solutions need to be integrated into the day-to-day of the infrastructure? Here’s where it comes into play. Not only do solutions need to provide contextually relevant visibility at any given time, they need to understand the usage intimately enough to deliver realized savings instead of potential savings. With solutions that can integrate well, the output isn’t a recommendation, it’s cost savings.
Creating Continuous Efficiency with Harness
I wish I could jump right out and say that Harness Cloud Cost Management will solve all of your cloud cost management problems. While it’s not perfect, it’ll get you further than most other solutions. You have to recognize that continuous efficiency will never just be a tool thing - it’ll take organizational culture and people to get you there. Continuous efficiency is the gold standard for teams that want to balance costs without slowing down innovation.
To drive real continuous efficiency, you can’t rely on a single team or owner of all cloud costs to do everything - if you walk away from this article thinking it requires anything less than a full-court press across multiple teams, we’re not talking the same language.
While a tool like Harness Cloud Cost Management can provide the basis for discussion and simplify a lot of the lift required in cloud cost management, the right people still need to be brought onboard and trained on the expectations, with a tool that makes that easy.