Cloud costs are a real problem, even for the experts. And if even the Chief Evangelist for AWS can make mistakes with his cloud resources, why wouldn’t we? This is actually a perfect example of the #1 compelling event I’ve repeatedly run across when it comes to cloud cost management: BILL SHOCK! 😱
Spending money on cloud is easy. Finding ways not to spend money is hard. The easiest way to not spend money is to not build anything, but that defeats the purpose of being innovative or being in business at all. But sometimes, it’s the only way to get control of rampant spending, or more likely overspending that was not budgeted for. In fact, I heard from the CTO of a public company (that I shall not name) that while they were a startup, their cloud costs ballooned to the point where he had to put a hard stop to his engineering team’s use of cloud resources, and he had to institute a process wherein he had to manually approve every cloud resource request. Talk about budget austerity.
As a group of concerned citizens, we’ve collectively figured out that to do cost management properly, we need to solve for three use cases:
Of course, there is more nuance to each of these, but my point is that we’ve broken down the monolithic problem of cloud cost management into its component parts and we’re getting pretty darn good at solving them. There is a plethora of tools out there, and many organizations are building homegrown solutions to meet their specific needs.
At the same time, cloud cost management is a relatively new discipline and we’re still learning the best ways to do it. In addition to there being multiple models and methods of cloud cost management, we’re trying to find ways to become ever more efficient in solving for our core use cases. As all of these progress, we’ll collectively end up moving towards models of cost management that require minimal human intervention, shifting instead to mostly hands-off functions with human guardrails. Maybe one day, we’ll even have AI powerful enough to take over the whole thing! That’s when we’ll truly have continuous cost efficiency.
For simplicity (and because "cost" is implied at this point), I'll be referring to "continuous cost efficiency" simply as "continuous efficiency" in this article. It also rolls off the tongue more easily.
Continuous efficiency is the gold standard for teams who want to balance costs without slowing down innovation.
At face value, continuous efficiency in cloud cost management is exactly what it sounds like: always being efficient about cost. Many organizations today take the approach of doing point-in-time optimizations, meaning they have a steady cadence at which they review costs and do something about them. Most often, this is monthly or quarterly. We can call this “discrete” efficiency rather than continuous, to use the mathematical term. Simply put, discrete means “point in time” whereas continuous means “all the time.”
I’d argue that to achieve truly continuous efficiency, we need complete machine takeover of our cloud infrastructure, because humans simply can’t keep up with the sheer amount of information that needs to be processed to make efficiency decisions. It’s the reason that so many organizations today only undertake these efficiency exercises every few months.
To do even one point-in-time efficiency exercise, you have to do the following:
Do you see how this model disrupts normal innovation processes? Not only is time taken to understand costs, but teams have to spend time redoing parts of their infrastructure to be more cost-efficient. In this way, we are balancing costs, but at the cost of innovation.
Cloud cost management solutions attack each of the problems that come with these cost management exercises, but as one problem is solved, it only gives rise to another. For example, the first cost visibility problem was that it was difficult to visualize costs. After that, costs needed to be broken down by service or product, with the ability to see costs at a per-resource granularity. Now, the big visibility problem is that resources need to be associated with budget owners, teams, or developers, so that chargeback and showback can be done accurately to help keep teams accountable.
You might notice a trend here: the solutions start out fairly generic and get more specific over time. It’s not just the view of costs, either. It’s the number of people who get involved as these needs get more complex. At first, only finance teams needed to be able to visualize costs. Then, engineering budget owners got pulled in. Now, even developers are being pulled in, sharply increasing the number of stakeholders required to achieve financial visibility. As both the number of people and the amount of data increase, the time it takes to understand costs also scales up. This in turn makes finding good insights difficult, and finding ways to act on them even harder.
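To make the showback idea concrete, here is a minimal sketch of rolling per-resource costs up to an owning team. The line items, field names, and the `team` tag key are all hypothetical; real billing exports look different and vary by provider.

```python
from collections import defaultdict

# Hypothetical billing line items; the field names are illustrative only.
line_items = [
    {"resource_id": "vm-1", "cost": 120.0, "tags": {"team": "payments"}},
    {"resource_id": "vm-2", "cost": 80.0, "tags": {"team": "search"}},
    {"resource_id": "db-1", "cost": 200.0, "tags": {"team": "payments"}},
    {"resource_id": "bucket-9", "cost": 15.5, "tags": {}},
]


def showback_by_team(items, tag_key="team", fallback="unallocated"):
    """Sum cost per owner tag; untagged resources land in a fallback bucket."""
    totals = defaultdict(float)
    for item in items:
        owner = item["tags"].get(tag_key, fallback)
        totals[owner] += item["cost"]
    return dict(totals)


report = showback_by_team(line_items)
# Print the most expensive owners first.
for team, cost in sorted(report.items(), key=lambda kv: -kv[1]):
    print(f"{team}: ${cost:.2f}")
```

The "unallocated" bucket is the interesting part: the bigger it is, the less accurate any chargeback or showback built on top of it will be.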
As these problems came to the fore, cloud cost management solutions needed to simplify. Enter the term “actionable insights.”
With the evolution of cloud cost management, we see an increasing focus on reducing the time actual humans have to spend discovering issues so that they can spend time fixing them. Sometimes this is referred to as providing “actionable insights,” which the first generation of cloud cost management tools couldn’t do. Often, you had to visualize all of the information and generate your own insights. Today’s tools strive to adopt the mentality of providing these actionable insights so that humans aren’t spending time deciphering the data. Rather, they’re acting on it, and if they need to pore over the data, of course they can, because that capability is already inherent!
The most obvious example is the cost savings recommendation feature that essentially every cloud cost management tool provides at this point. Instead of having teams pore over the data and find their own savings, modern tools do the heavy lifting to help teams see the opportunities to downsize, rightsize, and purchase discount capacity. In addition, these recommendations often - and rightly should - include the data behind the recommendation, to show historical usage patterns.
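As a rough illustration of what sits behind a rightsizing recommendation, here is a small sketch that picks the smallest instance size able to cover observed peak CPU plus some headroom, and returns the usage figure it was based on. The catalog, its prices, and the 30% headroom are made-up assumptions, and real tools weigh far more than CPU.

```python
import math

# Hypothetical catalog: size -> (vCPUs, hourly price in USD). Prices are made up.
CATALOG = {"small": (2, 0.10), "medium": (4, 0.20), "large": (8, 0.40)}


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def recommend_size(current, cpu_samples, headroom=0.30):
    """Suggest the smallest size that covers p95 CPU usage plus headroom."""
    current_vcpus, current_price = CATALOG[current]
    p95 = percentile(cpu_samples, 95)  # percent of the current instance
    needed_vcpus = current_vcpus * (p95 / 100) * (1 + headroom)
    # Walk sizes from smallest to largest and stop at the first that fits.
    for name, (vcpus, price) in sorted(CATALOG.items(), key=lambda kv: kv[1][0]):
        if vcpus >= needed_vcpus:
            return {
                "current": current,
                "recommended": name,
                "p95_cpu_percent": p95,  # the data behind the recommendation
                "hourly_savings": current_price - price,
            }
    # Nothing in the catalog is big enough, so stay where we are.
    return {"current": current, "recommended": current,
            "p95_cpu_percent": p95, "hourly_savings": 0.0}


# Hypothetical CPU utilization samples (percent) for a "large" instance.
samples = [12, 15, 10, 18, 22, 14, 16, 11, 13, 20]
rec = recommend_size("large", samples)
print(rec)
```

Note what the sketch cannot know: whether the application tolerates fewer cores, needs the memory, or has a seasonal peak outside the sample window.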
Here’s where it gets interesting. You’d think that now that these actionable insights are at your fingertips, you could go and take action on the savings shown right away. In an ideal world, you’d be right - but in reality, it’s more complicated than that. To implement the savings you’ve found, they first need to be validated, and, you guessed it, they have to be validated by the same large group of people who had to be involved in finding and understanding all of the cost data in the first place.
These people are the cloud budget owners, and more importantly, they and their teams are the ones who understand how everything works and what impact an infrastructure change, such as downsizing AWS EC2 compute instances to save money, will actually have on the applications. This validation is critically important, yet at the same time it slows us down in our quest for continuous efficiency.
Let’s catch our breaths for a second. We know actionable insights have come out of the need for simpler cost management processes, and that we’re still running into the same problem as before: no cost optimization opportunities matter if they don't make sense to implement from an engineering perspective.
Looks like we need to revise our process flow from earlier to add a step. Then, let’s explore how we can simplify this, too.
Let’s see what slows down the process from actionable insight, or any cost savings action plan for that matter, to actual implementation:
All of these problems take time to work through, resulting in cost recommendations that are often stale by the time they can be implemented - useless! We’re observing that the crucially important validation we need to do actually reduces the savings that we can achieve. To add insult to injury, many of the cost savings recommendations offered aren’t considered real because they fail some form of validation. That frustrates the teams tasked with cloud cost management, and it raises a great question for cloud cost management solutions: how do we solve this problem?
I realize I just dropped a few problem bombs on you, but don’t worry - every problem has a solution. So, what’s the right way to go about solving these problems? Surely, there are tools out there and I’m about to pitch you one? Maybe, but not in the way you think.
More importantly, what does a solution look like? Let’s do some ideation.
Right off the bat, we can consider removing stakeholders. But the people involved are already the barebones stakeholders, and they need to be consulted. Knowing they are required, there are two other possibilities that come to mind:
If we can simplify the ways in which the organization can involve stakeholders and make decisions, we’re one step closer to optimizing costs continuously.
The most prevalent narrative I hear in cloud cost management is that today, engineering teams don’t have cost as a priority, but they are tasked with finding and managing costs on an ad hoc basis nonetheless. This creates frustration for engineers because it can be a trudge; the work strays from their core, which is shipping quality and performant code. And, it creates frustration for cloud cost management stakeholders because they aren’t realizing the savings they know are there. The culture just isn’t there.
One solution to this is to shift down cloud cost management, though this means giving engineering teams one more thing to worry about. If we want to do this right, there are 3 things to consider.
With engineers themselves armed to the teeth to manage costs, suddenly we’re optimizing at yet another level of the organization. At this point, we’ve made it easy for: the people who need to manage the costs; the budget owners and team leads that need to validate savings opportunities; and the engineers who need to ultimately implement them. We’re well on our way to continuous efficiency, but there’s more to consider.
We don’t want to mess things up in the pursuit of cost savings. Ultimately, a business doesn’t grow from saving money, it grows from happy paying customers. While we want to optimize our spend, it can’t be at the expense of application performance and user experience, which is why we ensure that cost savings recommendations that suggest changes to the infrastructure are well-vetted.
This can take a long time and invalidate many of these large cost savings recommendations, so how do we solve for that? Here are some ideas:
There’s a theme that’s emerging here - reduce toil while getting more microscopic in our optimization efforts. This is quite far removed at this point from where cloud cost management tools started, which is just visualizing costs. At this point, it’s starting to look like continuous efficiency demands integration into the day-to-day of the infrastructure. Let’s come back to this later.
Perhaps the biggest challenge with achieving continuous efficiency is the rate at which cloud resources are spun up and terminated. This incidentally also contributes to why waste accumulates and governance is hard to do, but that’s a different topic.
If cloud cost optimization relies on point-in-time, or snapshot, views, then constantly changing cloud resources mean that continuous efficiency requires analyzing and acting on repeated snapshots taken over ever-shorter periods of time. That’s how we go from discrete to continuous - I’m basically describing calculus!
The entire concept is to get more microscopic, both in calculus, and in continuous efficiency. Rather than look at costs at points in time and find point optimizations, we want to keep costs optimized at all times.
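As a back-of-the-envelope illustration of that discrete-to-continuous idea, here is a toy model. It is an assumption for illustration, not a measured result: suppose waste is zero right after each optimization review and then drifts upward linearly at a rate r until the next review, with reviews every Δt days over a horizon of T days.

```latex
% Toy model (assumed): the waste rate grows linearly between reviews at t_k = k\,\Delta t
w(t) = r\,(t - t_k), \qquad t_k \le t < t_k + \Delta t

% Waste accumulated within a single review interval
\int_{0}^{\Delta t} r\,s \,\mathrm{d}s = \frac{r\,\Delta t^{2}}{2}

% Total waste over the horizon T, which contains T/\Delta t intervals
W(\Delta t) = \frac{T}{\Delta t}\cdot\frac{r\,\Delta t^{2}}{2} = \frac{r\,T\,\Delta t}{2}
\;\longrightarrow\; 0 \quad \text{as } \Delta t \to 0
```

Under this assumption, total waste is proportional to the review interval: going from a quarterly review (about 90 days) to a daily one would cut accumulated waste by a factor of about 90.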
As it turns out, that’s exactly what we’ve been working towards this whole time! By empowering all levels of the organization to both understand and optimize costs, we’ve gotten more microscopic and thus closer to continuous efficiency.
The solution to quickly changing cloud costs is to stay on top of all of the resources, which is really only possible in the way we’ve done it, by giving everyone the power. We’ve successfully created the mechanism for continuous efficiency, but we’re still missing one piece.
Even though we’ve empowered everyone to find and manage costs, it’s still asking a lot to stay on top of it all the time. After all, cost isn’t the top priority for everyone. What ends up happening is that people across the organization will do things sporadically and there will be less bill shock, and indeed fewer large savings opportunities will come up during point-in-time exercises, because we’re more efficient now. However, we can do better.
If we could automate parts of the new workload so that we’re not entirely reliant on sporadic checks, then suddenly we’re doing a lot better. There is basic automation such as alerting integration into workflow tools like Slack, Teams, or email that we can start with, and it’s a huge help! Instead of having to check in all the time, we can make it so people only have to worry about things when there's a need for a human to intervene.
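Here is a minimal sketch of that kind of alert, assuming a naive linear projection of month-end spend. The function name, team names, and numbers are hypothetical. The `{"text": ...}` shape matches what Slack incoming webhooks accept, but actually posting the message is left out.

```python
import json


def budget_alert(team, monthly_budget, spend_to_date, day_of_month, days_in_month):
    """Return a chat-message payload if projected spend exceeds budget, else None."""
    # Naive projection: assume the average daily spend so far holds all month.
    projected = spend_to_date / day_of_month * days_in_month
    if projected <= monthly_budget:
        return None  # on track, so nobody needs to be interrupted
    overrun = projected - monthly_budget
    text = (
        f"Budget alert for {team}: projected month-end spend "
        f"${projected:,.0f} exceeds the ${monthly_budget:,.0f} budget "
        f"by ${overrun:,.0f}."
    )
    return {"text": text}


alert = budget_alert("payments", 10_000, 6_000, day_of_month=12, days_in_month=30)
quiet = budget_alert("search", 10_000, 3_000, day_of_month=15, days_in_month=30)
if alert is not None:
    print(json.dumps(alert))
```

The point of returning `None` for the on-track team is the same as the point of alerting in general: silence is the default, and a human only hears about it when there is something to do.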
In that human intervention lies another opportunity, too. Alerts by their nature tell us when things are headed in the wrong direction and we need to look, but that only creates one type of cost efficiency. We can also automate the things that are going right and just be more cost-efficient about those too. If we could run our same workloads on cheaper cloud resources, why wouldn’t we, especially if it could be automated? And, can we automatically detect wastage of resources, for example, when they’re sitting idle, and stop them so we don’t pay for them?
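Idle detection can be sketched in a few lines. This assumes hourly CPU samples per resource and treats "idle" as every sample in the last 24 hours falling below 5%. Both thresholds, the data shape, and the resource names are illustrative assumptions, and the actual stop call is deliberately omitted.

```python
def find_idle(resources, cpu_threshold=5.0, min_idle_hours=24):
    """Return ids of resources whose most recent samples are all below threshold."""
    idle = []
    for res in resources:
        recent = res["hourly_cpu_percent"][-min_idle_hours:]
        # Require a full window so brand-new resources are not flagged.
        if len(recent) >= min_idle_hours and max(recent) < cpu_threshold:
            idle.append(res["id"])
    return idle


# Hypothetical fleet with hourly CPU utilization (percent).
fleet = [
    {"id": "dev-box", "hourly_cpu_percent": [1.0] * 24},
    {"id": "api", "hourly_cpu_percent": [40.0] * 24},
    {"id": "new-vm", "hourly_cpu_percent": [0.5] * 3},
]

idle_ids = find_idle(fleet)
for resource_id in idle_ids:
    # A real system would call the cloud provider's stop API here.
    print(f"would stop {resource_id}")
```

CPU alone is a crude signal. Network traffic, connections, or scheduled jobs would all be reasonable additions before anything is stopped automatically.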
This is the kind of micro-efficiency we want to create. We’ve gone successfully from the top-down view of how we can get better discounts all the way down to the nitty-gritty of how we stop spending so much in the first place, starting with wastage. This, I would argue, is as far as we’ve gotten with continuous efficiency in cloud costs, today.
Now that we know what continuous efficiency is in an organization, let’s take a look at what it should look like in terms of a solution. Product features, if you will.
| What is the feature? | What does it let you do? | What do you get out of it? |
| --- | --- | --- |
| Application Context & Visibility | See resource usage by application or other context | Manage cloud costs more like managing performance |
| Cloud Event Correlation | Correlate config changes and deployments to cost changes | Know exactly what happened that changed your costs |
| Anomaly Detection | Identify anomalous cost patterns before they snowball | Avoid cost surprises on your monthly bill |
| Root Cost Analysis | Find exact change or resource that changed costs | Save time hunting for cost-changing resources |
| Hourly Time Granularity | View cost changes almost as they happen | Early and often insight into the impact of engineering changes |
| Cluster & Non-Cluster Costs | See all costs, whether for cloud or container | Granular view into resource usage across all infrastructure |
| Multi-Cloud Support | See costs no matter which combination of cloud | Consolidated views of spend across all cloud providers |
| Recommendations | Find savings opportunities with data behind it | Get the biggest savings in the least amount of time |
| Perspectives | Create cost contexts and make it relevant for consumers | Immediately-relevant cost information instead of digging |
| Alerts & Budgets | Set budgets and track progress towards it | Make sure you don’t blow through your budget too fast |
| Data API | Pull data out for use in other locations or software | Slice and dice data however makes the most sense for you |
| Forecasting | Predict where costs will be based on historical patterns | Do better capacity planning and set budgets more easily |
| Organizational Mapping | Map costs to the exact owner of the resources | Drive accountability with accurate chargeback and showback |
| Automated Waste Minimization | Automate making resource usage more efficient | Save money while you sleep on resources that are idle |
| What-If Analysis | Compare cost vs. performance for resource planning | Make more informed tradeoffs between cost and performance |
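To give a feel for one of these features, here is a minimal sketch of cost anomaly detection using a trailing-window z-score. This is a simple textbook technique chosen for illustration, not a description of how any particular product does it. The window size, threshold, and daily costs are made up.

```python
from statistics import mean, stdev


def detect_anomalies(daily_costs, window=7, z_threshold=3.0):
    """Flag days whose cost is far above the trailing window's average."""
    anomalies = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        mu = mean(history)
        # Floor the spread at 1% of the mean so a flat history
        # does not cause a division by zero.
        sigma = max(stdev(history), 0.01 * mu)
        z = (daily_costs[i] - mu) / sigma
        if z > z_threshold:
            anomalies.append((i, daily_costs[i]))
    return anomalies


# Hypothetical daily spend in USD, with one spike on day index 8.
costs = [100, 102, 98, 101, 99, 103, 100, 101, 180, 102]
anomalies = detect_anomalies(costs)
for day, cost in anomalies:
    print(f"day {day}: ${cost} looks anomalous")
```

A spike inflates the window's spread for the next several days, which makes this approach less sensitive right after an anomaly. Hourly granularity and more robust statistics are natural next steps.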
Remember when I said it’s looking like cloud cost management solutions are needing to be integrated into the day-to-day of the infrastructure? Here’s where it comes into play. Not only do solutions need to provide contextually relevant visibility at any given time, they need to understand the usage intimately enough to deliver realized savings instead of potential savings. With solutions that can integrate well, the output isn’t a recommendation, it’s cost savings.
I wish I could jump right out and say that Harness Cloud Cost Management will solve all of your cloud cost management problems. While it’s not perfect, it’ll get you further than most other solutions. You have to recognize that continuous efficiency will never just be a tool thing - it’ll take organizational culture and people to get you there. Continuous efficiency is the gold standard for teams that want to balance costs without slowing down innovation.
To drive real continuous efficiency, you can’t rely on a single team or owner of all cloud costs to do everything - if you walk away from this article thinking it requires anything less than a full-court press across multiple teams, we’re not talking the same language.
While a tool like Harness Cloud Cost Management can provide the basis for discussion and simplify a lot of the lift required in cloud cost management, the right people still need to be brought on board and trained on the expectations, ideally with a tool that makes that easy.
Ready to do some further reading on cloud costs? Read our Cost Management Strategies for Kubernetes eBook.