This blog was co-authored by Nikunj Badjatya, Principal Software Engineer and Abhijeet Sharma, Senior Software Engineer at Harness.
At Harness, we leverage our own products extensively in our software development lifecycle, as would be expected from a company such as ours. Our engineering team is responsible for our CCM software architecture, and has direct visibility into our production costs by team and by module (among other views), and we review our cloud costs regularly to find areas for optimization for both our product costs and product performance.
It was during one of these reviews that we began to see our BigQuery analysis costs rising beyond what we were expecting from our rapidly growing installed base. Our CCM product dashboards and reporting are built on top of BigQuery, so each time our users view dashboards or run reports, it consumes on-demand compute capacity (or compute slots by the new pricing), alongside the monthly charges for storing the cost data itself.
What led to starting the project?
As the Harness CCM installed base has grown in both quantity and size of customers, we are seeing an increase in customers for whom we are ingesting GBs of cost data daily. As we began ingesting historical data for a major new customer, we were watching both our query and storage costs spike at an alarming rate. As you can see in the Perspective below, our BigQuery costs were up 73% over a 6 month period!
Customers are getting more mature in their FinOps practices, expanding their reporting among product teams, so we were also seeing an increase in usage of our Perspectives and BI Dashboards. Bigger, longer term dashboard views led to larger and more frequent calls to BigQuery, which was also impacting our costs. For example, one customer wanted to see 2 years of cost data in a single dashboard view. That’s a high volume of data!
The quantity of data was also beginning to impact the performance of our dashboards, as queries were taking longer and dashboard response times were slower, especially for customers with large datasets. Our customer experience is extremely important to us, so we knew we needed to take a hard look at our BigQuery structure to improve user experience, while also making a positive impact on our product’s bottom line.
As an aside, when we started our project to optimize our database structure, we also needed to assess our transition to the new BQ Editions pricing, and how it would impact us. When we first introduced Harness CCM, on-demand compute pricing was the only available option, but Google introduced BQ Editions earlier this year, and given the growth of our dataset, it made sense to assess making a shift.
This became an urgent change once Google announced a 25% increase in on-demand pricing, which would directly impact our bottom line. Our queries were growing in size of data retrieved, but we were looking relatively healthy for our slot usage as it’s billed in BQ Editions. So we made the switch, avoiding an immediate 25% cost increase!
What the Engineering Team Thought
As we brainstormed several different areas for improvement, we realized that we needed even more granular cost information by customer and use case, so we started by building a new BI dashboard to provide deeper insights into BQ costs by customer. This helped us segment BQ costs by customer size as well as activity levels, which allowed us to better characterize our cost drivers.
After reviewing the data, the team determined that we would get the best improvements for both cost and performance if we tackled the following challenges:
Flatten out nested labels to reduce UNNEST operations
Caching data in Looker with Persistent Data Tables
Review data ingestion frequency
Improve gatekeeping by license type
Flattening Out Nested Labels
We ingest customers labels from each of the cloud service providers our customers use, storing them in a BigQuery table as nested labels, as using nesting is a Google best practice to “denormalize data storage and increase query performance” while maintaining hierarchical relationships between labels.
During our investigations we discovered two things: first, that label queries were the single biggest contributor to our BigQuery costs; and second, that the way we had originally implemented our label storage was causing a high amount of UNNEST calls, increasing both response times and data transfer costs.
By flattening out how we store labels into a columnar format, we could eliminate the need for UNNEST calls which would improve both query performance and data transfer costs. This wasn’t a trivial change, but the effort yielded significant returns for us. We not only improved our immediate cost concerns, but ensured that this part of our database architecture was ready to handle the acceleration of customers adopting Harness CCM.
Looker Persistent Data Tables
Our customer facing Business Intelligence (BI) dashboards are powered by Looker, so that customers have a powerful BI tool at their fingertips for analyzing cloud costs. When we discovered that label queries were driving so much of our BQ costs, we realized that we could reduce a portion of that workload by caching data within Looker by using their persistent data tables (PDT).
We started by implementing them manually for our biggest customers in terms of dashboard usage to assess the impact on query costs and performance. It made a material impact on the frequency of BQ retrievals, so the next step is to automate the creation of PDTs when customers reach a certain threshold of dashboard usage within their account.
Review Data Ingestion Frequency
We always want to provide the most up to date information for our customers, so we pull cost data from the cloud providers multiple times per day to give our customers the most up to date costs. However, as part of this inquiry, we took a look at exactly how often customers themselves were looking at this data, whether through directly accessing the Harness console, or by sending automated, scheduled reports via email or Slack.
We realized that we had anticipated a use case that no one actually needed. Reports are generally sent weekly, or at most every day or two. Our own anomaly detection generally runs once per day well. So we were ingesting data (with the associated data transfer fees to BQ), more often than was needed / wanted, running up our own cloud bills unnecessarily.
We decided that once per day data ingestion from the cloud providers fit the actual use case our customers wanted. Implementing this change contributed a fair amount to our cost reduction efforts.
Improve Gatekeeping by License Type
Gatekeeping is easily one of the most impactful, but sometimes neglected, tasks for managing data, and as we reviewed our dashboards by customer usage, we realized that we were retrieving data into BigQuery that didn’t need to be.
For context, only our Enterprise license customers have access to our BI Dashboards feature for advanced analytics. Anyone using our Free tier, free trial, or engaged in a proof of value (POV) can’t access the BI Dashboards in our console. But we weren’t gatekeeping the data ingestion for our non-paid customers, and were ingesting data that wouldn’t be used.
Implementing a license check prior to ingesting data into BigQuery for Looker reduced the amount of data we were storing, which in turn also reduced our data ingestion fees for our existing customers. More importantly, it is also cost avoidance for the future, especially as our POV activity continues to grow.
Results Speak for Themselves!
After we implemented all of these BigQuery architectural changes, we were very pleased to share with the company that we had reduced our direct BigQuery costs by nearly 45%, while improving dashboard performance for our customers, especially our largest customers with large datasets. And don’t forget that we also had an indirect 25% savings by moving over to BQ Editions pricing.
These changes were critical for us as we continue to grow and scale our customer base. We’re now in a better position to keep our BQ costs in control, while giving our customers a better user experience.
This exercise involved numerous technical discussions within our internal teams, several different POCs to prove our theories and test changes, as well working closely with the Google BigQuery team to drive the correct implementation. Without a concentrated effort on the larger groups part, this change wouldn’t have been as successful as it is. We are very appreciative of the work that everyone involved put into this project.
This was just one example of BigQuery architecture changes that can be made to reduce cloud costs and improve performance. You can read about more cloud cost savings strategies we implemented a few years ago in this blog. If you want to learn more about Harness Cloud Cost Management, read about it here!
Editors Note: This is one of a new series of blogs that are focused on how Harness Cloud Cost Management users are “Architecting for Cost Savings”, highlighting their experiences where cloud architectural decisions and changes have positively impacted cloud costs. Other blog posts in the series: Architecting for Cost Savings at Tyler Technologies