The Must-Have Metrics Any DevOps and SRE Manager Should Measure
To understand what's happening inside each company, we need to measure the performance of DevOps and SRE teams.
DevOps and SREs have dashboards to monitor services and product performance, and now it’s time for us to understand how to measure these teams’ performance as well.
One significant challenge every company faces these days is measurement. There’s a need to understand what’s happening at every level inside the company and product, from how customers are using the application, to the quality and efficiency of the code, and down to the team’s performance.
When it comes to measuring DevOps and SRE teams, we’re faced with a whole new challenge. They’re in charge of the delivery funnel, and it’s their job to measure it and make sure it’s working as it should, from the developers who write the code, to the tools that test and deploy it, and down to the way the product behaves in the real world.
While DevOps and SREs measure performance, making sure every step in the application lifecycle is functional, we need to understand how to measure them in return. We told you it’s a challenge, but we also have some good news: it’s possible, as long as you focus on what’s important.
What Should We Measure?
DevOps and SREs have to stay on top of everything that’s happening inside the application. They need a real-time monitoring system that helps them see application uptime, load time, the number and success rate of API calls, CPU process threads, memory usage and other metrics.
And while it is their job to make sure everything is up and running, we want to make sure they’re doing it as expected. To do so, we also need to look across the entire application and workflow to find the answers and data about every parameter we’re interested in.
These usually include:
- We want to make sure that dev teams are delivering faster than before, by looking at the cycle time.
- We need to know that fast deployments are not hurting the quality of the code, which can be measured by the availability of the product.
- We want to monitor the product’s quality by looking at the rollback percentage.
- And of course, we want to make sure our users and customers are happy, which can be done by looking at the complaints they send, alongside the rollback percentage.
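To make two of these parameters concrete, here is a minimal Python sketch of how cycle time and rollback percentage might be calculated. The deployment records are made up for illustration, and we assume cycle time means the time from commit to deployment; your team may define it differently and pull the data from its own CI/CD tooling.

```python
from datetime import datetime
from statistics import median

# Hypothetical deployment records: (commit time, deploy time, rolled back?).
# In practice these would come from your own CI/CD tooling.
DEPLOYMENTS = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 15, 0), False),
    (datetime(2024, 1, 2, 10, 0), datetime(2024, 1, 3, 10, 0), True),
    (datetime(2024, 1, 4, 8, 0), datetime(2024, 1, 4, 10, 0), False),
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 21, 0), False),
]


def median_cycle_time_hours(records):
    """Median time from commit to deployment, in hours (assumed definition)."""
    hours = [
        (deployed - committed).total_seconds() / 3600
        for committed, deployed, _rolled_back in records
    ]
    return median(hours)


def rollback_percentage(records):
    """Share of deployments that had to be rolled back, as a percentage."""
    rolled_back = sum(1 for _c, _d, was_rolled_back in records if was_rolled_back)
    return 100 * rolled_back / len(records)


print(median_cycle_time_hours(DEPLOYMENTS))  # 9.0 hours for the sample data
print(rollback_percentage(DEPLOYMENTS))      # 25.0 percent for the sample data
```

The median is used rather than the mean so that a single slow deployment does not dominate the number.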
Each one of these parameters contains a world of metrics and calculations, and trying to monitor all of them is like trying to photograph an entire fireworks show on the Fourth of July: you can do it, but you’re missing the point.
To help us help ourselves, we need to narrow down what we’re looking at. And since our goal is to measure our own teams and operations, it’s easier to take a step back and have a broader look at everything. Now, let’s turn these analogies into practices.
Borrowing Google’s Focal Points
DevOps and SREs have to monitor a lot of different aspects of the application, but that doesn’t mean that we need to monitor every single one of them as well. Furthermore, it doesn’t mean we need a number of dashboards just to understand whether the team is doing their job or not.
To narrow this down, we can adopt Google’s approach to measuring its SRE teams. The company encapsulates all of the elements needed to monitor DevOps and SRE into three essential measurements, each with its own baseline:
Service-Level Objective (SLO)
At Google, a Service-Level Objective (SLO) is a target value, usually expressed as a percentage, for a measured level of service such as system availability. Meeting it is an indication that the system is running as it should and that the product is stable.
This number will help us understand the state and quality of our product, as well as ensure the quality of our code as we push deployments faster. Part of the responsibilities of DevOps and SRE teams is to maintain application reliability and functionality. This metric clearly represents how successful the team is in accomplishing that goal.
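To get a feel for what an availability target means in practice, the sketch below converts a target percentage into the downtime it allows, and checks a measured downtime against it. The 30-day window, the 99.9% target and the function names are our own assumptions for illustration, not values taken from Google.

```python
def allowed_downtime_minutes(slo_percent, window_days=30):
    """Downtime, in minutes, that an availability target tolerates per window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def measured_availability_percent(downtime_minutes, window_days=30):
    """Availability actually achieved, given the downtime observed in the window."""
    window_minutes = window_days * 24 * 60
    return 100 * (window_minutes - downtime_minutes) / window_minutes


def meets_slo(downtime_minutes, slo_percent, window_days=30):
    """True when the measured availability is at or above the target."""
    return measured_availability_percent(downtime_minutes, window_days) >= slo_percent


# A 99.9% target over 30 days leaves roughly 43 minutes of downtime.
print(round(allowed_downtime_minutes(99.9), 1))
print(meets_slo(downtime_minutes=30, slo_percent=99.9))  # within the target
print(meets_slo(downtime_minutes=60, slo_percent=99.9))  # target missed
```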
Service-Level Indicator (SLI)
A Service-Level Indicator (SLI) is a quantitative measure of one aspect of the service, such as request latency, the throughput of requests per second, or failures per request, as measured over time. It connects to the SLO that was determined, and helps evaluate whether the team is within its SLO.
With visibility into metrics like system availability, it’s easier to understand when errors and failures occur. The logical next step is seeing what causes those errors and failures. Tracking SLIs allows us to monitor how reliable our service is, and we can do so by looking at the same stats that the DevOps and SRE teams are measuring.
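As a rough illustration of how an SLI relates to its SLO, the sketch below derives two indicators from a handful of request records and compares one of them against a target. The record format, the 300 ms latency threshold and the 99.9% target are all hypothetical; real numbers would come from your monitoring system.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    latency_ms: float
    succeeded: bool


def availability_sli(records):
    """Fraction of requests that succeeded."""
    if not records:
        raise ValueError("no requests in the measurement window")
    return sum(r.succeeded for r in records) / len(records)


def latency_sli(records, threshold_ms):
    """Fraction of requests served within the latency threshold."""
    if not records:
        raise ValueError("no requests in the measurement window")
    return sum(r.latency_ms <= threshold_ms for r in records) / len(records)


def within_slo(sli_value, slo_target):
    """True when the indicator is at or above its target."""
    return sli_value >= slo_target


# Hypothetical sample: eight fast successes, one slow success, one failure.
SAMPLE = [RequestRecord(120, True)] * 8 + [
    RequestRecord(450, True),
    RequestRecord(900, False),
]

print(availability_sli(SAMPLE))                     # 0.9
print(latency_sli(SAMPLE, threshold_ms=300))        # 0.8
print(within_slo(availability_sli(SAMPLE), 0.999))  # False
```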
Service-Level Agreement (SLA)
A Service-Level Agreement (SLA) is an agreement between you and your users or customers that states the availability they can expect from your services and products. Unlike SLOs and SLIs, this is a looser metric that can change according to the service you provide, or the customer you’re providing it to.
The SLAs should be derived from the SLOs, since you want to make sure you have your own definition and understanding of the system availability before you make contracts and promises with your customers. Metrics-wise, monitoring the SLAs will help us understand whether DevOps and SREs are keeping up with the numbers they set for themselves.
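One way to picture the relationship between the two numbers is the sketch below, which classifies a measured availability against both. It assumes the SLA promised to customers is set no stricter than the internal SLO, a common practice but an assumption on our part, and the function name and percentages are illustrative only.

```python
def classify_availability(measured_percent, slo_percent, sla_percent):
    """Compare measured availability with the internal SLO and the customer SLA."""
    # Assumption: the customer-facing SLA is never stricter than the internal SLO.
    if sla_percent > slo_percent:
        raise ValueError("SLA should not be stricter than the SLO it derives from")
    if measured_percent >= slo_percent:
        return "within SLO"
    if measured_percent >= sla_percent:
        return "SLO missed, SLA met"
    return "SLA breached"


print(classify_availability(99.95, slo_percent=99.9, sla_percent=99.5))
print(classify_availability(99.70, slo_percent=99.9, sla_percent=99.5))
print(classify_availability(99.00, slo_percent=99.9, sla_percent=99.5))
```

The middle state is the useful one for a manager: the team has missed its own target, but there is still room to react before a promise to customers is broken.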
You want your product to be good, your customers to be happy and your company to succeed. But how will you know if you’re on the right track without attaching the correct numbers and metrics to it?
It’s a challenge to understand how to monitor DevOps and SREs, but you need to be able to measure everything that’s a part of your product – and these teams are a big part of it.
While each company has its own set of requirements, methods and team structures, focusing on monitoring SLOs, SLIs and SLAs will help you understand what metrics to focus on, and how your teams are performing.