Managing Reliability With SLOs and Error Budgets
In this article, we're taking a look at how SLOs and Error Budgets help manage reliability - because 100% availability is complex and costly.
As businesses adopt a digital-first approach, it is critical to build reliable services without sacrificing innovation speed. The “want it now” customer culture is the new normal, and it forces companies to set unrealistic shipping expectations. Both you and your customers must be on the same page about system performance. Making sure that you don’t overpromise and underdeliver on the targets is the first step toward propagating a culture of building products with an acceptable user experience at an affordable budget.
As per Site Reliability Engineering (SRE) principles (documented in the Google SRE book), reliability is closely tied to product success. Therefore, you must measure the reliability of your service using three service level metrics, which are discussed in detail in the following sections:
- Service Level Indicator (SLI)
- Service Level Objective (SLO)
- Service Level Agreement (SLA)
As developers ship new features and enhance existing ones, unknown factors might cause chaos in your system. However, this shouldn’t hamper your innovation quality and velocity. Enter error budgeting, which acts as an indicator to direct development efforts towards innovation or towards stabilization depending upon the amount of remaining error budget. Ben Treynor Sloss, Google’s VP of engineering who coined the term SRE, summarizes SRE as “What happens when you ask a software engineer to design an operations function.”
Let’s look at how each of these will help your organization race forward.
What Are SLOs and SLIs?
A Service Level Objective (SLO) is a reliability target that is set to define the behavior we expect out of a service. In other words, an SLO is a target measure of how reliable a service is expected to be. For example, it could help you determine the downtime, error rate, or service request response time that’s acceptable for your service. But to understand what an SLO is, one must know what an SLI is.
SLI, or Service Level Indicator, is a metric that provides insights into the health of a service. It is the core metric used to indicate if an SLO is met. SLOs play a crucial role in shaping reliability goals that SREs must meet. They help Site Reliability Engineers measure their success when accomplishing those goals by figuring out what and how to measure.
Furthermore, Service Level Agreements (SLAs) are legal agreements that explain the implications if the service fails to meet the promised targets. For example, when a service has too many outages, driving availability below the promised level, the service provider may be subject to paying fines or penalties.
As digital dominance increases, so does the expectation to build more resilient and reliable services. Customers have become accustomed to highly available applications that are constantly functional and consumable. As a result, balancing service reliability and feature delivery velocity becomes a challenge for companies.
100% reliability is an impossible objective that you might feel tempted to set. It’d simply mean that you choose not to make any changes in production, which is definitely not a wise business decision. Perfection isn’t the goal. Setting measurable and concrete reliability targets that allow for an appropriate rate of new feature delivery is what results in happy customers. Finding this balance is at the heart of offering compelling software experiences while keeping the organization viable.
Well-thought-out SLOs are realistically achievable reliability targets that reasonably approximate the customer experience. Defining an SLO is a collaborative process driven by the SRE team, with input from multiple stakeholders and teams. SLOs act as the principal decision-making driver, which lets you discover the right balance between velocity and reliability. Breaching the SLO can initiate activities that ultimately put pressure on engineering to stabilize the service before releasing new features.
An SLI is commonly expressed as the percentage of requests that meet a defined threshold:
SLI = (number of requests that meet the defined threshold / total number of requests) × 100
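To make the ratio above concrete, here is a minimal Python sketch. It assumes a latency-based SLI, where a request "meets the threshold" when it is served within a given number of milliseconds. The function names, the 300 ms threshold, and the sample latencies are illustrative assumptions, not values from any particular system.

```python
from typing import Sequence


def sli_percent(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Percentage of requests served at or under the latency threshold."""
    if not latencies_ms:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms) * 100


def meets_slo(sli: float, slo_target_percent: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target_percent


# Hypothetical sample: eight request latencies and a 300 ms threshold.
latencies = [120, 250, 310, 90, 180, 700, 220, 150]
sli = sli_percent(latencies, threshold_ms=300)
print(f"SLI: {sli:.1f}%  meets a 99% SLO: {meets_slo(sli, 99.0)}")
```

The same shape works for other indicators, such as error rate: only the definition of a "good" request changes.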
What Is an Error Budget?
An error budget is essentially an allowance for SLO violations that can accumulate over a certain timeframe for your service. It is the acceptable limit of unreliability before your customers are overly impacted. Failures are inevitable when you constantly change your systems. Therefore, normalizing failure as a part of the process helps teams balance innovation with the risk of SLA violation.
To improve the reliability and performance of your service, you must be capable of making important decisions, such as when and how much teams should prioritize new feature development work vs system stabilization efforts.
An error budget is a tool that helps teams take calculated risks and avoid obsessing over reliability. This tool helps the SRE and development teams to work in tandem, as well as control release velocity by making sure that SLOs are met. Plenty of error budget remaining indicates that developers can work on new features without significant risk. Once the error budget is exhausted, teams should cease deploying new features and focus on service quality and reliability. Keeping tabs on the error budget consumption helps you determine the appropriate deployment rate for each engineering team.
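Tracking consumption can be sketched in a few lines of Python. This example assumes a request-based budget, where the budget for a period is the number of requests allowed to fail under the SLO. The function name and the traffic figures are hypothetical.

```python
def error_budget_remaining(
    total_requests: int, failed_requests: int, slo_target: float
) -> float:
    """Fraction of the error budget still unspent for the period.

    slo_target is a fraction such as 0.999. A result of 1.0 means the
    budget is untouched, 0.0 means it is exhausted, and a negative value
    means the service has overspent it.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# Hypothetical period: 1,000,000 requests, 250 failures, 99.9% SLO.
remaining = error_budget_remaining(1_000_000, 250, 0.999)
print(f"Error budget remaining: {remaining:.0%}")
```

A value close to 1.0 suggests room for riskier releases, while a value near or below zero suggests shifting effort toward reliability.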
How Do SLOs and Error Budgets Help Manage Reliability?
The decisions become clear-cut when you don’t meet your SLO target and the remaining error budget falls below a threshold that you define. The most common responses are to stop feature launches until the service is within SLO again, or to work on reliability-related bugs. Reviewing your SLOs and error budgets periodically confirms that you’re meeting expectations and reliability requirements. An exhausted error budget can also act as an incident trigger that initiates a blameless postmortem (root cause analysis) on the impacted business services.
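One way to encode such a policy is a small decision function, sketched here in Python. The three action names and the 25% caution threshold are assumptions for illustration. A real error budget policy would be agreed on by the SRE and development teams.

```python
def release_decision(budget_remaining: float, caution_threshold: float = 0.25) -> str:
    """Map the remaining error budget (as a fraction) to a release action.

    The action names and the default threshold are hypothetical.
    """
    if budget_remaining <= 0:
        # Budget exhausted: stop feature launches until back within SLO.
        return "freeze"
    if budget_remaining < caution_threshold:
        # Budget running low: prioritize reliability-related bugs.
        return "stabilize"
    # Plenty of budget left: feature work can continue.
    return "ship"


for fraction in (0.80, 0.10, -0.05):
    print(f"{fraction:+.0%} remaining -> {release_decision(fraction)}")
```

Keeping the threshold as a parameter lets each team tune how early it starts shifting effort toward stabilization.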
Calculation of the error budget according to the Google SRE book’s appendix:
Error Budget = 1 – Availability SLO
For example, if the SLO is 99.9%, then to calculate the error budget:
Error Budget = 1 – 99.9% = 0.1%, which corresponds to roughly 10 minutes of SLO violation per week, or about 131 minutes per quarter
This 0.1% is the unavailability window. After exhausting the error budget, previously agreed-upon error budget policies help prevent any further customer impact. New releases are kept on hold while the team performs more testing. Having a healthy and mature SLO and error budget culture lets you refine how you measure and discuss the reliability requirements of your service.
Note: Scheduled maintenance windows should not count as SLO violations and therefore should not detract from the error budget.
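The conversion from an availability SLO to a time allowance can be checked with a short Python sketch. It assumes a 7-day week and a 91-day quarter. The function name is hypothetical, and scheduled maintenance is not modeled here.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Time-based error budget: the share of the window that may violate the SLO."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    error_budget_fraction = (100 - slo_percent) / 100
    return window * error_budget_fraction


# Assumed windows: a 7-day week and a 91-day quarter.
week = allowed_downtime(99.9, timedelta(days=7))
quarter = allowed_downtime(99.9, timedelta(days=91))
print(f"Per week: {week.total_seconds() / 60:.1f} minutes")
print(f"Per quarter: {quarter.total_seconds() / 60:.1f} minutes")
```

Tightening the SLO by one "nine" cuts the allowance by a factor of ten, which is why each additional nine is so much more expensive to deliver.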
In a distributed environment, offering 100% availability is technically complex and costly. Establishing SLOs and creating an error budget can be a long journey, but the results are well worth the investment. You will become equipped with the needed ammunition to detect potential customer-impacting issues before they become customer-facing. You must continually monitor key paths within your service that are frequently visited by your users. Aggregating this data helps define alerts and other actions in the event of a breach or near-breach.
Harness has created a solution to make it easier for newcomers to start using SLIs, SLOs, and Error Budgets. The solution also helps teams advance to a point of implementing SLO policies to automate guardrails within CI/CD pipelines. Learn more about Harness SRM, part of the Harness Software Delivery Platform, and request a demo today.