Harness Service Reliability Management (SRM)

Harness Service Reliability Management (SRM) lets engineering and reliability teams collaborate on defining SLIs, SLOs, and error budgets. It adds reliability guardrails to CI/CD pipelines that halt execution when error budgets are depleted, which supports both production reliability and compliance. The aim is to reduce production incidents by identifying and addressing reliability issues throughout the software delivery lifecycle.

Harness SRM is a solution for engineering AND reliability teams. Within SRM, teams collaborate to define SLIs, SLOs, and Error Budgets. SRM users also create reliability guardrails within their CI/CD pipelines. These reliability guardrails determine whether or not pipelines are allowed to proceed to the next stage. SLO and Error Budget data is used to drive the behavior of the reliability guardrails. If SLOs are violated too often, Error Budgets become depleted, which causes the reliability guardrails to stop pipeline execution. Once pipeline execution is stopped, explicit approval must be provided for pipelines to proceed. This is all tracked in the SRM audit log for compliance purposes.
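To make the guardrail behavior concrete, here is a minimal sketch in Python. It is not the Harness API: the `Guardrail` class, its `min_budget_remaining` threshold, and the in-memory `audit_log` list are all hypothetical names used only to illustrate the flow described above (stop when the budget is depleted, proceed only with explicit approval, record every decision).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Guardrail:
    # Hypothetical threshold: fraction of the error budget that must remain
    # for a pipeline to proceed without approval.
    min_budget_remaining: float = 0.0
    # Stand-in for an audit log; a real system would persist this.
    audit_log: List[str] = field(default_factory=list)

    def evaluate(self, budget_remaining: float,
                 approved_by: Optional[str] = None) -> bool:
        """Return True if the pipeline may proceed to the next stage."""
        if budget_remaining > self.min_budget_remaining:
            self.audit_log.append(
                f"proceed: budget_remaining={budget_remaining:.2f}")
            return True
        if approved_by:
            # Budget is depleted, but someone explicitly approved the run.
            self.audit_log.append(
                f"proceed: override approved by {approved_by}")
            return True
        self.audit_log.append(
            f"stopped: budget_remaining={budget_remaining:.2f}")
        return False


guardrail = Guardrail()
healthy = guardrail.evaluate(0.40)                        # budget left
blocked = guardrail.evaluate(0.0)                         # budget depleted
overridden = guardrail.evaluate(0.0, approved_by="alice") # explicit approval
```

Every call appends one entry to `audit_log`, so the log alone is enough to reconstruct why a pipeline ran or stopped.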

[Image: Harness Service Reliability Management]

To promote better production reliability, service reliability checks are performed across all stages of the software delivery lifecycle. Some of these reliability checks, like native error tracking from Harness, require an agent to be added to the application service. All other reliability checks are performed via integrations to external tools (APM, log analytics, testing, etc.). The goal of these checks is to identify as many reliability issues as possible before production. If done properly, production reliability will continually improve.

Key Capabilities

SLO Management

Harness SRM allows users to define, measure, and track SLIs, SLOs, and error budgets. It offers a collaborative workspace for engineering and reliability teams to define and view these key metrics. Say goodbye to those old-fashioned silos.
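For readers new to these terms, the arithmetic behind an SLI, an SLO, and an error budget can be sketched as follows. This assumes a simple ratio-based SLI (good events divided by total events); the function name `error_budget_status` and the sample numbers are illustrative and do not come from Harness.

```python
def error_budget_status(slo_target: float, total_events: int,
                        bad_events: int) -> dict:
    """Summarize SLI and error budget for a ratio-based SLO (illustrative)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    # SLI: the fraction of events that were good.
    sli = (total_events - bad_events) / total_events
    # Error budget: how many bad events the SLO tolerates in this window.
    allowed_bad = (1.0 - slo_target) * total_events
    # Fraction of the budget still unspent; negative means overspent.
    remaining = 1.0 - bad_events / allowed_bad

    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_bad_events": allowed_bad,
        "budget_remaining": remaining,
    }


# A 99.9% SLO over one million requests tolerates about 1,000 failures.
# With 250 failures, roughly three quarters of the budget remains.
status = error_budget_status(slo_target=0.999,
                             total_events=1_000_000,
                             bad_events=250)
overspent = error_budget_status(slo_target=0.999,
                                total_events=1_000_000,
                                bad_events=2_000)
```

A negative `budget_remaining`, as in the `overspent` case, is the condition a reliability guardrail would treat as a depleted budget.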

[Image: SRM - SLOs]

Change Impact Analysis

Change is the greatest factor in reliability issues. It’s estimated that 80% or more of production incidents are self-inflicted, caused by changes to the infrastructure or applications. Harness SRM brings together the worlds of reliability management and change detection. Harness correlates deployment and certain types of infrastructure changes with the SLO and Error Budget metrics over time. This shows how each change impacts the reliability of production application services. When the reliability team wants to know what changed, the information is available at their fingertips.
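One simple way to picture this correlation is to compare an SLI before and after each change event. The sketch below is an assumption about how such a comparison could work, not a description of the analysis Harness performs; `change_impact`, the 30-minute window, and the sample data are all hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional, Tuple

Sample = Tuple[datetime, float]  # (timestamp, SLI value)


def change_impact(change_time: datetime, samples: List[Sample],
                  window: timedelta = timedelta(minutes=30)) -> Optional[float]:
    """Mean SLI after a change minus mean SLI before it.

    Returns None when either side of the window has no samples.
    """
    before = [v for t, v in samples
              if change_time - window <= t < change_time]
    after = [v for t, v in samples
             if change_time <= t < change_time + window]
    if not before or not after:
        return None
    return mean(after) - mean(before)


deploy_time = datetime(2024, 1, 1, 12, 0)
samples = [
    (deploy_time - timedelta(minutes=20), 0.999),
    (deploy_time - timedelta(minutes=10), 0.999),
    (deploy_time, 0.990),
    (deploy_time + timedelta(minutes=10), 0.990),
]

# A negative delta means the SLI dropped after the deployment.
delta = change_impact(deploy_time, samples)
no_data = change_impact(deploy_time + timedelta(hours=5), samples)
```

A before/after mean is deliberately crude. It ignores overlapping changes and normal variance, so treat it as a starting point for asking "what changed?" rather than proof of cause.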

Service Reliability Checks

The best way to ensure highly reliable production services is to perform checks throughout the development lifecycle to find and fix any issues immediately. This is a much faster way to deal with reliability issues than the alternative, which is to stop working on new features until reliability has been restored. Harness initiates these service reliability checks via steps and stages in your CI/CD pipelines. Check early, check often, deploy with confidence.

Deployment Reliability Governance

Every organization needs its service reliability processes to scale with the business. If they don't, the organization risks inconsistent quality and reliability across application services, leading to customer satisfaction issues. Harness SRM has built-in governance using OPA (Open Policy Agent), which provides the flexibility to define policies as needed across your organization. You can define which service reliability checks run at which stages and what constitutes a pass or fail from those checks, and you can change these policies as required.
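OPA policies are written in Rego, and this post does not show the policies Harness uses. As a language-neutral illustration of the idea (which checks are required at which stage, and what counts as a failure), here is a small Python analogue. The `POLICY` table, the stage names, and the check names are all hypothetical.

```python
from typing import Dict, List

# Hypothetical policy: the reliability checks each stage requires.
POLICY: Dict[str, set] = {
    "staging": {"error_tracking", "load_test"},
    "production": {"error_tracking", "load_test", "slo_burn"},
}


def violations(stage: str, results: Dict[str, bool]) -> List[str]:
    """List policy violations for a stage.

    `results` maps a check name to True (passed) or False (failed).
    A required check that is absent from `results` counts as missing.
    """
    required = POLICY[stage]
    missing = sorted(c for c in required if c not in results)
    failed = sorted(c for c in required if results.get(c) is False)
    return ([f"missing check: {c}" for c in missing]
            + [f"failed check: {c}" for c in failed])


# The load test failed and the SLO burn check never ran, so both
# show up as violations for the production stage.
found = violations("production",
                   {"error_tracking": True, "load_test": False})
clean = violations("staging",
                   {"error_tracking": True, "load_test": True})
```

Keeping the policy as data, separate from the evaluation function, mirrors the point of policy-as-code: the rules can change without touching the pipeline logic that enforces them.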

Enterprise-Grade Audit Trails & RBAC

Harness has built a reputation in the CI/CD industry for having incredibly detailed audit trails and fine-grained RBAC. These audit trails make it quick and easy for engineering teams to pass audits, often turning what would be days of effort into just a few hours. Our fine-grained RBAC model means that you can implement a permissions system that meets the needs of your organization - no matter how complex.

Ecosystem Integrations

Harness SRM integrates with many popular observability, APM, and logging solutions. Harness applies AI and ML techniques to this data to automatically figure out if there are reliability issues that need to be addressed.

SRM Demo

We’ve created this 7-minute demo video to provide a glimpse into how Harness SRM works.

Click here to view the web page for Harness SRM, which includes an explainer video and helpful illustrations.

Check out how Harness compares with other tools: Harness SRM vs. Blameless

Jim Hirschauer

Jim Hirschauer is an IT geek who understands people and business.
