Chaos engineering is the science of injecting faults into a system and verifying its steady state. In this article, we will not delve into the concepts of chaos engineering; instead, we will look at a more modern implementation approach called continuous resilience. To understand the basics of chaos engineering, refer to the article here. Traditionally, chaos engineering has been known for verifying the resilience of critical systems and services in production. More recently, it has been used to ensure resilience across the entire SDLC. Ensuring resilience at every stage of the SDLC is the most efficient way to deliver maximum availability of business-critical services to end users.
Chaos engineering as a concept is well understood by most audiences. Today's challenges in the chaos engineering space relate to implementing and scaling the practice across an organization, measuring the success of chaos experimentation efforts, and knowing what it takes, in timelines and effort, to reach that final milestone of resilience. The reason for these challenges is that chaos engineering has traditionally been taught as an exercise of carefully planning and running a set of experiments in production using the GameDay approach. The success of a GameDay rests in the hands of the few individuals responsible for designing and orchestrating its execution from time to time, and GameDays are not automated like other regular quality or performance tests.
In modern chaos engineering practice, developers and QA teams share the chaos experiment development. The tests are automated in all environments and are run by all personas: Developers, QA teams, and SREs. The focus on resilience is built into every gate of the SDLC, which leads us to the term Continuous Resilience. In the continuous resilience approach, we expect most chaos experiment runs to happen in the pipelines. However, continuous resilience is NOT just running chaos in pipelines; it is about automating the chaos experiment runs in all environments - Dev, QA, Pre-Prod, and Prod - though with varying degrees of rigor.
GameDays can still be used along with automated chaos engineering experimentation, especially in critical systems where resilience needs to be tested on a need basis. GameDays also provide a means to validate documentation, recovery procedures, and train engineers on incident response best practices.
When compared to the GameDay approach of chaos engineering, the continuous resilience approach is built around the following tenets:
Well-written chaos experiments are the building blocks of a successful chaos engineering practice. These experiments must be upgraded or modified continuously to keep up with software or infrastructure configuration changes. This is why chaos experimentation is called chaos “engineering.” The best practice is to manage the lifecycle of chaos code the same way as regular software code - the chaos experiment code is developed and maintained in a source code repository, tested for false positives and negatives, tested in the dev environment, and promoted for use in larger environments.
Chaos experiments should be easily tunable, importable, exportable, and shareable by multiple members of a team or by multiple teams of an organization. Chaos Hubs or centralized repositories of chaos experiments with the above features should be utilized for chaos experiment development and maintenance.
In this method, you actively involve QA team members and developers in chaos experiment development instead of limiting it to just the SREs, as in the traditional GameDay approach.
A rollout of chaos practice will usually be associated with business goals such as “% decrease in production incidents” or “% decrease in recovery times.” These metrics are largely tied to production systems and are handled by the Ops teams or SREs; for example, you cannot measure uptimes and recovery times in QA test beds or pipelines. In the continuous resilience approach, where all personas are involved in chaos experimentation, the metrics should be relevant both to Developers and QA teams working in pipelines and to SREs working in pre-production and production. For this need, two new metrics are suggested:
- Resilience Scores
- Resilience Coverage
Resilience Score: A chaos experiment may contain one or more faults injected against one or more services. The resilience score can be tied to a chaos experiment or to a service.
Resilience scores can be used in pipelines, QA systems, and production by all the personas.
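The article does not specify how a resilience score is calculated, so here is a minimal sketch under one plausible assumption: the score is the weighted fraction of steady-state probes that continue to pass while each fault runs. The `FaultRun` type and the weighting scheme are illustrative, not a real chaos tool's API.

```python
from dataclasses import dataclass

@dataclass
class FaultRun:
    """Result of one fault injected during a chaos experiment (illustrative type)."""
    name: str
    probes_passed: int   # steady-state probes that held while the fault ran
    probes_total: int
    weight: int = 1      # relative importance of this fault

def resilience_score(runs: list[FaultRun]) -> float:
    """Weighted percentage of steady-state probes that passed across all
    faults in an experiment (assumed formula, not a vendor-defined one)."""
    total_weight = sum(r.weight for r in runs)
    if total_weight == 0:
        return 0.0
    weighted = sum(r.weight * (r.probes_passed / r.probes_total) for r in runs)
    return round(100 * weighted / total_weight, 1)

runs = [
    FaultRun("pod-delete", probes_passed=3, probes_total=3, weight=2),
    FaultRun("network-latency", probes_passed=1, probes_total=2, weight=1),
]
print(resilience_score(runs))  # → 83.3
```

The same number works for any persona: a developer sees it per experiment in a pipeline run, while an SRE can aggregate it per service across experiments.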
Resilience Coverage: While the resilience score covers the actual resilience of a service or an experiment, resilience coverage shows how many more chaos experiments are needed to declare the entire system as checked for resilience. It is similar to code coverage in software testing. Together with resilience scores, resilience coverage gives complete control over the resilience measurement. Resilience coverage applies to a service or a system.
Because the number of possible chaos experiments that can be created against a service can be overwhelming, the resilience coverage metric should be used to scope a practical number of chaos experiments.
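Following the code-coverage analogy above, resilience coverage can be sketched as the fraction of identified experiments that have actually been developed and automated. The experiment names and the backlog below are hypothetical; the point is only the ratio.

```python
def resilience_coverage(implemented: set[str], identified: set[str]) -> float:
    """Percentage of the identified chaos experiments for a service or
    system that have been developed and automated (assumed formula,
    analogous to code coverage)."""
    if not identified:
        return 0.0
    covered = implemented & identified
    return round(100 * len(covered) / len(identified), 1)

# Hypothetical experiment backlog for a single service
identified = {"pod-delete", "node-drain", "network-latency",
              "disk-fill", "dns-error", "cpu-hog"}
implemented = {"pod-delete", "network-latency"}
print(resilience_coverage(implemented, identified))  # → 33.3
```

Tracking this ratio per service makes it clear how many more experiments are needed before the system can be declared checked for resilience.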
Automation of chaos experiments is critical to achieving continuous resilience. Resilience coverage always starts at a low number, typically below 5%, with experiments that are either very safe to run or that target critical services. In both cases, assurance of resilience is essential when new code is introduced via the pipelines. Hence, these experiments must be inserted into the regular pipelines that verify the sanity or quality of the rollout process. As chaos experiments are automated with every development cycle, resilience coverage increases in steps, eventually reaching 50% or more within a few months or quarters, depending on the development time the Dev and QA teams invest.
Suppose a pipeline reaches 80% resilience coverage with >80% resilience scores. In that case, the risk of a resilience issue or outage happening in production from known sources is mitigated to a large extent, leading to improved reliability metrics such as MTTF (Mean Time To Failure).
Pipeline Policies can be set up to mandate the insertion of chaos experiments into the pipeline.
Automating pipelines and measuring the coverage through policies mandates the development of chaos experiments by team members who know the product best, i.e., developers and QA team members. SREs can then use these experiments to enhance production-grade environments. This is the organic process in which an organization's chaos engineering practice matures from Level 1 to Level 4.
Chaos integration into the pipelines can be done in many ways. The following are some options:
Chaos experimentation is generally perceived as disruptive because it may cause unwanted delays if services are brought down unexpectedly. For example, someone bringing down all nodes of a Kubernetes cluster in a QA environment through a chaos experiment does not help in finding a weakness, but it can cause enormous interruptions to the QA process and its timelines. For this reason, chaos experimentation processes should be paired with the required guard rails, or security governance policies.
Examples of security governance around chaos experiments:
With security governance policies like those above, chaos experimentation becomes more practical at scale, and automation can increase resilience coverage.
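One common guard rail is a blast-radius allowlist: a team may only run approved fault types against approved targets. The sketch below is a hypothetical policy check, not Harness's governance model; team names, namespaces, and fault names are made up for illustration.

```python
# Hypothetical guard rail: a team may only run faults that appear on its
# approved list, and only against its approved namespaces.
ALLOWED = {
    "qa-team": {
        "namespaces": {"qa-checkout", "qa-payments"},
        "faults": {"pod-delete", "network-latency"},
    },
}

def is_permitted(team: str, namespace: str, fault: str) -> bool:
    """Deny by default: no governance entry means no chaos permission."""
    policy = ALLOWED.get(team)
    if policy is None:
        return False
    return namespace in policy["namespaces"] and fault in policy["faults"]

print(is_permitted("qa-team", "qa-checkout", "pod-delete"))    # → True
print(is_permitted("qa-team", "prod-checkout", "pod-delete"))  # → False
```

A deny-by-default check like this is what lets automation scale: teams can add experiments freely inside their sanctioned blast radius without risking the cluster-wide disruption described above.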
Harness Chaos Engineering is built with the building blocks needed to roll out the Continuous Resilience™ approach to chaos engineering. It comes with many out-of-the-box faults, security governance, chaos hubs, the ability to integrate with CD pipelines and Feature Flags, and more.
Enjoyed reading this blog post or have questions or feedback?
Share your thoughts by creating a new topic in the Harness community forum.