Software and system development are exercises in innovation and solving the unknown. Software and systems are fallible because they are made by humans (and most likely, multiple humans) with varying opinions and skills. Technology is getting more distributed and complex, especially with the push to microservices. Rarely will one person have total end-to-end knowledge of the entire system.
Similar to the fog of war, the military term for degraded situational awareness, understanding the total impact of changes in modern development can be difficult (the fog of development). Coupled with user expectations for systems to be available at all times, testing system robustness and resiliency for unknowns can be just that: an unknown.
Chaos engineering helps address the unknown by injecting failures throughout the application and infrastructure stack and then allowing engineers to validate behaviors and make adjustments so the failures don’t manifest themselves to the users. Coupled with the rise of site reliability engineering (SRE) practices, chaos engineering tries to calculate the impact of the improbable.
A popular piece of reading for site reliability engineers is Nassim Nicholas Taleb’s The Black Swan: The Impact of the Highly Improbable (2007), which introduces the black swan metaphor. Taleb would classify as black swan events a sudden natural disaster or, in business at the time his book was published, Google’s astounding and unprecedented success. A black swan event has three characteristics: it’s unpredictable, its impact is massive, and when it’s over, we devise an explanation that makes the black swan seem less random.
When dealing with the fog of development, we are prone to the fallacies of distributed computing, a set of false assumptions commonly attributed to computer scientist L. Peter Deutsch and others at Sun Microsystems. Some of the top fallacies are: the network is reliable, latency is zero, bandwidth is infinite, and there is only one administrator. Distilled down, the fallacies assume your services will always be consistent and available. As we know, systems and services come up and down all the time - but when getting into the minutiae of developing the unknown, we can easily forget this.
For example, let’s say we are building some features that rely on Amazon S3 for object storage. If we are building features for a service that does complex processing and the final output is writing or updating an object in S3, we as engineers might assume that S3 will be there. We test our features up and down and provide less sophisticated test coverage to the S3 portion. Amazon Web Services had a black swan event of its own in 2017 when S3 suffered an outage. Something that we assumed would be there (even with a lowered performance/write SLA) was not, and the fallacies of distributed computing came back to bite us.
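The lesson generalizes: writes to a dependency we assume is always there deserve defensive handling. Below is a minimal sketch of that idea, assuming a boto3-style `put_object` callable (the function name, keyword arguments, and retry counts here are illustrative, not a prescribed implementation): retry the object-store write with exponential backoff instead of trusting that S3 is up.

```python
import random
import time


def put_with_retry(put_object, bucket, key, body,
                   retries=4, base_delay=0.5):
    """Attempt an object-store write, backing off and retrying on failure.

    `put_object` is whatever client call you use (e.g. a boto3 S3
    client's put_object); it is injected so the behavior can be tested
    without a real bucket.
    """
    for attempt in range(retries):
        try:
            return put_object(Bucket=bucket, Key=key, Body=body)
        except ConnectionError:
            if attempt == retries - 1:
                raise  # storage really is down; surface it to the caller
            # Exponential backoff with jitter so retries don't stampede.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

A real system would also want a fallback path (queue the write, degrade the feature), but even this much turns "S3 is always there" from an assumption into a tested behavior.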
The S3 outage helped shine a light on making sure we touch all parts of our stack, even the parts that don’t seem obvious, perhaps due to our perception/fog around the fallacies of distributed computing. Chaos engineering, and chaos experiments, bring controlled chaos so we can shake these types of events out.
Chaos engineering is the science of intentionally injecting failure into systems to gauge resiliency. Like any scientific method, chaos engineering focuses on experiments and hypotheses, then compares the results to a control (a steady state). The quintessential chaos engineering example in a distributed system is taking down random services to see how the rest of the system responds and what detriment to the user journey manifests.
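That quintessential experiment can be sketched in a few lines. This is a hedged illustration, not any particular tool's API: the `kill` and `check_steady_state` callables are hypothetical hooks (they might wrap a `kubectl delete pod` and a health-check probe in practice).

```python
import random


def run_kill_experiment(services, kill, check_steady_state):
    """Pick a random service, take it down, and record whether the
    system's steady state held.

    `kill` and `check_steady_state` are injected so the experiment
    stays decoupled from any one platform.
    """
    target = random.choice(services)
    kill(target)
    healthy = check_steady_state()
    return {"target": target, "steady_state_held": healthy}
```

The interesting output is not whether the target died (it did, we killed it) but whether the steady state held anyway, which is exactly the resiliency question the experiment exists to answer.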
If you take a cross-section of what an application needs to run (compute, storage, networking, and application infrastructure), injecting a fault or turbulent conditions into any part of that stack is a valid chaos engineering experiment. Network saturation, or storage that suddenly becomes volatile or hits capacity, are known failure modes in the technology industry, but chaos engineering allows for much more controlled testing of these failures, and more. Because of the wide swath of infrastructure that can be impacted, users and practitioners of chaos engineering can be almost anyone supporting the application/infrastructure stack.
There can be multiple stakeholders in chaos engineering experiments because of the wide swath of technology and decisions that chaos engineering touches. The wider the blast radius (what is impacted by the tests and experiments), the more stakeholders will be involved.
Depending on which domain of the application stack is being tested (compute, networking, storage, or application infrastructure) and where the targeted infrastructure lives, stakeholders from those teams could get involved.
If the blast radius is small and can be tested in a running container, the application development team can test without fear of breaking out of the container. If the workload or infrastructure has a wider blast radius (for example, testing Kubernetes infrastructure), platform engineering teams will most likely get involved. Providing coverage for the unknown is the core reason for running chaos tests and looking for weaknesses.
The fog of development is quite real, especially with larger distributed systems, complex systems, and microservice implementations. From an application perspective, each microservice might be tested in isolation and determined to be working as designed. Normal monitoring techniques could deem an individual service healthy.
With microservice patterns, a single request can traverse several services for an aggregate response to fulfill what the user or other services requested. Every remote request between the services is traversing additional infrastructure and crossing different application boundaries, all open to failure.
If a trivial or non-trivial service or piece of infrastructure does not respond within its service level agreement (SLA), how are the system’s capabilities and the user journey impacted? This is the exact question chaos engineering answers. The results of chaos engineering experiments are then acted on to create a more resilient system.
The Principles of Chaos Engineering is an excellent manifesto that describes the main goals and principles of chaos engineering. It further breaks the discipline down into four practices that mirror the scientific method. Unlike the scientific method, though, the hypothesis assumes the system is stable, and the experiments look for variance. The harder the steady state is to disrupt, the more confidence you can have in the system’s robustness.
Knowing what normal/steady state looks like is critical to detecting deviation/regression. Depending on what you are testing for, a good metric such as response time, or a higher-level goal such as the ability to complete the user journey within a certain time, is a good measure of normalcy. The steady state in an experiment is the control group.
Going against the grain of the scientific method, chaos engineering assumes its hypothesis, the steady state, is true, then tries to disprove it. It is designed to be run against robust and steady systems, surfacing faults such as application failures or infrastructure failures. Running chaos engineering against unsteady systems does not provide much value, since those systems are already known to be unreliable and unstable.
Like any science experiment, chaos engineering introduces variables into the experiment to see how the system responds. These experiments represent real-world failure scenarios impacting one or more of the four pillars of an application: compute, networking, storage, and application infrastructure. A failure, for example, could be a hardware failure or network interruption.
If the hypothesis is a steady state, any variance or disruption from that steady state (differences between the control and experiment groups) disproves the hypothesis of stability. With a specific area to focus on, fixes or design changes can be made to build a more robust and stable system.
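The control-versus-experiment comparison can be made concrete with a simple tolerance check. This is a minimal sketch under assumptions: p95 response time is just one possible steady-state metric, and the 10% tolerance is an arbitrary illustration, not a recommended threshold.

```python
def hypothesis_holds(control_p95_ms, experiment_p95_ms, tolerance=0.10):
    """Compare an experiment-group metric against the control (steady state).

    The hypothesis 'the system stays steady under this fault' is
    disproved when the metric deviates beyond the tolerance.
    """
    deviation = abs(experiment_p95_ms - control_p95_ms) / control_p95_ms
    return deviation <= tolerance
```

A disproved hypothesis is the valuable outcome here: it hands the team a concrete deviation to chase down rather than a vague sense that "something is slow."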
The principles of chaos engineering lead to a few design considerations and best practices when implementing chaos engineering experiments.
There are three pillars when implementing chaos engineering, or any tests for that matter. The first is providing adequate coverage, the second is making sure experiments are run often and, where possible, run in (or closely mimic) production, and the third is minimizing the blast radius.
In software, you will never achieve 100% test coverage. Building coverage takes time, and accounting for every particular scenario is a pipe dream. Coverage work should prioritize what is most impactful to test. In chaos engineering, that means testing for failures that would have a grave impact, like storage becoming unavailable, or failures that occur frequently, like network saturation or network failures.
Software, systems, and infrastructure do change, and the condition/health of each can change pretty rapidly. A good place to run an experiment is in your CI/CD pipeline. CI/CD pipelines are executed when a change is being made. There is no better time to measure the potential impact of a change than when it is starting its confidence-building journey in a pipeline.
As scary as the thought of testing in production is, production is the environment your users are in, and its traffic spikes and load are real. To fully test the robustness/resilience of a production system, running chaos engineering experiments in a production environment will provide needed insights.
Because you can’t bring down production in the name of science, limiting the blast radius of chaos engineering experiments is a responsible practice. Focus on small, tightly scoped experiments that will tell you exactly what you want to identify, for example, injecting network latency between two specific services. Game Day planning can help calculate the blast radius and decide what to focus on.
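One way to keep the blast radius small is to inject the fault into only a fraction of traffic on a single dependency. The sketch below is an illustration of that idea, not any chaos tool's API; the wrapper, delay, and 5% fraction are all assumed values for the example.

```python
import random
import time


def with_injected_latency(call, delay_s=0.1, fraction=0.05):
    """Wrap a service call so only a small fraction of requests see
    added latency: one dependency, a few percent of traffic, rather
    than the whole system.
    """
    def wrapped(*args, **kwargs):
        if random.random() < fraction:
            time.sleep(delay_s)  # the injected fault
        return call(*args, **kwargs)
    return wrapped
```

Scoping the fault this way means a bad outcome degrades a slice of requests instead of taking down production, which is exactly what blast-radius planning is trying to guarantee.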
With these best practices in mind, it’s worth noting that chaos engineering is a discipline distinct from load testing.
Certainly, load can bring on chaos of its own. We commonly design our systems to be elastic in multiple places (spinning up additional compute, networking, persistence, and/or application nodes to cope with the load). But that assumes everything comes up at the same/appropriate time so we can get ahead of the load.
In the computer science world, the Thundering Herd problem is not new, but it manifests more commonly as we move towards more distributed architectures. At the machine level, for example, a Thundering Herd could occur when a large number of processes are kicked off and another process becomes the bottleneck (able to handle only one of the new processes at a time). In a distributed architecture, a Thundering Herd might be that your messaging system can ingest a large number of messages/events at a time, but processing/persisting those messages becomes the bottleneck. If you are overrun with messages, hello Thundering Herd.
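A common mitigation for the herd is to bound concurrency at the bottleneck so a burst queues up instead of stampeding. A minimal sketch, assuming a hypothetical message handler and an arbitrary slot count:

```python
import threading


class BoundedProcessor:
    """Cap how many messages are processed concurrently so a burst of
    events waits for a slot instead of overwhelming the downstream
    bottleneck (the database, the persistence layer, etc.).
    """

    def __init__(self, max_concurrent=4):
        self._slots = threading.Semaphore(max_concurrent)

    def process(self, handle, message):
        with self._slots:  # blocks when all slots are taken
            return handle(message)
```

A chaos experiment against this design would deliberately flood the processor (or remove a consumer) to confirm the queue absorbs the herd rather than the bottleneck collapsing.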
A load test would certainly help us prepare for a Thundering Herd as one type of stress, but what if part of the system was not even there, or was late to the game? That’s where chaos engineering comes in. A very hard item to test for would be a cascading failure without chaos engineering. Historically more equated with the power grid, a cascading failure is a failure of one part that can trigger failures in other parts. In distributed system land, this is us trying to find a single point of failure and making sure our application/infrastructure is robust enough to handle failures.
There has been a lot of advancement and tooling around chaos engineering. You can find great technical resources on the Awesome Chaos Engineering list, and we’ve created a detailed look at the top chaos engineering tools. This list pays homage to the chaos testing tools that built chaos engineering, and to new platforms making chaos engineering easier to consume.
Harness recognized the need to enable developers to proactively address unplanned downtime. We introduced Harness Chaos Engineering as a solution for engineering AND reliability teams to proactively improve reliability. Harness Chaos Engineering enables DevOps and site reliability engineering (SRE) teams to collaborate and run chaos tests to identify reliability issues in their deployments. The capabilities provided by our chaos engineering tool give developers continuous reliability validation in their pipeline instead of a single snapshot of reliability during infrequent, manual GameDay tests.
As newer ways of looking at building confidence in your systems start to gain traction, your CI/CD pipeline is a great spot to be orchestrating confidence-building steps. Chaos engineering experiments are great to be run in your CI/CD pipeline.
The art of the possible is to either have the results of a chaos experiment influence the deployment, or if deploying to lower environments, have Harness act as an orchestrator for experiments and other automated tests.
The Harness software delivery platform is a robust platform purpose-built for orchestrating confidence-building steps. As in any experiment, a pillar of chaos engineering is having a baseline. Imagine you are new to a team, such as an SRE team, that has coverage for dozens of applications you have not written yourself. Running chaos tests for the first time would require either isolating an existing deployment or spinning up a new distribution of an application and its associated infrastructure to experiment without production-impacting repercussions.
If your applications are not deployed through a robust pipeline, creating another segregated deployment could be as painful as the normal ebbs and flows of deploying the application normally. Moving along the chaos engineering maturity journey, as chaos tests are viewed as mandatory coverage, integrating them into a Harness Workflow for the judgment call or failure strategy is simple by convention.
Want to learn more about chaos engineering tools? Read our article today!
Enjoyed reading this blog post or have questions or feedback?
Share your thoughts by creating a new topic in the Harness community forum.