“Hope is not a strategy.” This quote embodies the core philosophy of chaos engineering. You can’t just sit around and hope that your business never experiences a costly service disruption. It’s essential to act now and prepare for the worst by adding chaos engineering to your disaster recovery (DR) testing.
In the cloud-native world, companies need chaos engineering to prepare for increasingly common disruptive events, whose causes include natural disasters, cyber attacks, and unexpected technology failures. Whether you’ve already got a disaster plan in place or are just getting started, chaos engineering provides your team with an extra layer of proactive preparation.
Service disruptions happen in all industries. While these outages are relatively brief in the majority of cases, when they’re big, they can significantly impact an organization’s bottom line and reputation. Here are a couple of recent examples that resulted in some seriously undesirable headlines and business impact:
Avoiding massive unplanned outages means taking a proactive approach to disaster recovery right now. One of the first steps is to create a disaster recovery plan (DRP). The DRP implementation is an overarching exercise encompassing technology, people, and processes. Together, these result in a playbook to achieve efficient recovery.
An important part of creating an effective DRP involves engineering teams collaborating with business leaders to rank resources by criticality and list the potential failure points associated with each (also known as a business impact analysis). Then, you need to simulate the failures and verify the theoretical recovery paths, whether automated or manual.
At Harness, we follow the best practice of building a service map that lists the criticality of each component, its incident history, its code or binary components, and the underlying infrastructure components with their associated dependencies. In its simplest form, the service map needs to outline the tech stack, including databases, caches, message brokers, and dependencies. This enables the team to understand the architecture of the system and decide which chaos experiments to run.
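As a rough illustration, a service map in its simplest form could be kept as plain data and used to decide where to experiment first. The sketch below is not a Harness format: the `Component` fields, the criticality scale (1 is most critical), and the component names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """One entry in a hypothetical service map."""
    name: str
    kind: str                 # e.g. "service", "database", "cache", "message-broker"
    criticality: int          # assumed scale: 1 = most critical
    depends_on: List[str] = field(default_factory=list)
    past_incidents: int = 0   # incident history, as a simple count


def experiment_order(service_map: List[Component]) -> List[str]:
    """Most critical components first; ties go to the one with more past incidents."""
    ranked = sorted(service_map, key=lambda c: (c.criticality, -c.past_incidents))
    return [c.name for c in ranked]


def dependents_of(service_map: List[Component], target: str) -> List[str]:
    """Components that directly depend on `target`, i.e. what a failure there may affect."""
    return sorted(c.name for c in service_map if target in c.depends_on)


# Hypothetical example data.
SERVICE_MAP = [
    Component("checkout-api", "service", 1,
              ["orders-db", "session-cache", "events-broker"], past_incidents=2),
    Component("orders-db", "database", 1, past_incidents=4),
    Component("session-cache", "cache", 2, past_incidents=1),
    Component("events-broker", "message-broker", 3),
]

if __name__ == "__main__":
    print(experiment_order(SERVICE_MAP))
    print(dependents_of(SERVICE_MAP, "orders-db"))
```

Ranking by criticality and then by incident history is only one possible ordering; a team might equally weight components by the number of dependents.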
One of the benefits of chaos testing is gaining an accurate understanding of certain key metrics.
DR in the cloud-native world comprises both the traditional active-passive model and the now widely adopted active-active model, with application deployment topologies featuring cross-zone or cross-region replicas. The takeaways or metrics from chaos tests differ between these two DR models:
Chaos experimentation as part of a DRP is often conducted as gamedays (also called fire drills) with multiple stakeholders participating. The chaos scenarios run within these gamedays are expected to grow in blast radius as the recovery paths are solidified.
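The idea of widening the blast radius only as recovery paths prove themselves can be sketched as a small staged runner. This is an illustrative outline under stated assumptions, not how any particular tool schedules gamedays: `Stage`, `run_gameday`, the percentage unit, and the stub experiment are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    """One gameday scenario; blast radius is an assumed percentage of targeted replicas."""
    name: str
    blast_radius_pct: int


def run_gameday(stages: List[Stage],
                run_experiment: Callable[[Stage], bool]) -> List[str]:
    """Run stages from smallest to largest blast radius.

    `run_experiment` injects the failure and returns True if recovery was verified.
    The run stops at the first stage whose recovery fails, so the blast radius
    never grows past an unproven recovery path.
    """
    passed = []
    for stage in sorted(stages, key=lambda s: s.blast_radius_pct):
        if not run_experiment(stage):
            break
        passed.append(stage.name)
    return passed


# Hypothetical stages and a stub experiment that "recovers" only up to 50%.
STAGES = [
    Stage("full region", 100),
    Stage("single pod", 10),
    Stage("one zone", 50),
]


def stub_experiment(stage: Stage) -> bool:
    return stage.blast_radius_pct <= 50


if __name__ == "__main__":
    print(run_gameday(STAGES, stub_experiment))
```

In a real gameday, `run_experiment` would be replaced by whatever injects the fault and checks the recovery, whether automated or manual.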
Chaos engineering helps organizations minimize financial and reputational impact associated with unplanned downtime. It also enables developers to focus on software delivery rather than fire-fighting production incidents.
Chaos experiments go beyond traditional unit, integration, and system tests, and more closely represent what random failures look like in a real-world production environment. This realism provides insight into how systems behave, equipping teams to find the weak links in their applications and infrastructure and to proactively build the resilience that helps prevent costly downtime.
The Harness Chaos Engineering (CE) module helps engineering and reliability teams navigate the risks of unplanned downtime. By purposely creating failure scenarios (i.e., chaos), teams can identify system weaknesses and improve reliability.
Getting started with chaos engineering has never been so simple. If you are ready to see how your organization can adopt this practice and start improving reliability, request a Harness CE demo or start your SaaS trial today!
Enjoyed reading this blog post or have questions or feedback?
Share your thoughts by creating a new topic in the Harness community forum.