Outages in distributed systems are inevitable, making resilience testing essential in the SDLC. It must be continuous, covering failures, load, and disasters. Delayed validation creates “resilience debt,” increasing risk. A holistic approach—combining chaos, load, and DR testing—plus cross-team collaboration and AI-driven insights improves reliability and reduces impact.
Modern software delivery has dramatically accelerated. AI-assisted development, automated CI/CD pipelines, and cloud-native architectures have made it possible for teams to deploy software dozens of times per day.
But speed alone does not guarantee reliability.
At Conf42 Site Reliability Engineering (SRE) 2026, Uma Mukkara, Head of Resilience Testing at Harness and co-creator of LitmusChaos, delivered a clear message: outages are inevitable. In modern distributed systems, assuming your design will always work is not just optimistic—it’s risky.
In fact, as Uma put it, failure in distributed systems is a mathematical certainty.
That’s why resilience testing must become a core, continuous practice in the Software Development Life Cycle (SDLC).
The Reality of Inevitable Outages
Even the most reliable cloud providers experience outages.
Uma illustrated this with examples that highlight how unpredictable failures can be:
- Physical disruption such as drone strikes affecting AWS Middle East data centers
- Policy or configuration errors that triggered cascading outages on cloud platforms like Azure
- Retry storms and load spikes where services collapse under unexpected demand
These incidents demonstrate an important reality: the types of failures constantly evolve.
A system validated during design may not be resilient against tomorrow’s failure scenarios. Architecture may stay the same, but the failure patterns surrounding it continuously change.
This is why resilience cannot rely on assumptions.
Hope is not a strategy—verification is.
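The retry-storm failure mode above is often self-inflicted: clients that retry in lockstep can amplify a partial failure into a full outage. A common mitigation (a general pattern, not something prescribed in the talk) is capped exponential backoff with full jitter. A minimal sketch:

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=10.0):
    """Retry `operation` with capped exponential backoff and full jitter.

    Full jitter spreads each retry uniformly over [0, backoff], so clients
    that failed at the same instant do not all retry at the same instant --
    which is what turns a transient blip into a retry storm.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; surface the failure
            backoff = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))
```

A chaos experiment that injects latency or errors into a dependency is exactly how you verify that your retry policy behaves this way under load, rather than assuming it does.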
For a deeper look at this broader approach to resilience, see how chaos engineering, load testing, and disaster recovery testing work together.
What Resilience Really Means
Resilience is often misunderstood as simply keeping systems online.
But uptime alone does not make a system resilient.
Uma defines resilience more precisely:
Resilience is the grace with which systems handle failure and return to an active state.
In practice, a resilient system must handle three categories of disruption:
1. System Failures
Pod crashes, node failures, infrastructure disruptions, or network faults.
2. Load Conditions
Traffic spikes or sudden demand that pushes systems to their limits.
3. Disasters
Regional outages, multi-AZ failures, or infrastructure loss that require recovery mechanisms.
If teams test only one of these dimensions, they leave significant risks undiscovered.
True resilience requires verifying how systems behave across all three scenarios.
Continuous Verification in the SDLC
One of the biggest challenges Uma highlighted is how organizations treat resilience.
Many teams still see it as a “day-two problem”—something SREs will handle after systems are deployed.
Others assume that once resilience has been validated during system design, the problem is solved.
In reality, resilience must be continuously verified.
As systems evolve with each release, so do their failure modes. The most effective strategy is to:
- Test resilience continuously
- Verify resilience with every delivery
- Document results across releases
This approach shifts resilience testing into the outer loop of the SDLC, alongside functional and performance testing.
Instead of waiting for production incidents, teams proactively identify weaknesses before customers experience them.
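One way to make "verify resilience with every delivery" concrete is to gate the pipeline on an aggregate resilience score, in the spirit of the weighted probe-success score that LitmusChaos reports for chaos experiments. The threshold and experiment names below are illustrative, not a prescribed configuration:

```python
def resilience_score(results):
    """Weighted percentage of passed resilience checks.

    `results` maps experiment name -> (weight, passed). This mirrors the
    idea behind LitmusChaos's resilience score: the summed weight of
    passed experiments over the total weight, as a percentage.
    """
    total = sum(w for w, _ in results.values())
    passed = sum(w for w, ok in results.values() if ok)
    return 100.0 * passed / total if total else 0.0


def gate(results, threshold=90.0):
    """Fail the delivery if the resilience score drops below the threshold."""
    score = resilience_score(results)
    return score >= threshold, score


# Illustrative results for one release: the DR experiment failed.
release_results = {
    "pod-delete":       (10, True),
    "network-latency":  (5,  True),
    "zone-failover-dr": (10, False),
}
```

Running a gate like this on every delivery is what turns resilience testing into a release criterion, and tracking the score across releases gives you the documented history the talk calls for.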
Understanding Resilience Debt
Uma introduced an important concept: resilience debt.
Resilience debt is similar to technical debt. When teams postpone resilience validation, they leave hidden risks unresolved in the system.
Over time, that debt accumulates.
And when failure eventually occurs—which it inevitably will—the business impact grows proportionally to the resilience debt that was ignored.
The only way to reduce this risk is to steadily increase resilience testing coverage over time.
As testing matures across multiple quarters, organizations gain better feedback about system behavior, uncover more risks earlier, and continuously reduce the likelihood of severe outages.
A Holistic Approach to Resilience Testing
Another key takeaway from Uma’s session is that resilience testing should not happen in silos.
Many organizations treat chaos testing, load testing, and disaster recovery validation as separate initiatives owned by different teams.
But the most meaningful risks often appear when these scenarios intersect.
For example:
- A resource bottleneck might only appear when high traffic coincides with a service failure.
- Chaos experiments developed for reliability testing can also be reused in disaster recovery workflows.
- Combining chaos and load tests helps teams observe system behavior at failure limits under real-world conditions.
That’s why resilience testing must be approached as a holistic practice combining:
- Chaos Engineering
- Load Testing
- Disaster Recovery (DR) Validation
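The intersection effect is easy to state as a toy model: a service that absorbs peak load at full capacity may start shedding requests only when a fault degrades capacity at the same moment. A deliberately simplified sketch (all numbers are illustrative):

```python
def simulate(load_per_tick, ticks, fault_window, capacity, degraded_capacity):
    """Toy model of a combined chaos + load test.

    During `fault_window` (a range of tick indices), capacity drops to
    `degraded_capacity`; requests beyond available capacity fail.
    Returns the overall error rate -- the kind of signal a combined test
    surfaces that a chaos test or a load test alone would miss.
    """
    failed = 0
    total = 0
    for t in range(ticks):
        cap = degraded_capacity if t in fault_window else capacity
        total += load_per_tick
        failed += max(0, load_per_tick - cap)
    return failed / total
```

With load alone (no fault) or the fault alone (no peak load), this model reports zero errors; only the combination reveals the bottleneck, which is the argument for running these tests together rather than in silos.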
You can explore the fundamentals of resilience testing in the Harness documentation.
Collaboration Across Teams
Resilience testing also requires collaboration across multiple roles.
Developers, QA engineers, SREs, and platform teams all contribute to validating system reliability.
Uma pointed out that many organizations already share infrastructure for testing but run different experiments independently. By coordinating these efforts, teams can:
- reuse testing environments
- share chaos experiments across testing scenarios
- validate DR workflows more frequently
- improve testing efficiency across teams
Resilience becomes significantly stronger when personas, environments, and test assets are shared rather than siloed.
The Role of AI in Resilience Testing
As systems become more complex, another challenge emerges: knowing what to test and when.
Large organizations may have hundreds of potential experiments, making it difficult to prioritize testing effectively.
Uma described how agentic AI systems can help address this challenge.
By analyzing internal knowledge sources such as:
- incident data
- CI/CD pipeline history
- infrastructure configuration
- operational documentation
AI systems can recommend:
- the most relevant chaos experiments
- appropriate load testing scenarios
- disaster recovery tests that should run at a given time
These recommendations allow teams to run the right tests at the right moment, improving resilience coverage without overwhelming engineering teams.
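As a simple stand-in for this prioritization idea, candidate experiments could be scored by how often related incidents have occurred and whether the targeted service changed in the current release. The heuristic and field names below are purely illustrative, not Harness's actual algorithm:

```python
def prioritize(experiments, incidents, changed_services, top_n=3):
    """Rank candidate resilience experiments by a naive relevance score:
    +1 per past incident touching the same service, +2 if the service
    changed in the current release.
    """
    def score(exp):
        s = sum(1 for inc in incidents if inc["service"] == exp["service"])
        if exp["service"] in changed_services:
            s += 2
        return s

    return sorted(experiments, key=score, reverse=True)[:top_n]
```

A real agentic system would draw on far richer signals (pipeline history, configuration, documentation), but the principle is the same: spend a limited testing budget on the experiments most likely to expose risk right now.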
A Unified Platform for Resilience Testing
To support this holistic approach, Harness has expanded its original Chaos Engineering capabilities into a broader platform: Harness Resilience Testing.
The platform integrates multiple testing disciplines in a single environment, enabling teams to:
- design chaos experiments
- run load tests
- validate disaster recovery workflows
- observe system risk patterns in one place
By combining these capabilities, teams gain a single pane of glass for identifying resilience risks across the SDLC.
This unified view allows organizations to track trends in system reliability and proactively address weaknesses before they turn into production incidents.
Resilience Is a Core Practice for Modern SRE Teams
Uma closed the session with a clear conclusion: resilience testing is not optional.
Outages will happen. Infrastructure will fail. Traffic patterns will change. Dependencies will break.
What matters is whether organizations have continuously validated how their systems behave when those failures occur.
The more resilience testing coverage teams build over time, the more feedback they receive—and the lower the potential business impact becomes.
In modern software delivery, resilience is no longer just a reliability practice.
It is a core discipline of the enterprise SDLC.
Ready to start validating your system’s resilience?
Explore Harness Resilience Testing and start validating reliability across your SDLC.
