Chaos engineering is crucial for building resilient systems, but scaling these practices requires specialized expertise that creates bottlenecks in many organizations. The AI Reliability Agent in Harness Chaos Engineering addresses this challenge by using artificial intelligence to automate experiment recommendations, provide remediation guidance, and make chaos engineering accessible to teams at any skill level. This experimental feature works with Kubernetes infrastructures and integrates seamlessly into existing Harness workflows.
Chaos engineering has proven essential for building resilient systems, but scaling these practices across teams remains challenging. The biggest hurdle? Not everyone has the specialized knowledge needed to create effective chaos experiments, interpret results, or implement fixes when problems are discovered.
To address this challenge, Harness has developed an AI-powered approach within their Chaos Engineering module. The AI Reliability Agent leverages artificial intelligence to make chaos engineering more accessible and effective for teams at different skill levels.
Most organizations face similar challenges when trying to expand their chaos engineering practices. Teams need to develop expertise in creating meaningful experiments with appropriate parameters, running them effectively, and most importantly, knowing what to do when experiments reveal system weaknesses.
This learning curve often creates bottlenecks where only a few team members can effectively use chaos engineering tools, limiting how widely these practices can be adopted across the organization.
The AI Reliability Agent in Harness Chaos Engineering addresses these challenges by automating many of the decision-making processes that traditionally required deep expertise. Instead of teams figuring everything out from scratch, the AI provides guidance based on your specific environment and infrastructure patterns.
Currently, the agent works with Kubernetes infrastructures that are driven by Harness Delegate, where there's enough standardization to provide meaningful recommendations.
The AI analyzes your environment monitoring data and recommends new chaos experiments with pre-tuned parameters. Rather than guessing which experiments might be valuable, teams get specific suggestions tailored to their infrastructure's characteristics and potential failure modes.
Instead of running experiments randomly, the agent provides strategic guidance on which specific experiments to run, complete with clear reasoning about what resilience aspects are being verified. This helps teams focus their testing efforts on the most impactful areas.
When chaos experiments reveal weaknesses through failed probes, the AI doesn't just identify problems. It provides customized fix recommendations specifically designed to improve application resilience, turning discovered vulnerabilities into actionable improvement opportunities.
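To make the idea of "failed probes" concrete, here is a minimal sketch of how an experiment run and its probe results might be modeled, and how the failures that call for a fix could be picked out. The class names, fields, and sample values are hypothetical illustrations, not the Harness data model or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeResult:
    """Hypothetical record of one probe evaluated during an experiment."""
    name: str
    passed: bool
    detail: str = ""


@dataclass
class ExperimentRun:
    """Hypothetical record of one chaos experiment run and its probes."""
    experiment: str
    probes: List[ProbeResult]


def failed_probes(run: ExperimentRun) -> List[ProbeResult]:
    """Return the probes that did not hold while the fault was injected."""
    return [probe for probe in run.probes if not probe.passed]


def needs_remediation(run: ExperimentRun) -> bool:
    """A run needs a fix recommendation if at least one probe failed."""
    return len(failed_probes(run)) > 0


# Illustrative sample data (assumed, not taken from a real run).
sample_run = ExperimentRun(
    experiment="pod-delete-checkout",
    probes=[
        ProbeResult("http-health-check", passed=True),
        ProbeResult("p95-latency-under-500ms", passed=False,
                    detail="latency exceeded the threshold during the fault"),
    ],
)
```

In this sketch, only the latency probe would be flagged, which is the kind of signal a remediation recommendation would then be attached to.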
The agent streamlines the entire process by allowing teams to create recommended experiments or apply suggested fixes with minimal effort, reducing the friction between insight and action.
Setting up the AI Reliability Agent in Harness is straightforward, though it requires coordination with your Harness account team since this is currently an experimental feature.
The first step is reaching out to your Harness sales representative to enable the AI Reliability Agent feature flag for your account. Since this is an experimental feature under the CHAOS_AI_RECOMMENDATION_DEV flag, it's not available by default.
Once the feature flag is enabled, configuration happens within your existing Harness Chaos Engineering module:
Navigate to Your Environment: In the Harness platform, go to the Chaos Engineering module and select "Environments" from the left navigation menu. Choose the environment where you want to enable AI capabilities.
Enable AI for Infrastructure: Select an existing Kubernetes infrastructure and access the edit options through the "More Options" menu. In the infrastructure edit panel, you'll find an "Enable AI" toggle at the top of the interface.
Activate and Save: Turn on the toggle to enable the Harness AI Agent to perform tasks on this infrastructure, then save your changes. The AI Reliability Agent will immediately begin analyzing your experiment results and providing recommendations.
You can easily identify which infrastructures have AI enabled by looking for the "AI Enabled" badge next to their names in the infrastructure list.
While the AI Reliability Agent provides powerful automation capabilities, it's important to understand how it works. The agent may leverage public LLMs, such as those offered by OpenAI, when generating fix recommendations, so you should always validate these suggestions with your application or infrastructure experts before implementing them in production.
The goal is to augment human expertise, not replace it. The AI provides intelligent recommendations, but the final decisions about implementation should always involve people who understand your specific systems and business requirements.
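One way to picture this human-in-the-loop principle is an approval gate: an AI-generated fix cannot be applied until a named reviewer has signed off. The sketch below is only an illustration under that assumption; `FixRecommendation`, `approve`, and `apply_fix` are hypothetical names and are not part of Harness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FixRecommendation:
    """Hypothetical AI-generated fix awaiting human review."""
    summary: str
    approved_by: Optional[str] = None  # reviewer name once signed off

    @property
    def approved(self) -> bool:
        return self.approved_by is not None


def approve(rec: FixRecommendation, reviewer: str) -> FixRecommendation:
    """Record that a human expert has validated the recommendation."""
    if not reviewer.strip():
        raise ValueError("a reviewer name is required")
    rec.approved_by = reviewer.strip()
    return rec


def apply_fix(rec: FixRecommendation) -> str:
    """Refuse to apply any recommendation that has not been reviewed."""
    if not rec.approved:
        raise PermissionError("recommendation has not been reviewed by a human")
    return f"applied: {rec.summary} (reviewed by {rec.approved_by})"
```

Calling `apply_fix` on an unreviewed recommendation raises `PermissionError`; after `approve(rec, "alice")`, the same call succeeds. The design choice is that the gate lives in the apply step itself, so a skipped review fails loudly rather than silently.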
What makes this approach particularly valuable is how it integrates with the broader Harness platform. Teams can leverage AI recommendations within their existing chaos engineering workflows without having to learn new tools or processes. The AI works behind the scenes, analyzing patterns and providing guidance through the same interface teams are already using.
Beyond the AI Reliability Agent, Harness has developed Model Context Protocol (MCP) tools for Chaos Engineering that extend AI integration even further. These tools allow you to integrate chaos engineering capabilities directly with popular AI development environments like Windsurf, VS Code, Claude Desktop, and Cursor.
This means you can interact with your chaos engineering workflows using natural language directly within your preferred development tools. Whether you're planning experiments, analyzing results, or implementing fixes, the MCP tools provide a seamless bridge between your AI assistant and Harness Chaos Engineering capabilities. Check out the video tutorial and blog about it.
The AI Reliability Agent represents an interesting evolution in chaos engineering tooling. By making these practices more accessible, tools like this can help more teams adopt resilience testing without requiring everyone to become chaos engineering experts.
As distributed systems continue to grow in complexity, having intelligent assistance for reliability testing becomes increasingly valuable. The combination of proven chaos engineering principles with AI guidance offers a practical path for organizations using Harness to scale their resilience practices effectively.
For teams already using Harness Chaos Engineering, the AI Reliability Agent provides a natural next step in evolving their reliability practices. The key is finding the right balance between automation and human oversight, ensuring that AI enhances capabilities while maintaining the critical thinking that effective chaos engineering requires.
New to Harness Chaos Engineering? Sign up here
Trying to find the documentation for Chaos Engineering? Go here