Infrastructure as Code (IaC) has revolutionized how we manage and provision infrastructure. But what about chaos engineering? Can you automate the setup of your chaos experiments the same way you provision your infrastructure?
The answer is yes. In this guide, I'll walk you through how to integrate Harness Chaos Engineering into your infrastructure using Terraform, making it easier to maintain resilient systems at scale.
Why Automate Chaos Engineering?
Before diving into the technical details, let's talk about why this matters.
Managing chaos engineering manually across multiple environments is time-consuming and error-prone. You need to set up chaos infrastructure, configure service discovery, manage security policies, and keep all of it consistent across dev, staging, and production.
With Terraform, you can:
- Version control your entire chaos engineering setup
- Replicate configurations across environments reliably
- Integrate chaos engineering into your existing IaC workflows
- Collaborate with your team using familiar tools
What You Can Automate
The Harness Terraform provider lets you automate several key aspects of chaos engineering:
Infrastructure Setup - Enable chaos engineering on your existing Kubernetes clusters or provision new ones with chaos capabilities built in.
Service Discovery - Automatically detect services that can be targeted for chaos experiments, eliminating manual configuration.
Image Registries - Configure custom image registries for your chaos experiment workloads, giving you control over where container images are pulled from.
Security Governance - Define and enforce policies that control when and how chaos experiments can run, particularly important for production environments.
ChaosHub Management - Manage repositories of reusable chaos experiments, probes, and actions at the organization or project level.
Getting Started
Before you begin, make sure you have:
- Terraform installed and configured
- The Harness Terraform provider set up (see the official documentation)
- A Kubernetes infrastructure where you want to enable chaos engineering
Currently, the Harness Terraform provider for chaos engineering supports Kubernetes infrastructures.
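As a minimal sketch of the provider setup listed in the prerequisites, the block below declares the provider and passes credentials in through variables. The variable names `harness_account_id` and `harness_platform_api_key` are placeholders chosen for this example, and the endpoint assumes the standard SaaS gateway; adjust both for your account.

```hcl
terraform {
  required_providers {
    harness = {
      source = "harness/harness"
    }
  }
}

# Placeholder variable names; supply values through tfvars or the environment.
variable "harness_account_id" {
  type = string
}

variable "harness_platform_api_key" {
  type      = string
  sensitive = true
}

provider "harness" {
  # Assumes the SaaS endpoint; self-managed installations use their own URL.
  endpoint         = "https://app.harness.io/gateway"
  account_id       = var.harness_account_id
  platform_api_key = var.harness_platform_api_key
}
```

Marking the API key as `sensitive` keeps it out of plan output, though it is still stored in state, so the state backend should be access-controlled.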
Building Your Configuration
Let's walk through the key resources you'll need.
Setting Up Common Configuration
Start by defining common variables that will be used across all your resources:
locals {
  # Use the supplied organization identifier, or the one Terraform creates.
  org_id = var.org_identifier != null ? var.org_identifier : harness_platform_organization.this[0].id

  # Use the supplied project identifier, or derive "<org_id>_<project_name>".
  project_id = var.project_identifier != null ? var.project_identifier : (
    "${local.org_id}_${replace(lower(var.project_name), " ", "_")}"
  )

  common_tags = merge(
    var.tags,
    {
      "module" = "harness-chaos-engineering"
    }
  )

  # Harness platform resources take tags as a list of "key=value" strings.
  tags_set = [for k, v in local.common_tags : "${k}=${v}"]
}
This approach keeps your configuration DRY and makes it easy to reference organization and project identifiers throughout your setup.
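The locals above rely on a handful of input variables. A possible set of declarations is sketched below; the defaults and descriptions are assumptions for illustration, not part of the original module.

```hcl
variable "org_identifier" {
  description = "Existing organization identifier. Leave null to create a new organization."
  type        = string
  default     = null
}

variable "org_name" {
  description = "Name for the new organization (only needed when org_identifier is null)."
  type        = string
  default     = null
}

variable "project_identifier" {
  description = "Existing project identifier. Leave null to create a new project."
  type        = string
  default     = null
}

variable "project_name" {
  description = "Name for the new project (only needed when project_identifier is null)."
  type        = string
  default     = null
}

variable "tags" {
  description = "Tags merged into local.common_tags."
  type        = map(string)
  default     = {}
}
```

Defaulting the identifiers to `null` is what lets the `!= null` checks in the locals choose between reusing and creating resources.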
Creating Organization and Project
If you don't have an existing organization or project, Terraform can create them:
resource "harness_platform_organization" "this" {
  # Create the organization only when no existing identifier is supplied.
  count = var.org_identifier == null ? 1 : 0

  identifier  = replace(lower(var.org_name), " ", "_")
  name        = var.org_name
  description = "Organization for Chaos Engineering"
  tags        = local.tags_set
}

resource "harness_platform_project" "this" {
  depends_on = [harness_platform_organization.this]

  # Create the project only when no existing identifier is supplied.
  count = var.project_identifier == null ? 1 : 0

  org_id      = local.org_id
  identifier  = local.project_id
  name        = var.project_name
  color       = var.project_color
  description = "Project for Chaos Engineering"
  tags        = local.tags_set
}
Setting Up Kubernetes Connector
Connect your Kubernetes cluster to Harness:
resource "harness_platform_connector_kubernetes" "this" {
  depends_on = [harness_platform_project.this]

  identifier = var.k8s_connector_name
  name       = var.k8s_connector_name
  org_id     = local.org_id
  project_id = local.project_id

  # Authenticate through a delegate already running in the cluster.
  inherit_from_delegate {
    delegate_selectors = var.delegate_selectors
  }

  tags = local.tags_set
}
Creating Environment and Infrastructure
Set up your environment and infrastructure definition:
resource "harness_platform_environment" "this" {
  depends_on = [
    harness_platform_project.this,
    harness_platform_connector_kubernetes.this
  ]

  identifier = var.environment_identifier
  name       = var.environment_name
  org_id     = local.org_id
  project_id = local.project_id
  type       = "PreProduction"
  tags       = local.tags_set
}
resource "harness_platform_infrastructure" "this" {
  depends_on = [
    harness_platform_environment.this,
    harness_platform_connector_kubernetes.this
  ]

  identifier      = var.infrastructure_identifier
  name            = var.infrastructure_name
  org_id          = local.org_id
  project_id      = local.project_id
  env_id          = harness_platform_environment.this.id
  deployment_type = var.deployment_type
  type            = "KubernetesDirect"

  # The YAML nesting matters: the fields sit under infrastructureDefinition,
  # and the connector settings sit under spec.
  yaml = <<-EOT
    infrastructureDefinition:
      name: ${var.infrastructure_name}
      identifier: ${var.infrastructure_identifier}
      orgIdentifier: ${local.org_id}
      projectIdentifier: ${local.project_id}
      environmentRef: ${harness_platform_environment.this.id}
      type: KubernetesDirect
      deploymentType: ${var.deployment_type}
      allowSimultaneousDeployments: false
      spec:
        connectorRef: ${var.k8s_connector_name}
        namespace: ${var.namespace}
        releaseName: release-${var.infrastructure_identifier}
  EOT

  tags = local.tags_set
}
Enabling Chaos Infrastructure
Now enable chaos engineering capabilities on your infrastructure:
resource "harness_chaos_infrastructure_v2" "this" {
  depends_on = [harness_platform_infrastructure.this]

  org_id         = local.org_id
  project_id     = local.project_id
  environment_id = harness_platform_environment.this.id
  infra_id       = harness_platform_infrastructure.this.id

  name                 = var.chaos_infra_name
  description          = var.chaos_infra_description
  namespace            = var.chaos_infra_namespace
  infra_type           = var.chaos_infra_type
  ai_enabled           = var.chaos_ai_enabled
  insecure_skip_verify = var.chaos_insecure_skip_verify
  service_account      = var.service_account_name
  tags                 = local.tags_set
}
Automating Service Discovery
Service discovery eliminates the need to manually register services for chaos experiments:
resource "harness_service_discovery_agent" "this" {
  depends_on = [harness_chaos_infrastructure_v2.this]

  name                   = var.service_discovery_agent_name
  org_identifier         = local.org_id
  project_identifier     = local.project_id
  environment_identifier = harness_platform_environment.this.id
  infra_identifier       = harness_platform_infrastructure.this.id
  installation_type      = var.sd_installation_type

  config {
    kubernetes {
      namespace = var.sd_namespace
    }
  }
}
Once deployed, the agent will automatically detect services running in your cluster, making them available for chaos experiments.
Configuring Custom Image Registries
For organizations that use private registries or have specific image sourcing requirements, you can configure custom image registries at both organization and project levels:
resource "harness_chaos_image_registry" "org_level" {
  depends_on = [harness_platform_organization.this]
  count      = var.setup_custom_registry ? 1 : 0

  # Organization scope: no project_id.
  org_id = local.org_id

  registry_server     = var.registry_server
  registry_account    = var.registry_account
  is_default          = var.is_default_registry
  is_override_allowed = var.is_override_allowed
  is_private          = var.is_private_registry
  secret_name         = var.registry_secret_name != "" ? var.registry_secret_name : null
  use_custom_images   = var.use_custom_images

  dynamic "custom_images" {
    for_each = var.use_custom_images ? [1] : []
    content {
      log_watcher = var.log_watcher_image != "" ? var.log_watcher_image : null
      ddcr        = var.ddcr_image != "" ? var.ddcr_image : null
      ddcr_lib    = var.ddcr_lib_image != "" ? var.ddcr_lib_image : null
      ddcr_fault  = var.ddcr_fault_image != "" ? var.ddcr_fault_image : null
    }
  }
}

resource "harness_chaos_image_registry" "project_level" {
  depends_on = [harness_chaos_image_registry.org_level]
  count      = var.setup_custom_registry ? 1 : 0

  # Project scope: same settings, plus project_id.
  org_id     = local.org_id
  project_id = local.project_id

  registry_server     = var.registry_server
  registry_account    = var.registry_account
  is_default          = var.is_default_registry
  is_override_allowed = var.is_override_allowed
  is_private          = var.is_private_registry
  secret_name         = var.registry_secret_name != "" ? var.registry_secret_name : null
  use_custom_images   = var.use_custom_images

  dynamic "custom_images" {
    for_each = var.use_custom_images ? [1] : []
    content {
      log_watcher = var.log_watcher_image != "" ? var.log_watcher_image : null
      ddcr        = var.ddcr_image != "" ? var.ddcr_image : null
      ddcr_lib    = var.ddcr_lib_image != "" ? var.ddcr_lib_image : null
      ddcr_fault  = var.ddcr_fault_image != "" ? var.ddcr_fault_image : null
    }
  }
}
Setting Up Git Connector for ChaosHub
To manage your chaos experiments in Git repositories, first create a Git connector:
resource "harness_platform_connector_git" "chaos_hub" {
  depends_on = [
    harness_platform_organization.this,
    harness_platform_project.this
  ]
  count = var.create_git_connector ? 1 : 0

  # Harness identifiers do not allow hyphens, so map every character outside
  # [a-z0-9_] to an underscore.
  identifier      = replace(lower(var.git_connector_name), "/[^a-z0-9_]/", "_")
  name            = var.git_connector_name
  description     = "Git connector for Chaos Hub"
  org_id          = local.org_id
  project_id      = local.project_id
  url             = var.git_connector_url
  connection_type = "Account"

  # SSH credentials, used when an SSH key reference is supplied.
  dynamic "credentials" {
    for_each = var.git_connector_ssh_key != "" ? [1] : []
    content {
      ssh {
        ssh_key_ref = var.git_connector_ssh_key
      }
    }
  }

  # HTTP credentials, used when no SSH key reference is supplied.
  dynamic "credentials" {
    for_each = var.git_connector_ssh_key == "" ? [1] : []
    content {
      http {
        username     = var.git_connector_username != "" ? var.git_connector_username : null
        password_ref = var.git_connector_password != "" ? var.git_connector_password : null

        dynamic "github_app" {
          for_each = var.github_app_id != "" ? [1] : []
          content {
            application_id  = var.github_app_id
            installation_id = var.github_installation_id
            private_key_ref = var.github_private_key_ref
          }
        }
      }
    }
  }

  validation_repo = var.git_connector_validation_repo

  # Connector tags are a list of strings, like the other platform resources.
  # Assumes var.chaos_hub_tags is a list of strings, as passed to the ChaosHub.
  tags = concat(
    var.chaos_hub_tags,
    [
      "managed_by=terraform",
      "purpose=chaos-hub-git-connector"
    ]
  )
}
This connector supports multiple authentication methods including SSH keys, HTTP credentials, and GitHub Apps, making it flexible for different Git hosting providers.
Managing ChaosHubs
ChaosHubs let you create libraries of reusable chaos experiments:
resource "harness_chaos_hub" "this" {
  depends_on = [harness_platform_connector_git.chaos_hub]
  count      = var.create_chaos_hub ? 1 : 0

  org_id      = local.org_id
  project_id  = local.project_id
  name        = var.chaos_hub_name
  description = var.chaos_hub_description

  # Use the connector created above if there is one, otherwise an existing one.
  connector_id    = var.create_git_connector ? one(harness_platform_connector_git.chaos_hub[*].id) : var.chaos_hub_connector_id
  repo_branch     = var.chaos_hub_repo_branch
  repo_name       = var.chaos_hub_repo_name
  is_default      = var.chaos_hub_is_default
  connector_scope = var.chaos_hub_connector_scope
  tags            = var.chaos_hub_tags

  lifecycle {
    ignore_changes = [tags]
  }
}
Depending on `var.create_git_connector`, the ChaosHub uses either the Git connector created above or an existing one referenced by `var.chaos_hub_connector_id`.
Implementing Security Governance
This is where things get interesting. Chaos Guard lets you define rules that control chaos experiment execution.
First, create conditions that define what you want to control:
resource "harness_chaos_security_governance_condition" "this" {
  depends_on = [
    harness_platform_environment.this,
    harness_platform_infrastructure.this,
    harness_chaos_infrastructure_v2.this,
  ]

  name        = var.security_governance_condition_name
  description = "Condition to block destructive experiments"
  org_id      = local.org_id
  project_id  = local.project_id
  infra_type  = var.security_governance_condition_infra_type

  # Which faults the condition matches.
  fault_spec {
    operator = var.security_governance_condition_operator

    dynamic "faults" {
      for_each = var.security_governance_condition_faults
      content {
        fault_type = faults.value.fault_type
        name       = faults.value.name
      }
    }
  }

  # Kubernetes-specific scoping.
  dynamic "k8s_spec" {
    for_each = var.security_governance_condition_infra_type == "KubernetesV2" ? [1] : []
    content {
      infra_spec {
        operator  = var.security_governance_condition_infra_operator
        infra_ids = ["${harness_platform_environment.this.id}/${harness_chaos_infrastructure_v2.this.id}"]
      }

      dynamic "application_spec" {
        for_each = var.security_governance_condition_application_spec != null ? [1] : []
        content {
          operator = var.security_governance_condition_application_spec.operator

          dynamic "workloads" {
            for_each = var.security_governance_condition_application_spec.workloads
            content {
              namespace = workloads.value.namespace
              kind      = workloads.value.kind
            }
          }
        }
      }

      dynamic "chaos_service_account_spec" {
        for_each = var.security_governance_condition_service_account_spec != null ? [1] : []
        content {
          operator         = var.security_governance_condition_service_account_spec.operator
          service_accounts = var.security_governance_condition_service_account_spec.service_accounts
        }
      }
    }
  }

  # Windows and Linux scoping.
  dynamic "machine_spec" {
    for_each = contains(["Windows", "Linux"], var.security_governance_condition_infra_type) ? [1] : []
    content {
      infra_spec {
        operator  = var.security_governance_condition_infra_operator
        infra_ids = var.security_governance_condition_infra_ids
      }
    }
  }

  lifecycle {
    ignore_changes = [name]
  }

  tags = [
    for k, v in merge(
      local.common_tags,
      {
        "platform" = lower(var.security_governance_condition_infra_type)
      }
    ) : "${k}=${v}"
  ]
}
This condition handles several infrastructure types. Kubernetes conditions are scoped through `k8s_spec`, while Windows and Linux conditions are scoped through `machine_spec`.
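The dynamic blocks in the condition only work if the input variables have matching shapes. The declarations below are one way to type them, inferred from how the resource reads each attribute; the defaults and the validation message are assumptions made for this sketch.

```hcl
variable "security_governance_condition_infra_type" {
  type = string

  validation {
    # The three values are the ones the resource above branches on.
    condition     = contains(["KubernetesV2", "Windows", "Linux"], var.security_governance_condition_infra_type)
    error_message = "infra_type must be one of KubernetesV2, Windows, or Linux."
  }
}

variable "security_governance_condition_faults" {
  type = list(object({
    fault_type = string
    name       = string
  }))
  default = []
}

variable "security_governance_condition_application_spec" {
  # null skips the application_spec block entirely.
  type = object({
    operator = string
    workloads = list(object({
      namespace = string
      kind      = string
    }))
  })
  default = null
}

variable "security_governance_condition_service_account_spec" {
  # null skips the chaos_service_account_spec block entirely.
  type = object({
    operator         = string
    service_accounts = list(string)
  })
  default = null
}
```

Validating `infra_type` up front means a typo fails at plan time instead of silently producing a condition with neither a `k8s_spec` nor a `machine_spec`.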
Then, create rules that apply these conditions with specific actions:
resource "harness_chaos_security_governance_rule" "this" {
  depends_on = [harness_chaos_security_governance_condition.this]

  name           = var.security_governance_rule_name
  description    = var.security_governance_rule_description
  org_id         = local.org_id
  project_id     = local.project_id
  is_enabled     = var.security_governance_rule_is_enabled
  condition_ids  = [harness_chaos_security_governance_condition.this.id]
  user_group_ids = var.security_governance_rule_user_group_ids

  dynamic "time_windows" {
    for_each = var.security_governance_rule_time_windows
    content {
      time_zone  = time_windows.value.time_zone
      start_time = time_windows.value.start_time
      duration   = time_windows.value.duration

      # Recurrence is optional per time window.
      dynamic "recurrence" {
        for_each = time_windows.value.recurrence != null ? [time_windows.value.recurrence] : []
        content {
          type  = recurrence.value.type
          until = recurrence.value.until
        }
      }
    }
  }

  lifecycle {
    ignore_changes = [name]
  }

  tags = [
    for k, v in merge(
      local.common_tags,
      {
        "platform" = lower(var.security_governance_condition_infra_type)
      }
    ) : "${k}=${v}"
  ]
}
This setup lets you restrict selected fault types for specific user groups, which matters most in production, so you can enable chaos engineering without fear of accidental damage. Time windows control when each rule is in effect.
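The rule iterates over `var.security_governance_rule_time_windows` and treats `recurrence` as optional. One way to declare that variable is sketched below. The attribute types (numeric timestamps for `start_time` and `until`, a string for `duration`) are assumptions, so check them against the provider documentation, and note that `optional()` requires Terraform 1.3 or later.

```hcl
variable "security_governance_rule_time_windows" {
  description = "Time windows during which the governance rule applies."
  type = list(object({
    time_zone  = string
    start_time = number # assumed to be a numeric timestamp
    duration   = string # assumed to be a duration string
    recurrence = optional(object({
      type  = string
      until = number # assumed to be a numeric timestamp
    }))
  }))
  default = []
}
```

Because `recurrence` is declared with `optional()`, callers can omit it and the attribute arrives as `null`, which is exactly what the `!= null` check in the dynamic block expects.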
What Happens After Deployment
Once you've applied your Terraform configuration:
- Your service discovery agent starts detecting applications in your configured environments automatically
- Your security governance rules are active, controlling how chaos experiments can be executed
- Your custom ChaosHubs are synchronized and available for use
- Custom image registries are configured if you're using private registries
At this point, you can use the Harness UI to create and configure specific chaos experiments, then execute them against your discovered services. The infrastructure and governance layer is handled by Terraform, while the experiment design remains flexible and can be adjusted through the UI.
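To find the resources Terraform created when you move over to the UI, it can help to expose their identifiers as outputs. A small sketch follows; the output names are arbitrary choices for this example.

```hcl
output "chaos_infrastructure_id" {
  description = "Identifier of the chaos infrastructure."
  value       = harness_chaos_infrastructure_v2.this.id
}

output "service_discovery_agent_id" {
  description = "Identifier of the service discovery agent."
  value       = harness_service_discovery_agent.this.id
}

output "chaos_hub_id" {
  description = "Identifier of the ChaosHub, or null when none was created."
  value       = one(harness_chaos_hub.this[*].id)
}

output "security_governance_rule_id" {
  description = "Identifier of the governance rule."
  value       = harness_chaos_security_governance_rule.this.id
}
```

The `one()` call mirrors the pattern used for the Git connector: it returns `null` instead of failing when `count` is 0.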
Putting It All Together
Here's a practical example of calling a module that packages the resources above, assumed here to live in ./modules/chaos-engineering:
module "chaos_engineering" {
  source = "./modules/chaos-engineering"

  # Organization and Project (Harness identifiers cannot contain hyphens)
  org_identifier     = "my_org"
  project_identifier = "production"

  # Infrastructure (input names match the variables used in the resources above)
  environment_identifier    = "prod_k8s"
  infrastructure_identifier = "k8s_cluster_01"
  namespace                 = "default"

  # Chaos Infrastructure
  chaos_infra_name      = "prod-chaos-infra"
  chaos_infra_namespace = "harness-chaos"
  chaos_ai_enabled      = true

  # Service Discovery
  service_discovery_agent_name = "prod-service-discovery"
  sd_namespace                 = "harness-delegate-ng"

  # Custom Registry (optional)
  setup_custom_registry = true
  registry_server       = "my-registry.io"
  registry_account      = "chaos-experiments"
  is_private_registry   = true

  # Git Connector for ChaosHub
  # The connector uses connection_type = "Account", so the URL points at the
  # account and a repository is named separately for validation.
  create_git_connector          = true
  git_connector_name            = "chaos-experiments-git"
  git_connector_url             = "https://github.com/myorg"
  git_connector_validation_repo = "chaos-experiments"
  git_connector_username        = "myuser"
  git_connector_password        = "account.github_token" # Harness secret reference, not the token itself

  # ChaosHub
  create_chaos_hub      = true
  chaos_hub_name        = "production-experiments"
  chaos_hub_repo_branch = "main"
  chaos_hub_repo_name   = "chaos-experiments"

  # Security Governance
  security_governance_condition_name = "block-destructive-faults"
  security_governance_condition_faults = [
    {
      fault_type = "pod-delete"
      name       = "pod-delete"
    }
  ]
  security_governance_rule_name           = "production-safety-rule"
  security_governance_rule_user_group_ids = ["platform-team"]
  security_governance_rule_is_enabled     = true

  # Tags
  tags = {
    environment = "production"
    managed_by  = "terraform"
    team        = "platform"
  }
}
Best Practices
As you build out your chaos engineering automation, keep these practices in mind:
Start with non-production environments - Test your Terraform configurations and governance rules in development or staging before rolling out to production.
Use separate state files - Maintain separate Terraform state files for different environments to prevent accidental cross-environment changes.
Version your chaos experiments - Store experiment definitions in Git repositories and reference them through ChaosHubs for better collaboration and change tracking.
Leverage conditional resource creation - Use the count meta-argument to create optional resources, such as custom registries or Git connectors, only when you need them.
Implement proper authentication - Use Harness secrets management for storing sensitive credentials like registry passwords and Git authentication tokens.
Review governance rules regularly - As your understanding of system resilience grows, update your governance conditions and rules to reflect new insights.
Use time windows strategically - Configure governance rules with time windows to allow experiments only during business hours or maintenance windows.
Tag everything - Proper tagging helps with cost tracking, resource management, and understanding relationships between resources.
Combine with CI/CD - Integrate your chaos engineering Terraform configurations into your CI/CD pipelines for fully automated infrastructure deployment.
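For the separate-state-files practice, one common approach is Terraform's partial backend configuration: the backend block stays empty in code, and each environment supplies its own settings at init time. The sketch below assumes an S3 backend; the bucket name, key, region, and file path are all hypothetical.

```hcl
# backend.tf: leave the backend settings out of the shared code.
terraform {
  backend "s3" {}
}

# envs/staging.s3.tfbackend (a separate, hypothetical file), loaded with:
#   terraform init -backend-config=envs/staging.s3.tfbackend
#
# bucket = "my-terraform-state"
# key    = "chaos-engineering/staging/terraform.tfstate"
# region = "us-east-1"
```

A production file would differ only in its `key` (and possibly its bucket), so a plan run against staging can never touch production state.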
Moving Forward
Automating chaos engineering with Terraform removes friction from adopting resilience testing practices. You can now treat your chaos engineering setup like any other infrastructure component, with version control, code review, and automated deployment.
The key is starting small. Pick one environment, set up the basic infrastructure and service discovery, then gradually add governance rules and custom experiments as you learn what works for your systems.
For more details on specific resources and configuration options, check out the Harness Terraform Provider documentation.
What aspects of chaos engineering do you think would benefit most from automation in your organization?
Important Links:
New to Harness Chaos Engineering? Sign up here
Looking for the Chaos Engineering documentation? Go here: Chaos Engineering