Recommended Experiments for Production Resilience in Harness Chaos Engineering

January 9, 2026

This guide covers battle-tested chaos experiments for Kubernetes, AWS, Azure, and GCP to help you validate production resilience before real failures happen. Start with low blast radius experiments (pod-level) and gradually progress to higher impact scenarios (node/zone failures), always defining clear hypotheses and using probes to measure results.

Building reliable distributed systems isn't just about writing good code. It's about understanding how your systems behave when things go wrong. That's where chaos engineering comes in.

If you've been wondering where to start with chaos experiments or what scenarios matter most for your infrastructure, this guide walks through battle-tested experiments that engineering teams use to validate production resilience.

Why These Experiments Matter

Here's the thing about production failures: they're not just theoretical. Network issues happen. Availability zones go down. Resources get exhausted. The question isn't whether these failures will occur, but whether your system can handle them gracefully when they do.

The experiments we'll cover are based on real-world failure scenarios that teams encounter in production. We've organized them by infrastructure type so you can quickly find what's relevant to your stack.

A quick tip before we dive in: Start with lower blast radius experiments (like pod-level faults) before progressing to higher impact scenarios (like node or zone failures). This gives you confidence in your testing approach and helps you understand your system's behavior patterns.

Understanding Your Infrastructure Needs

Different infrastructure types face different challenges. Here's what we'll cover:

Kubernetes: network faults, resource exhaustion, availability zone failures, and pod lifecycle events.

AWS: EC2 instance stops, EBS volume loss, ALB availability zone failures, and RDS reboots.

Azure: VM instance stops, disk loss, and App Service disruptions.

GCP: VM instance stops and persistent disk loss.

Let's explore each of these in detail.

Kubernetes: The Foundation of Modern Applications

For Kubernetes environments, chaos experiments typically focus on four key areas. Let's walk through each one.

Network Resilience Testing

Network-related failures are among the most common issues in distributed systems. Your application might be perfectly coded, but if it can't handle network degradation, you're setting yourself up for production incidents.

Here are the experiments that matter:

Pod Network Loss tests application resilience to network packet loss at the pod level. This is your first line of defense for understanding how individual components handle network issues.

Node Network Loss simulates network issues affecting entire nodes. This is a node-level experiment that helps you understand how your system behaves when an entire node becomes unreachable.

Pod Network Latency tests application behavior under high latency conditions at the pod level. Latency often reveals performance bottlenecks and timeout configuration issues.

Pod API Block allows you to block specific API endpoints or services at the pod level. This is particularly useful for testing service dependencies and circuit breaker implementations.
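To make this concrete, here's a minimal sketch of what a pod-level network fault can look like when expressed as a LitmusChaos-style ChaosEngine manifest. Harness Chaos Engineering is built on LitmusChaos, but the YAML your platform generates may differ, and the names, namespaces, and values below are purely illustrative:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: frontend-network-latency      # illustrative name
  namespace: litmus                   # namespace where the chaos infrastructure runs (assumption)
spec:
  engineState: active
  annotationCheck: "false"
  appinfo:
    appns: production                 # namespace of the application under test (placeholder)
    applabel: app=frontend            # label selector for the target pods (placeholder)
    appkind: deployment
  chaosServiceAccount: litmus-admin   # service account with chaos permissions (assumption)
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"             # start small: 60 seconds
            - name: NETWORK_LATENCY
              value: "2000"           # injected latency in milliseconds
            - name: PODS_AFFECTED_PERC
              value: "50"             # limit the blast radius to half the matching pods
```

The same structure applies to the other network faults; only the experiment name and its env tunables change.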

Resource Exhaustion Testing

Resource exhaustion is another common failure mode. How does your application behave when CPU or memory becomes constrained? These experiments help you understand whether your resource limits are set correctly and how your application handles resource constraints before they become production problems.

Pod CPU Hog tests application behavior under CPU pressure at the pod level. This helps validate whether your CPU limits are appropriate and how your application degrades under CPU constraints.

Pod Memory Hog validates memory limit handling and out-of-memory (OOM) scenarios at the pod level. Understanding memory behavior prevents unexpected pod restarts in production.

Node CPU Hog tests node-level CPU exhaustion. This experiment reveals how your cluster handles resource pressure when an entire node's CPU is saturated.

Node Memory Hog simulates node memory pressure at the node level. This is critical for understanding how Kubernetes evicts pods and manages memory across your cluster.
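For illustration, a Pod Memory Hog run in the same LitmusChaos-style ChaosEngine format might look like the sketch below; the target labels and memory value are placeholders, not recommendations for your workload:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-memory-hog           # illustrative name
  namespace: litmus
spec:
  engineState: active
  annotationCheck: "false"
  appinfo:
    appns: production                 # placeholder namespace
    applabel: app=checkout            # placeholder target label
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-memory-hog
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: MEMORY_CONSUMPTION
              value: "500"            # MiB of memory to consume inside the target container
            - name: PODS_AFFECTED_PERC
              value: "50"
```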

Availability Zone Failures

Multi-AZ deployments are great for resilience, but only if they're actually resilient. Zone failure experiments validate that your multi-AZ setup works as expected.

Node Network Loss can simulate complete zone failure when configured with node labels to target specific zones. This is your primary tool for validating zone-level resilience.

Pod Network Loss enables zone-level pod network isolation by targeting pods in specific zones. This gives you more granular control over which applications you test during zone failures.

For detailed zone failure configurations, see the Simulating Zonal Failures section below.

Pod Lifecycle Testing

Pods come and go. That's the nature of Kubernetes. But does your application handle these transitions gracefully? These experiments ensure your application handles the dynamic nature of Kubernetes without dropping requests or losing data.

Pod Delete tests graceful shutdown and restart behavior at the pod level. This is fundamental for validating that your application can handle rolling updates and scaling events.

Container Kill validates container restart policies at the container level. This ensures that individual container failures don't cascade into broader application issues.

Pod Autoscaler tests Horizontal Pod Autoscaler (HPA) behavior under load at the pod level. This validates that your autoscaling configuration responds appropriately to demand changes.
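A Pod Delete run follows the same pattern. This is a sketch in the LitmusChaos-style manifest shown earlier, with illustrative values:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-pod-delete                # illustrative name
  namespace: litmus
spec:
  engineState: active
  annotationCheck: "false"
  appinfo:
    appns: production                 # placeholder namespace
    applabel: app=api-service         # placeholder target label
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "15"             # delete a pod every 15 seconds during the window
            - name: FORCE
              value: "false"          # "true" skips the grace period for a harsher test
            - name: PODS_AFFECTED_PERC
              value: "50"
```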

Simulating Zonal Failures

Zonal failures simulate complete availability zone outages, which are critical for validating multi-AZ deployments. Let's look at how to configure these experiments properly.

Node Network Loss for Zonal Failures

The Node Network Loss experiment simulates a complete zone failure by blocking all network traffic to nodes in a specific availability zone.

Key Parameters:

TOTAL_CHAOS_DURATION should be set to 300 seconds (5 minutes) for realistic zone failure testing. This duration gives you enough time to observe failover behavior and recovery processes.

NETWORK_PACKET_LOSS_PERCENTAGE should be set to 100% to achieve complete network isolation, simulating a total zone failure rather than degraded connectivity.

NETWORK_INTERFACE typically uses eth0 as the primary network interface. Verify your cluster's network configuration if you're using a different interface name.

NODES_AFFECTED_PERC should be set to 100 to affect all nodes matching the target label, ensuring complete zone isolation.

NODE_LABEL is critical for targeting specific availability zones. Use topology.kubernetes.io/zone=<zone-name> to select nodes in a particular zone.

Common Zone Labels:

For AWS deployments, use topology.kubernetes.io/zone=us-east-1a (or your specific zone).

For GCP deployments, use topology.kubernetes.io/zone=us-central1-a (or your specific zone).

For Azure deployments, use topology.kubernetes.io/zone=eastus-1 (or your specific zone).
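Putting those parameters together, a zonal Node Network Loss configuration might look like the following sketch. It uses the LitmusChaos-style manifest shown earlier, and the zone label value is just the AWS example from above:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: zone-failure-us-east-1a       # illustrative name
  namespace: litmus
spec:
  engineState: active
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: node-network-loss
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "300"            # 5 minutes of complete zone isolation
            - name: NETWORK_PACKET_LOSS_PERCENTAGE
              value: "100"
            - name: NETWORK_INTERFACE
              value: "eth0"
            - name: NODES_AFFECTED_PERC
              value: "100"
            - name: NODE_LABEL
              value: "topology.kubernetes.io/zone=us-east-1a"   # AWS example zone
```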

Pod Network Loss for Zonal Failures

The Pod Network Loss experiment provides more granular control by targeting specific applications within a zone. This is useful when you want to test how individual services handle zone failures without affecting your entire infrastructure.

Key Parameters:

TARGET_NAMESPACE specifies the namespace containing your target application. This allows you to isolate experiments to specific environments or teams.

APP_LABEL uses an application label selector (e.g., app=frontend) to target specific applications. This gives you precise control over which services are affected.

TOTAL_CHAOS_DURATION should be set to 300 seconds for realistic zone failure scenarios, matching the duration used in node-level experiments.

NETWORK_PACKET_LOSS_PERCENTAGE should be 100% to simulate complete network isolation for the targeted pods.

PODS_AFFECTED_PERC determines the percentage of pods matching your criteria to affect. Set to 100 for complete zone failure simulation, or lower values for partial failures.

NETWORK_INTERFACE typically uses eth0 as the primary network interface for pod networking.

NODE_LABEL should use topology.kubernetes.io/zone=<zone-name> to target pods running in a specific availability zone.
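The pod-level variant looks similar; here is a sketch with placeholder names and the GCP example zone from above:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: frontend-zone-isolation       # illustrative name
  namespace: litmus
spec:
  engineState: active
  annotationCheck: "false"
  appinfo:
    appns: production                 # TARGET_NAMESPACE (placeholder)
    applabel: app=frontend            # APP_LABEL (placeholder)
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-loss
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "300"
            - name: NETWORK_PACKET_LOSS_PERCENTAGE
              value: "100"
            - name: PODS_AFFECTED_PERC
              value: "100"
            - name: NETWORK_INTERFACE
              value: "eth0"
            - name: NODE_LABEL
              value: "topology.kubernetes.io/zone=us-central1-a"   # GCP example zone
```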

Network Experiment Best Practices

When running network experiments, there are some important considerations to keep in mind.

General Guidelines

Start Small: Begin with shorter durations (30-60 seconds) and gradually increase as you build confidence in your experiments and understand your system's behavior.

Use Probes: Always configure health probes to validate application behavior during experiments. This gives you objective data about whether your hypothesis was correct.

Monitor Metrics: Track application and infrastructure metrics during experiments. CPU usage, memory consumption, request latency, and error rates are all critical indicators.

Schedule Wisely: Run experiments during maintenance windows or low-traffic periods initially. As you build confidence, you can move to running experiments during normal operations.

Document Results: Keep records of experiment outcomes and system behavior. This creates institutional knowledge and helps track improvements over time.

Pod Network Loss Considerations

One important thing to understand: Pod Network Loss experiments always block egress traffic from the target pods. This is crucial for experiment design. You can configure specific destination hosts or IPs to block, or you can simulate complete network isolation.

Important Parameters:

TARGET_NAMESPACE specifies your target namespace (e.g., production). This ensures experiments run in the correct environment.

APP_LABEL uses an application label selector like app=api-service to target specific applications precisely.

TOTAL_CHAOS_DURATION sets the experiment duration, typically 180 seconds (3 minutes) for most scenarios.

DESTINATION_HOSTS allows you to specify particular services to block using comma-separated hostnames (e.g., database.example.com). Leave empty to block all egress traffic.

DESTINATION_IPS lets you block specific IP addresses using comma-separated values (e.g., 10.0.1.50). This is useful when you know the exact IPs of backend services.

PODS_AFFECTED_PERC determines what percentage of matching pods to affect. Set to 100 to test complete service isolation.

NETWORK_INTERFACE specifies the network interface to target, typically eth0 for standard Kubernetes deployments.
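Assuming the ChaosEngine structure shown in the zonal examples above, only the env block changes for an egress-blocking run. The hostnames and IPs below are the placeholders from the parameter descriptions:

```yaml
# env block for a pod-network-loss fault that blocks egress to specific backends
env:
  - name: TOTAL_CHAOS_DURATION
    value: "180"
  - name: NETWORK_PACKET_LOSS_PERCENTAGE
    value: "100"
  - name: DESTINATION_HOSTS
    value: "database.example.com"     # block only traffic to this host (placeholder)
  - name: DESTINATION_IPS
    value: "10.0.1.50"                # and/or traffic to this IP (placeholder)
  - name: PODS_AFFECTED_PERC
    value: "100"
  - name: NETWORK_INTERFACE
    value: "eth0"
```

Leaving DESTINATION_HOSTS and DESTINATION_IPS empty blocks all egress traffic instead.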

Pod API Block for Egress Traffic

When using Pod API Block, you have fine-grained control. You can block specific API paths, target particular services, and choose whether to block egress or ingress traffic.

Important Parameters for Egress:

TARGET_CONTAINER specifies the container name within the pod that will experience the API block.

TARGET_SERVICE_PORT sets the target service port (e.g., 8080) for the API endpoint you're testing.

TOTAL_CHAOS_DURATION determines experiment duration, typically 180 seconds for API-level testing.

PATH_FILTER allows you to block a specific API path like /api/v1/users, enabling surgical testing of individual endpoints.

DESTINATION_HOSTS specifies target service hostnames using comma-separated values (e.g., api.example.com).

SERVICE_DIRECTION should be set to egress for blocking outbound API calls from the target container.

PODS_AFFECTED_PERC determines the percentage of pods to affect, typically 100 for comprehensive testing.
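Here is how those egress parameters might come together in a LitmusChaos-style manifest. The fault name, container name, and hostnames are assumptions based on the experiment title and the examples above:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: block-user-api-egress         # illustrative name
  namespace: litmus
spec:
  engineState: active
  annotationCheck: "false"
  appinfo:
    appns: production                 # placeholder namespace
    applabel: app=api-service         # placeholder target label
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-api-block             # fault name assumed from the experiment title
      spec:
        components:
          env:
            - name: TARGET_CONTAINER
              value: "api-service"    # container inside the target pod (placeholder)
            - name: TARGET_SERVICE_PORT
              value: "8080"
            - name: TOTAL_CHAOS_DURATION
              value: "180"
            - name: PATH_FILTER
              value: "/api/v1/users"
            - name: DESTINATION_HOSTS
              value: "api.example.com"
            - name: SERVICE_DIRECTION
              value: "egress"
            - name: PODS_AFFECTED_PERC
              value: "100"
```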

Pod API Block for Ingress Traffic

For ingress testing, you could block incoming health check requests to see how your monitoring responds.

Important Parameters for Ingress:

TARGET_CONTAINER specifies the container name within the pod that will block incoming requests.

TARGET_SERVICE_PORT sets the port receiving traffic, typically 8080 or your application's serving port.

TOTAL_CHAOS_DURATION determines the experiment duration, usually 180 seconds for health check testing.

PATH_FILTER allows you to block a specific incoming path like /health to test monitoring resilience.

SOURCE_HOSTS specifies source hostnames using comma-separated values (e.g., monitoring.example.com).

SOURCE_IPS lets you target specific source IP addresses using comma-separated values (e.g., 10.0.2.100).

SERVICE_DIRECTION should be set to ingress for blocking incoming requests to the target container.

PODS_AFFECTED_PERC determines the percentage of pods to affect, typically 100 for complete testing.
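The ingress variant only swaps the direction and the source filters. A sketch of the env block, with placeholders throughout:

```yaml
# env block for an ingress pod-api-block fault that drops incoming /health checks
env:
  - name: TARGET_CONTAINER
    value: "api-service"              # placeholder container name
  - name: TARGET_SERVICE_PORT
    value: "8080"
  - name: TOTAL_CHAOS_DURATION
    value: "180"
  - name: PATH_FILTER
    value: "/health"
  - name: SOURCE_HOSTS
    value: "monitoring.example.com"   # placeholder
  - name: SOURCE_IPS
    value: "10.0.2.100"               # placeholder
  - name: SERVICE_DIRECTION
    value: "ingress"
  - name: PODS_AFFECTED_PERC
    value: "100"
```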

AWS: Cloud Infrastructure Resilience

AWS infrastructure brings its own set of failure modes. Here's what matters most for AWS workloads.

Recommended AWS Experiments

EC2 Stop simulates EC2 instance failure with high impact. This tests your application's ability to handle sudden instance termination and validates auto-scaling group behavior.

EBS Loss tests application behavior on volume detachment with high impact. This is critical for applications with persistent storage requirements.

ALB AZ Down simulates load balancer AZ failure with medium impact. This validates that your multi-AZ load balancer configuration works as expected.

RDS Reboot tests database failover with high impact. This ensures your database layer can handle planned and unplanned reboots.

Important: AWS experiments require proper IAM permissions. See AWS Fault Permissions for details.

EC2 Stop by ID

The EC2 Stop by ID experiment stops EC2 instances to test application resilience to instance failures and validate failover capabilities.

Key Parameters:

EC2_INSTANCE_ID accepts a comma-separated list of target EC2 instance IDs. You can target a single instance or multiple instances simultaneously.

REGION specifies the AWS region name of the target instances (e.g., us-east-1). All instances in a single experiment must be in the same region.

TOTAL_CHAOS_DURATION is typically set to 30 seconds, which is long enough to trigger failover mechanisms while minimizing impact.

CHAOS_INTERVAL determines the interval between successive instance terminations, typically 30 seconds for sequential failures.

SEQUENCE can be either parallel or serial. Use parallel to stop all instances simultaneously, or serial to stop them one at a time.

MANAGED_NODEGROUP should be set to disable for standard EC2 instances, or enable for self-managed node groups in EKS.
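As a sketch, an EC2 stop fault could look like this in the same LitmusChaos-style manifest. The fault name is assumed from the experiment title, the instance ID is a placeholder, and the AWS credential/connector wiring that cloud faults require is omitted:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: ec2-stop-test                 # illustrative name
  namespace: litmus
spec:
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: ec2-stop-by-id            # fault name assumed from the experiment title
      spec:
        components:
          env:
            - name: EC2_INSTANCE_ID
              value: "i-0abcd1234efgh5678"   # placeholder instance ID
            - name: REGION
              value: "us-east-1"
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "30"
            - name: SEQUENCE
              value: "parallel"       # stop all listed instances at once
            - name: MANAGED_NODEGROUP
              value: "disable"        # standard EC2 instances, not an EKS node group
```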

EBS Loss by ID

The EBS Loss by ID experiment detaches EBS volumes to test application behavior when storage becomes unavailable.

Key Parameters:

EBS_VOLUME_ID accepts a comma-separated list of EBS volume IDs to detach. Choose volumes that are critical to your application's operation.

REGION specifies the region name for the target volumes (e.g., us-east-1). Ensure volumes and instances are in the same region.

TOTAL_CHAOS_DURATION is typically 30 seconds, giving you enough time to observe storage failure behavior without extended downtime.

CHAOS_INTERVAL sets the interval between attachment and detachment cycles, usually 30 seconds.

SEQUENCE determines whether volumes are detached in parallel or serial order. Parallel tests simultaneous storage failures.
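The EBS fault follows the same shape as the EC2 example above, so only the env block is sketched here. The fault name is assumed from the experiment title and the volume ID is a placeholder:

```yaml
# env block for an ebs-loss-by-id fault
env:
  - name: EBS_VOLUME_ID
    value: "vol-0abcd1234efgh5678"    # placeholder volume ID
  - name: REGION
    value: "us-east-1"
  - name: TOTAL_CHAOS_DURATION
    value: "30"
  - name: CHAOS_INTERVAL
    value: "30"
  - name: SEQUENCE
    value: "serial"                   # detach volumes one at a time
```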

ALB AZ Down

The ALB AZ Down experiment detaches availability zones from an Application Load Balancer to test multi-AZ resilience.

Key Parameters:

LOAD_BALANCER_ARN specifies the target load balancer ARN. You can find this in your AWS console or CLI.

ZONES accepts comma-separated zones to detach (e.g., us-east-1a). Choose zones strategically to test failover behavior.

REGION specifies the region name for the target ALB (e.g., us-east-1).

TOTAL_CHAOS_DURATION is typically 30 seconds for ALB experiments, sufficient to test traffic redistribution.

CHAOS_INTERVAL determines the interval between detachment and attachment cycles, usually 30 seconds.

SEQUENCE can be parallel or serial for detaching multiple zones.

Note: A minimum of two AZs must remain attached to the ALB after chaos injection.
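A sketch of the env block for this fault, with the fault name assumed from the experiment title and a placeholder ARN:

```yaml
# env block for an alb-az-down fault
env:
  - name: LOAD_BALANCER_ARN
    value: "arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/my-alb/abc123"  # placeholder
  - name: ZONES
    value: "us-east-1a"               # zone(s) to detach
  - name: REGION
    value: "us-east-1"
  - name: TOTAL_CHAOS_DURATION
    value: "30"
  - name: CHAOS_INTERVAL
    value: "30"
  - name: SEQUENCE
    value: "parallel"
```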

RDS Instance Reboot

The RDS Instance Reboot experiment reboots RDS instances to test database failover and application recovery.

Key Parameters:

CLUSTER_NAME specifies the name of the target RDS cluster. This is required for cluster-level operations.

RDS_INSTANCE_IDENTIFIER sets the name of the target RDS instance within the cluster.

REGION specifies the region name for the target RDS (e.g., us-east-1).

TOTAL_CHAOS_DURATION is typically 30 seconds, though the actual reboot may take longer.

INSTANCE_AFFECTED_PERC determines the percentage of RDS instances to target. Set to 0 to target exactly 1 instance.

SEQUENCE can be parallel or serial for rebooting multiple instances.
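A sketch of the env block for an RDS reboot; the fault name is assumed from the experiment title, and the cluster and instance names are placeholders:

```yaml
# env block for an rds-instance-reboot fault
env:
  - name: CLUSTER_NAME
    value: "orders-aurora-cluster"        # placeholder cluster name
  - name: RDS_INSTANCE_IDENTIFIER
    value: "orders-aurora-instance-1"     # placeholder instance identifier
  - name: REGION
    value: "us-east-1"
  - name: TOTAL_CHAOS_DURATION
    value: "30"
  - name: INSTANCE_AFFECTED_PERC
    value: "0"                            # 0 targets exactly one instance
  - name: SEQUENCE
    value: "serial"
```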

Azure: Testing Your Azure Workloads

For Azure deployments, focus on these key experiments to validate resilience to Azure-specific failures and service disruptions.

Recommended Azure Experiments

Azure Instance Stop simulates VM failure with high impact. This validates that your Azure-based applications can handle unexpected VM termination.

Azure Disk Loss tests disk detachment scenarios with high impact. This is essential for applications with persistent storage on Azure.

Azure Web App Stop validates App Service resilience with medium impact. This tests your PaaS-based applications' ability to handle service disruptions.

Azure Instance Stop

The Azure Instance Stop experiment powers off Azure VM instances to test application resilience to unexpected VM failures.

Key Parameters:

AZURE_INSTANCE_NAMES specifies the names of the target Azure instances. For AKS clusters, use the Scale Set instance name, not the node name from the AKS node pool.

RESOURCE_GROUP sets the name of the resource group containing the target instance. This is required for Azure resource identification.

SCALE_SET should be set to disable for standalone VMs, or enable if the instance is part of a Virtual Machine Scale Set.

TOTAL_CHAOS_DURATION is typically 30 seconds, providing enough time to observe failover without extended disruption.

CHAOS_INTERVAL determines the interval between successive instance power-offs, usually 30 seconds.

SEQUENCE can be parallel or serial for stopping multiple instances.

Tip: For AKS nodes, use the Scale Set instance name from Azure, not the node name from AKS node pool.
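As with the AWS faults, only the env block is sketched here; the fault name is assumed from the experiment title, the Azure names are placeholders, and the cloud credential setup is omitted:

```yaml
# env block for an azure-instance-stop fault
env:
  - name: AZURE_INSTANCE_NAMES
    value: "aks-nodepool1-12345678-vmss_0"   # placeholder Scale Set instance name
  - name: RESOURCE_GROUP
    value: "my-aks-resource-group"           # placeholder resource group
  - name: SCALE_SET
    value: "enable"                          # this instance belongs to a VM Scale Set
  - name: TOTAL_CHAOS_DURATION
    value: "30"
  - name: CHAOS_INTERVAL
    value: "30"
  - name: SEQUENCE
    value: "serial"
```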

GCP: Google Cloud Platform Resilience

For GCP workloads, these experiments validate compute and storage resilience.

Recommended GCP Experiments

GCP VM Instance Stop simulates compute instance failure with high impact. This tests your GCP-based applications' resilience to unexpected instance termination.

GCP VM Disk Loss tests persistent disk detachment with high impact. This validates how your applications handle storage failures on GCP.

GCP VM Instance Stop

The GCP VM Instance Stop experiment powers off GCP VM instances to test application resilience to unexpected instance failures.

Key Parameters:

GCP_PROJECT_ID specifies the ID of the GCP project containing the VM instances. This is required for resource identification.

VM_INSTANCE_NAMES accepts a comma-separated list of target VM instance names within the project.

ZONES specifies the zones of target instances in the same order as instance names. Each instance needs its corresponding zone.

TOTAL_CHAOS_DURATION is typically 30 seconds, sufficient for testing instance failure scenarios.

CHAOS_INTERVAL determines the interval between successive instance terminations, usually 30 seconds.

MANAGED_INSTANCE_GROUP should be set to disable for standalone VMs, or enable if instances are part of a managed instance group.

SEQUENCE can be parallel or serial for stopping multiple instances.

Required IAM Permissions:

Your service account needs compute.instances.get to retrieve instance information, compute.instances.stop to power off instances, and compute.instances.start to restore instances after the experiment.
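A sketch of the env block, with the fault name assumed from the experiment title and placeholder project, instance, and zone values (GCP credential setup omitted):

```yaml
# env block for a gcp-vm-instance-stop fault
env:
  - name: GCP_PROJECT_ID
    value: "my-gcp-project"                 # placeholder project ID
  - name: VM_INSTANCE_NAMES
    value: "payments-vm-1,payments-vm-2"    # placeholder instance names
  - name: ZONES
    value: "us-central1-a,us-central1-b"    # one zone per instance, same order
  - name: TOTAL_CHAOS_DURATION
    value: "30"
  - name: CHAOS_INTERVAL
    value: "30"
  - name: MANAGED_INSTANCE_GROUP
    value: "disable"                        # standalone VMs, not a managed instance group
  - name: SEQUENCE
    value: "parallel"
```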

Experiment Design Best Practices

Now that we've covered the experiments, let's talk about how to run them effectively.

1. Define Clear Hypotheses

Before running any experiment, define what you expect to happen. For example: "When 50% of pods lose network connectivity, the application should continue serving requests with increased latency but no errors."

This clarity helps you know what to measure and when something unexpected happens.

2. Use Resilience Probes

Always configure probes to validate your hypothesis:

HTTP Probes monitor application endpoints to verify they're responding correctly during chaos.

Command Probes check system state by running commands and validating output.

Prometheus Probes validate metrics thresholds to ensure performance stays within acceptable bounds.

Learn more about Resilience Probes.
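For example, an HTTP probe attached to an experiment might look like this sketch. Field names follow recent LitmusChaos probe schemas and may differ slightly between versions; the probe name and URL are placeholders:

```yaml
# probe section under experiments[].spec in a ChaosEngine
probe:
  - name: frontend-availability       # illustrative probe name
    type: httpProbe
    mode: Continuous                  # evaluate throughout the chaos window
    httpProbe/inputs:
      url: "http://frontend.production.svc.cluster.local:8080/health"   # placeholder URL
      method:
        get:
          criteria: "=="
          responseCode: "200"         # hypothesis: the endpoint keeps returning 200 during chaos
    runProperties:
      probeTimeout: 5s
      interval: 2s
      attempt: 1
```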

3. Gradual Blast Radius Increase

Follow this progression:

Single Pod/Container experiments test individual component resilience. Start here to understand how your smallest units behave.

Multiple Pods validate load balancing and failover at the service level. This ensures traffic distributes correctly.

Node Level tests infrastructure resilience by affecting entire nodes. This reveals cluster-level behaviors.

Zone Level validates multi-AZ deployments by simulating complete zone failures. This is your ultimate resilience test.

4. Schedule Regular Experiments

Make chaos engineering a continuous practice:

Weekly: Run low-impact experiments like pod delete and network latency. These keep your team sharp and validate recent changes.

Monthly: Execute medium-impact experiments including node failures and resource exhaustion. These catch configuration drift.

Quarterly: Conduct high-impact scenarios like zone failures and major service disruptions. These validate your disaster recovery plans.

Use GameDays to organize team chaos engineering events.

5. Monitor and Alert

Ensure proper observability during experiments:

Configure alerts for critical metrics before running experiments. You want to know immediately if something goes wrong.

Monitor application logs in real-time during experiments. Logs often reveal issues before metrics do.

Track infrastructure metrics including CPU, memory, and network utilization. These help you understand resource consumption patterns.

Use Chaos Dashboard for visualization and real-time monitoring of your experiments.

Getting Started

The best way to get started with chaos engineering is to pick one experiment that addresses your biggest concern. Are you worried about network reliability? Start with Pod Network Loss. Concerned about failover? Try Pod Delete or EC2 Stop.

Run the experiment in a test environment first. Observe what happens. Refine your hypothesis. Then gradually move toward production environments as you build confidence.

The Resilience Probes, GameDays, and Chaos Dashboard documentation linked above is a good place to continue your chaos engineering journey.

Remember, chaos engineering isn't about breaking things for the sake of breaking them. It's about understanding your system's behavior under stress so you can build more resilient applications. Start small, learn continuously, and gradually expand your chaos engineering practice.

What failure scenarios keep you up at night? Those are probably the best experiments to start with.

Ashutosh Bhadauriya

Senior Developer Relations Engineer
