
New Relic probes in Harness Chaos Engineering let you automatically validate system performance against defined SLOs during chaos experiments, transforming subjective testing into objective, metrics-driven resilience validation. By querying New Relic metrics in real time and comparing results against your success criteria, you can programmatically verify that your systems maintain acceptable performance levels even under failure conditions. This approach enables automated reliability testing in CI/CD pipelines, catching performance regressions before they reach production.
When running chaos experiments in production environments, observability isn't optional. You need real-time insights into how your system behaves under stress. This is where the New Relic probe in Harness Chaos Engineering becomes essential.
The New Relic probe lets you define metrics-based Service Level Objectives (SLOs) right within your chaos experiments. Instead of manually checking dashboards or waiting for alerts, you can automatically query New Relic metrics during an experiment and use those results to determine whether your system passed or failed the chaos test.
Why Use APM Probes in Chaos Engineering?
Think of probes as your experiment's eyes and ears. While you're injecting faults like pod deletions or network latency, probes continuously monitor application performance metrics. The New Relic probe specifically helps you answer questions like:
- Did response time stay within acceptable limits during the chaos?
- Were error rates below your SLO threshold?
- Did database query performance degrade beyond acceptable levels?
By codifying these checks as part of your experiment definition, you move from subjective chaos testing to objective, repeatable validation of system resilience.
What You'll Need
Before setting up a New Relic probe, make sure you have:
- An active New Relic account with your application already sending metrics
- Network access to the New Relic NerdGraph API from your Kubernetes execution plane
- A New Relic User API key (not a License key; User keys are required for NerdGraph authentication)
- Your New Relic account ID
Setting Up Your New Relic Probe
Creating the Probe
Start by navigating to the Probes & Actions section in the Harness Chaos module. Click New Probe and select APM Probe from the options. Give your probe a name and choose New Relic as the APM type.

Configuring the New Relic Connector
The connector handles authentication with New Relic. You can reuse an existing connector or create a new one. If you're creating a new connector, you'll need to provide:

Basic Information
Give your connector a name. Adding a description and tags helps with organization, especially if you're managing multiple environments or teams.
Connection Details
The New Relic URL depends on your account region. Use https://api.newrelic.com/graphql for US accounts or https://api.eu.newrelic.com/graphql for EU accounts. The integration uses the NerdGraph API, which provides more flexible querying capabilities than older REST APIs.
You'll also need your New Relic account ID, which you can find in the New Relic UI or extract from your account URL.
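Under the hood, the probe issues its NRQL queries through NerdGraph, which wraps them in a GraphQL request. A minimal Python sketch of what that request body looks like can be useful for understanding or manually verifying your setup; the account ID and NRQL string below are placeholders, and the probe builds this payload for you:

```python
import json

def nerdgraph_payload(account_id: int, nrql: str) -> dict:
    """Build the GraphQL body NerdGraph expects for an NRQL query."""
    query = (
        '{ actor { account(id: %d) { '
        'nrql(query: "%s") { results } } } }' % (account_id, nrql)
    )
    return {"query": query}

# Example: the body you could POST to https://api.newrelic.com/graphql
# with an "API-Key" header containing your User key.
body = nerdgraph_payload(
    1234567, "SELECT average(duration) FROM Transaction SINCE 5 minutes ago"
)
print(json.dumps(body))
```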

Authentication
For the API key, create or select a secret containing your New Relic User API key. This is crucial: you must use a User key specifically, not a License key. User keys have the proper permissions for NerdGraph API queries. You can generate one in your New Relic account settings under API keys.
After entering these details, select a delegate and verify the connection works before finishing the connector setup.
Defining Your Probe Logic
Once your connector is configured, the next step is defining what metrics to monitor and what constitutes success or failure.
Writing Your NRQL Query
New Relic uses NRQL (New Relic Query Language), a SQL-like language for querying telemetry data. Your query should target the specific metric you want to monitor during chaos.
For example, to monitor average response time:
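A representative query might look like this, with the application name as a placeholder for your own service:

```sql
SELECT average(duration) FROM Transaction WHERE appName = 'checkout-service' SINCE 2 minutes ago
```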
Or to check error rates:
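For instance, using a placeholder application name again:

```sql
SELECT percentage(count(*), WHERE error IS true) FROM Transaction WHERE appName = 'checkout-service' SINCE 2 minutes ago
```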
The key is making your query specific to the service being tested and the time window relevant to your chaos experiment duration.
Extracting the Metric
After defining your query, specify which metric field to extract from the results. If your query uses SELECT average(duration), the metric name would be average.duration. For the error rate example above, you might use percentage.
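The naming follows from how NRQL results come back from NerdGraph: aggregate fields are flattened into dotted keys. An illustrative response fragment (values are made up) shows where a metric like average.duration appears:

```json
{
  "data": {
    "actor": {
      "account": {
        "nrql": {
          "results": [
            { "average.duration": 0.412 }
          ]
        }
      }
    }
  }
}
```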
Setting Success Criteria
Now define what values indicate a healthy system. You'll specify:
- The data type (Float or Int)
- A comparison operator (>=, <=, ==, !=, >, <)
- The threshold value
For instance, you might say average response time must be <= 500 (milliseconds), or error rate must be < 1 (percent).
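The comparison itself is straightforward. A minimal Python sketch of the evaluation logic, where the operator names and types mirror the options above rather than Harness's internal implementation:

```python
import operator

# Map the probe's comparison operators onto Python's.
OPS = {
    ">=": operator.ge, "<=": operator.le, "==": operator.eq,
    "!=": operator.ne, ">": operator.gt, "<": operator.lt,
}

def check(value: float, criteria: str, threshold: float) -> bool:
    """Return True when the observed metric satisfies the success criteria."""
    return OPS[criteria](value, threshold)

print(check(420.0, "<=", 500.0))  # 420 ms response time against a 500 ms SLO
print(check(2.3, "<", 1.0))       # 2.3% error rate against a 1% SLO
```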
Configuring Probe Execution
Finally, set the runtime behavior:
Timeout defines how long a single probe execution can run before it is aborted. Set this based on your query complexity; 10-30 seconds is typical.
Interval controls how often the probe runs during the experiment. For continuous metric monitoring, you might set this to 2-5 seconds.
Attempt specifies how many times a failed probe execution is retried, which helps ride out network issues or temporary API unavailability.
Polling Interval determines the wait time between attempts, useful for backing off on retries.
Initial Delay lets you wait before starting probe checks, giving your chaos injection time to propagate through the system.
Verbosity controls log detail, helpful for debugging probe behavior.
You can also enable Stop On Failure if you want the entire experiment to halt immediately when the probe detects a problem, rather than continuing through the full chaos duration.
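Taken together, these runtime settings map onto a handful of fields in the probe definition. A hedged sketch in manifest form, using field names from the open-source Litmus probe schema that Harness Chaos Engineering builds on (your exact schema and field names may differ):

```yaml
runProperties:
  probeTimeout: 15s        # single execution budget
  interval: 5s             # how often the probe runs
  attempt: 3               # retries on transient failures
  probePollingInterval: 2s # wait between retries
  initialDelay: 10s        # let the fault propagate first
  stopOnFailure: true      # halt the experiment on first breach
```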
Putting It All Together
Once configured, your New Relic probe becomes an integral part of your chaos experiment's resilience validation. During execution, the probe will:
- Query New Relic at your specified intervals
- Extract the target metric from query results
- Compare it against your defined criteria
- Mark each check as pass or fail
- Contribute to the overall experiment verdict
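The steps above can be sketched as a simple loop. Here query_fn is a hypothetical stand-in for the NerdGraph call and the sample values are illustrative; the real probe handles retries, delays, and verdict aggregation for you:

```python
import time

def run_probe(query_fn, criteria_fn, checks: int, interval_s: float) -> bool:
    """Run periodic checks; the probe passes only if every check passes."""
    results = []
    for _ in range(checks):
        value = query_fn()                 # e.g. NRQL query via NerdGraph
        results.append(criteria_fn(value)) # compare against success criteria
        time.sleep(interval_s)
    return all(results)

# Simulated metric stream: response times (ms) observed during chaos.
samples = iter([410.0, 460.0, 480.0])
verdict = run_probe(lambda: next(samples), lambda v: v <= 500.0,
                    checks=3, interval_s=0)
print("Pass" if verdict else "Fail")  # prints "Pass"
```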
This transforms chaos engineering from a manual, observation-based practice into an automated, metrics-driven validation process. You're not just injecting failures and hoping everything looks okay. You're programmatically verifying that your SLOs hold up under adverse conditions.
The real power emerges when you integrate these probes into your CI/CD pipeline. Every deployment can include chaos experiments with New Relic probes validating that performance characteristics remain within bounds even under failure scenarios. This catches regressions before they reach production and builds confidence in your system's resilience.
Getting Started
Ready to add metrics-based validation to your chaos experiments? Start by identifying your most critical SLOs, then use New Relic probes to enforce them during controlled chaos testing. The combination of deliberate failure injection with automated observability checks gives you confidence that your systems can handle whatever production throws at them.
Remember, chaos engineering isn't about breaking things randomly. It's about systematically validating that your systems behave correctly under known failure conditions. New Relic probes give you the measurement layer that turns chaos experiments from interesting demos into rigorous reliability tests.

