Product | Cloud costs | Released June 5, 2023 | 3 min read

Chaos Engineering with Jenkins


Harness Chaos Engineering (Harness CE) provides the end-to-end tooling required to conceptualize, design, develop, run, and analyze chaos experiments that verify resilience at every stage of DevOps or the SDLC. Other CI/CD tools, such as Jenkins, can add these chaos experiments to their pipelines. This blog shows how to integrate Harness Chaos Engineering with Jenkins pipelines.

Benefits of Running Chaos Experiments in CD Pipelines

Organizations run chaos experiments in CD pipelines for the following reasons.

CD ROI vs. Resilience Unknowns

Enterprises have invested in CD pipeline technology to achieve higher agility with lower developer toil. Is this ROI being offset by a potential loss of reliability? Instead of leaving it to chance, verify the resilience of the code you ship against likely chaos scenarios to get the most out of your CD investment.

Increase Developer Efficiency

Are your developers staying up to date on design and architecture changes? A successful functional test in a pipeline does not guarantee that the design, architecture, or dependency changes have been validated, and not all developers understand these changes deeply. Well-written chaos experiments bring gaps in these areas to the developers' immediate attention by breaking the pipeline. Developers can then address design or implementation gaps early, in the pipeline, rather than in production at a higher cost.

Reduce Resilience Debt

Just like tech debt, resilience debt can keep building up in your production services. Alerts and incidents get registered in your production environment and end up in either the resolved or the to-be-watched queue. These alerts and incidents sometimes result in a hotfix, a hot code patch, or a config change. Often, developers just add a workaround such as increasing memory, adding CPU, or adding more nodes. In both cases, product teams can act on the feedback by adding relevant chaos tests to the pipelines as verification. Policies can be configured or enforced so that relevant chaos experiments are added before pipelines are approved for deployment. For example, a policy can state that every incident or alert caused by a misbehaving component, network loss, network slowness, external APIs not responding as expected, higher load, and so on must have a corresponding chaos experiment validated in a pipeline simulation within 60 days of the incident or alert. Such a policy brings discipline to keeping resilience debt in check. Developers and QA teams are pushed to focus on what needs to be fixed in the production code rather than continuing to push new capabilities into production.
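The 60-day window in the example policy can be checked mechanically. The following is a minimal sketch, assuming incident timestamps are available as Unix epoch seconds. The function name policy_status and its arguments are hypothetical and are not part of Harness CE.

```shell
#!/bin/sh
# Hypothetical helper, not a Harness CE feature.
# Reports whether an incident is still inside the policy window for adding
# a corresponding chaos experiment.
# Arguments: incident time and current time (Unix epoch seconds), and an
# optional window in days (defaults to 60, as in the example policy).
policy_status() {
    incident_epoch=$1
    now_epoch=$2
    window_days=${3:-60}
    # Whole days elapsed since the incident (86400 seconds per day).
    elapsed_days=$(( (now_epoch - incident_epoch) / 86400 ))
    if [ "$elapsed_days" -gt "$window_days" ]; then
        echo "overdue"
    else
        echo "within-window"
    fi
}
```

A scheduled job could call this for each open incident that has no linked chaos experiment and report the ones marked overdue.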

Use Cases for Chaos Experimentation in CD Pipelines

CD pipelines run when there are new code changes to deploy. Functional and integration tests ensure that the recent changes do not break expected functionality and that new features work as expected. In addition to verifying the new code functionality, pipeline runs provide an opportunity to verify the following resilience scenarios.

Use cases for chaos engineering in CD pipelines

Validate the deployments against the existing resilience conditions

Resilience coverage in a pipeline typically starts low, and you can increase it only once the existing chaos experiments are well automated. Every change deployed through the pipeline is tested against the already-known resilience conditions. This is about ensuring that the resilience score stays stable in the target environment.

Validate the deployments against the newly added resilience conditions

While the code changes being deployed are one dimension, new resilience tests or chaos experiments added to the pipeline increase the significance of a pipeline run. This is about increasing resilience coverage. If increasing the coverage lowers the resilience score, you have found a potential weakness that has not yet caused an outage in production but is worth looking at now. Developers then examine the failed chaos experiment and decide whether to stop the pipeline or to let it proceed and take action in parallel, such as documenting a recovery scenario, leaving a note for the SREs, or suggesting a config change.

Changes to the platform on which the target deployments run

A change in the underlying platform, such as a Kubernetes version upgrade, puts extra weight on the testing that happens in the pipeline. The resilience score may drop, indicating that new potential weaknesses have surfaced on the upgraded platform.

Validate the deployments against production incidents and alerts

This use case requires coordination between the SRE/Ops teams and the pipeline builders to add resilience tests related to recently found incidents or alerts. Each incident or alert can potentially result in a new chaos experiment if not already added to the database. 

Validate the deployments against configuration changes

If the new changes are configuration changes within the software or in the target infrastructure, the resilience tests can reveal new potential weaknesses. This is another scenario in which resilience tests that passed earlier may start failing, because the target environment now runs with a higher or lower configuration.

A Step-by-step Guide to Introducing Chaos Experiments into Jenkins CI/CD Pipelines

Developers create chaos experiments in the chaos project. The chaos experiments are then pulled into the pipeline as steps. Each run of a chaos step results in either meeting the expected resilience score or failing to meet it, at which point the configured failure strategy of the pipeline stage can be invoked.

Adding Harness CE chaos experiments into Jenkins CI/CD pipelines

Step 1:

Create a chaos experiment and run it to make sure it runs to completion. Add the relevant probes to avoid false positives or false negatives in the resilience score.

Create a chaos experiment in Harness CE

Step 2:

Prepare the API command that launches the chaos experiment

Harness CE provides APIs to run or launch an experiment. The API call requires many parameters, which the user must append and format before execution. Harness provides an easy-to-use CLI, called hce-api-linux, to prepare this API call. You can download it for Linux from here.

Here is an example of the usage of the hce-api-linux CLI in creating a chaos-launch script of a chaos experiment.

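#!/bin/bash

set -e

# Download the Harness CE API CLI (-f makes curl fail on HTTP errors) and make it executable.
curl -fsL https://storage.googleapis.com/hce-api/hce-api-linux-amd64 -o hce-api-saas
chmod +x hce-api-saas

# Build the launch-experiment API call with the CLI and extract the notify ID
# of the run from the JSON output.
output=$(./hce-api-saas generate --api launch-experiment --account-id="${ACCOUNT_ID}" \
    --project-id "${PROJECT_ID}" --workflow-id "${WORKFLOW_ID}" \
    --api-key "${API_KEY}" --file-name hce-api.sh | jq -r '.data.runChaosExperiment.notifyID')

# Print the notify ID so that the calling pipeline can capture it.
echo "${output}"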

As the sample script shows, you need to provide some parameters to generate a working API call. These parameters are:

API_KEY: The user's API key at Harness. For details about how users' API keys are managed, see the documentation here.

WORKFLOW_ID: The unique ID of the chaos experiment in the project under a specific account. You can copy it from the chaos experiment table in the Harness CE module.

ACCOUNT_ID: The account ID on Harness, which can be found on the Account Settings tab.

PROJECT_ID: The ID of the project in the chaos module where the chaos experiment is created. It appears under the project name when the projects are listed.
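Because the launch script depends on all four parameters, it can help to fail fast when one of them is missing. The following is a minimal sketch of such a pre-flight check. The function name require_env is hypothetical and is not part of the Harness CLI.

```shell
#!/bin/sh
# Hypothetical pre-flight check, not part of the Harness CLI.
# Fails if any of the named environment variables is unset or empty.
require_env() {
    missing=""
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "Missing required parameters:$missing" >&2
        return 1
    fi
}

# Possible usage at the top of a launch script:
# require_env ACCOUNT_ID PROJECT_ID WORKFLOW_ID API_KEY || exit 1
```

Checking the parameters before downloading or invoking the CLI keeps the error message close to its cause instead of surfacing later as a malformed API call.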

Step 3:

Prepare the API command that retrieves the result of the chaos experiment run (the resilience score)

Using the same CLI with different command parameters, you can retrieve the latest result of a chaos experiment run. The retrieved result is the latest resilience score. This score is then used to take the appropriate action, such as moving to the next stage or step in the pipeline or invoking a failure strategy such as rollback.
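The decision based on the score can be expressed as a small gate. The following is a minimal sketch, assuming the resilience score is a whole number from 0 to 100. The function name resilience_gate and the threshold argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical gate, assuming the score is a whole number from 0 to 100.
# Prints the action to take and returns non-zero when a rollback is needed.
# Arguments: the retrieved resilience score and the expected minimum score.
resilience_gate() {
    score=$1
    expected=$2
    case $score in
        ''|*[!0-9]*)
            # A missing or non-numeric score is treated as a failed check.
            echo "rollback"
            return 1
            ;;
    esac
    if [ "$score" -ge "$expected" ]; then
        echo "proceed"
    else
        echo "rollback"
        return 1
    fi
}
```

For example, resilience_gate 80 70 prints proceed, while resilience_gate 50 70 prints rollback and returns a non-zero status that a pipeline step can act on.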

Here is an example of retrieving the result of a chaos experiment run.

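#!/bin/bash

set -e

# Download the Harness CE API CLI (-f makes curl fail on HTTP errors) and make it executable.
curl -fsL https://storage.googleapis.com/hce-api/hce-api-linux-amd64 -o hce-api-saas
chmod +x hce-api-saas

# Build the validate-resilience-score API call for the run identified by the
# notify ID passed as the first argument.
resiliencyScore=$(./hce-api-saas generate --api validate-resilience-score --account-id="${ACCOUNT_ID}" \
    --project-id "${PROJECT_ID}" --notifyID="$1" \
    --api-key "${API_KEY}" --file-name hce-api.sh)

# Print the resilience score so that the calling pipeline can capture it.
echo "${resiliencyScore}"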

Step 4:

Use the prepared API commands in Jenkins scripts to run the chaos experiments

The two commands from the previous steps let you run a chaos experiment and retrieve the result of that run. Add these API commands to the chaos step or chaos stage of your Jenkins pipeline script. An example is given below.

stage('Launch Chaos Experiment') {
    steps {
        // Launch the experiment and save the notify ID printed by the script.
        sh '''
            sh scripts/launch-chaos.sh > n_id.txt
        '''
        script {
            env.notify_id = sh(returnStdout: true, script: 'cat n_id.txt').trim()
        }
    }
}

stage('Monitor Chaos Experiment') {
    steps {
        // Track the experiment run identified by the notify ID.
        sh '''
            sh scripts/monitor-chaos.sh ${notify_id}
        '''
    }
}

stage('Verify Resilience Score') {
    steps {
        // Retrieve the resilience score of the run and save it.
        sh '''
            sh scripts/verify-rr.sh ${notify_id} > r_s.txt
        '''
        script {
            env.resilience_score = sh(returnStdout: true, script: 'cat r_s.txt').trim()
        }
    }
}

stage('Take Rollback Decision') {
    steps {
        // Pass the score to the rollback script, which decides what to do.
        sh '''
            echo ${resilience_score}
            sh scripts/rollback-deploy.sh ${resilience_score}
        '''
    }
}
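The pipeline above calls scripts/monitor-chaos.sh, which is not shown in this post. One way to structure such a script is a polling loop with an attempt limit. The following is a minimal sketch of that loop only. The function name poll_until is hypothetical, and the command that checks whether the experiment has finished is left to the caller because it is not covered here.

```shell
#!/bin/sh
# Hypothetical polling helper, not part of Harness CE or Jenkins.
# Runs a check command until it succeeds or the attempt limit is reached.
# Arguments: maximum attempts, delay in seconds between attempts, then the
# check command and its arguments.
poll_until() {
    max_attempts=$1
    delay_seconds=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            echo "completed after $attempt attempt(s)"
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            break
        fi
        attempt=$((attempt + 1))
        sleep "$delay_seconds"
    done
    echo "timed out after $max_attempts attempt(s)" >&2
    return 1
}
```

A monitoring script could call it as poll_until 30 10 followed by a command that exits with status 0 once the experiment run has finished, so that the Jenkins stage fails when the run does not complete in time.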

Summary

Introducing chaos experiments into pipelines improves the reliability of business-critical services, improves developer productivity, and largely mitigates the risk of unhandled unknowns. Use Jenkins and Harness CE together to introduce resilience tests into the deployment stages of your pipelines.

Get started with Harness Chaos Engineering

Harness Chaos Engineering is built with the above building blocks needed to roll out the Continuous Resilience™ approach to chaos engineering. It comes with many out-of-the-box faults, security governance, chaos hubs, the ability to integrate with CD pipelines and Feature Flags, and more.



Chaos Engineering
Continuous Delivery & GitOps