Pipeline Performance: How We Optimized With Load Tests
As Harness continues to grow at rocket pace, we needed to optimize our platform for performance. Here's how we did it.
This article on pipeline performance was written in collaboration by Naidu Annepu, Prashant Pal, and Sahil Hindwani.
Harness is growing quickly in the DevOps space, and with that growth comes the need for better scalability and performance. To meet the needs of our rapidly expanding customer base, we set out to measure and improve the performance of our pipeline execution. We ran load tests against our services and made changes between runs to improve pipeline performance further. The main goal was to make our system more performant and scalable.
With these optimizations, we:
- Made our executions faster. We observed a 50% increase in performance.
- Provided more real-time visualization of your execution.
Curious how we achieved this? Read along to find out more about our approach and learnings.
Before starting our performance experiment, we gathered some data around how much scale we should aim for. We calculated the average number of builds we generate, the number of PR checks we run, the number of deployments we do, and any interaction we have with our tool. With this, we could get an estimate of the scale for a mid-size organization. We extrapolated the above information and tested our platform scale to 100x the current load.
We were aiming for the following scale:
- 400k pipelines per day.
- 51 million events/day during execution.
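To make these targets easier to reason about, here is a small calculation that converts the daily figures into average per-second rates. It assumes the load is spread evenly over 24 hours, which is a simplification; real traffic is bursty, so peak rates would be higher than these averages.

```python
# Convert the daily scale targets into average per-second rates.
# Assumption: load is spread evenly across the day (real traffic is burstier).
SECONDS_PER_DAY = 24 * 60 * 60

pipelines_per_day = 400_000
events_per_day = 51_000_000

pipelines_per_second = pipelines_per_day / SECONDS_PER_DAY
events_per_second = events_per_day / SECONDS_PER_DAY
events_per_pipeline = events_per_day / pipelines_per_day

print(f"{pipelines_per_second:.1f} pipelines/s")         # 4.6 pipelines/s
print(f"{events_per_second:.0f} events/s")               # 590 events/s
print(f"{events_per_pipeline:.1f} events per pipeline")  # 127.5 events per pipeline
```

In other words, the targets work out to roughly 5 pipeline starts and about 590 events every second on average.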
Time to Put on Some Load
Once our scale estimation exercise was over, it was time to put some load on our services. We evaluated a few load testing tools and decided to use Locust.
Adding stress to a system is meaningless without monitoring in place. For this, we used our Continuous Verification module to see how our services behaved under load. We also used OpenCensus to publish metrics to GCP, which helped us understand where we were spending the most time and where we could optimize.
We ran numerous rounds of load tests with Locust. After each round, we made a few optimizations to our services and then ran the load again.
Some screenshots from our monitoring dashboards are below.
First run of the experiment:
Final run of the experiment:
At Harness, our executions are completely event-driven. At the start of this activity, our event framework was built on Mongo queues that we had inherited and adapted to our use case by building a wrapper and framework around them.
Though Mongo queues had served us well in the past, they limited performance for our use case in two ways:
- Degradation because of write conflicts on the Mongo instances. The Mongo queue is a shared resource. Since we run multiple instances of each service and only one instance should process a given event, an instance has to take a lock on the event entry inside Mongo. To do that, it updates the entry so that other instances know it is being processed. With many instances, this caused too many write conflicts and slowed things down.
- Slow reads from Mongo, which were one of the major causes of the increase in our execution time. As explained above, we have to update the entry inside Mongo to let other instances know whether it is being processed. Because this is a read-write operation, it had to go through the master only, so the replicas/slaves could not take any of the load for our events.
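The claim-by-update pattern described in the list above can be sketched as follows. This is an in-memory stand-in, not our framework and not real Mongo code: the `SharedQueue` class and its field names are hypothetical, and a `threading.Lock` plays the role of the database serializing writes to an entry. With a real Mongo collection, the equivalent step would be an atomic find-and-update on the primary.

```python
import threading


class SharedQueue:
    """In-memory stand-in for a collection used as a shared queue."""

    def __init__(self, event_ids):
        self._lock = threading.Lock()  # stands in for serialized writes
        self._entries = [{"id": e, "claimed_by": None} for e in event_ids]

    def claim_next(self, worker):
        # Claiming an event is a write: the entry is updated so that other
        # instances can see it is already being processed.
        with self._lock:
            for entry in self._entries:
                if entry["claimed_by"] is None:
                    entry["claimed_by"] = worker
                    return entry["id"]
        return None  # nothing left to claim


def run_worker(queue, worker, claimed):
    # Keep claiming events until the queue is empty.
    while (event_id := queue.claim_next(worker)) is not None:
        claimed.append(event_id)


queue = SharedQueue(range(1000))
claimed = []
workers = [
    threading.Thread(target=run_worker, args=(queue, f"instance-{i}", claimed))
    for i in range(4)
]
for w in workers:
    w.start()
for w in workers:
    w.join()

# Every event was claimed exactly once, but every claim had to go
# through the same write path, which is where contention builds up.
print(len(claimed), len(set(claimed)))
```

The point of the sketch is that all instances funnel through a single write path to claim work, so adding instances adds contention rather than throughput.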
With the above limitations, Mongo wasn’t a good fit for our use case, so we realized that we needed to look at other queuing systems. We evaluated Kafka and Redis Streams, and decided on Redis Streams because it fit our use case well and we were already using Redis for other things.
Nobody is perfect, and neither was Redis in our case. Redis provides many things out of the box, but we faced some issues while migrating to it:
- Redis does not guarantee that an event will be processed only once. It is based on the ‘at least once’ paradigm, so by design the same event can be delivered more than once. To achieve our goal, we made a few changes in our framework so that the same event is not processed twice.
- Size was an issue in Redis, though it had not been in Mongo. Redis is an in-memory store, and storing huge chunks of data in memory was not a good option. Our team stepped up and made our events lighter.
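Both fixes in the list above can be illustrated with a short sketch. This is not our actual framework: the `Event` and `IdempotentConsumer` names are hypothetical, the set of seen IDs lives in local memory, and a plain dictionary stands in for wherever the full payload is stored. In a multi-instance deployment the record of processed IDs would have to be shared between instances, which is an assumption this sketch does not implement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A lightweight event: an ID plus a reference, not the full payload."""
    event_id: str
    payload_ref: str  # hypothetical key into an external payload store


class IdempotentConsumer:
    def __init__(self, payload_store):
        self._payload_store = payload_store
        self._seen = set()   # IDs of events already handled
        self.processed = []  # (event_id, step) pairs, for illustration

    def handle(self, event):
        # With at-least-once delivery the same event may arrive again,
        # so redelivered events are skipped instead of reprocessed.
        if event.event_id in self._seen:
            return False
        self._seen.add(event.event_id)
        # The heavy data is fetched by reference only when it is needed.
        payload = self._payload_store[event.payload_ref]
        self.processed.append((event.event_id, payload["step"]))
        return True


# The large payload stays outside the stream; the event only points to it.
payload_store = {"exec/1": {"step": "build", "logs": "x" * 10_000}}
consumer = IdempotentConsumer(payload_store)
event = Event(event_id="evt-1", payload_ref="exec/1")

results = [consumer.handle(event), consumer.handle(event)]
print(results)  # [True, False]
```

The second delivery of `evt-1` is ignored, and the event that travels through the stream carries only two short strings regardless of how large the underlying payload is.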
Despite these issues, we completed the migration successfully and reran our experiment against Redis. To our surprise, we observed a huge performance gain.
A few gotchas we ran into when using Redis:
- Redis could sometimes cause an increase in CPU if not configured properly.
- In a replica-based environment, the order of events should not matter much to your processing. If it does, then events might not be the best choice for you.
Here's a nifty graph on performance pre and post optimization. As you can see, the results are impressive!
We’re thrilled with the results we received after optimizing for pipeline performance. It’s amazing what load testing and Redis can do! Are you looking for a performant CI/CD solution? Take Harness out for a spin today.