November 21, 2019

Move Fast and Fix Faster with Harness 24/7 Service Guard


It’s been a couple of years since Harness started deploying to production every day.
Our goal was not only to sell Continuous Delivery, but also to lead by example and eat our own dog food on our own platform.
We needed to rely on extensive automation and monitoring to maintain our high velocity — and not give our on-call team a heart attack.
If you are lucky enough to work in an organization that has made these investments, you can: 

  1. Deploy your services to a DEV or QA environment.
  2. Run your automation/performance suite.
  3. Use Harness’ machine learning to flag anomalies before your services make it to production. 

You can truly achieve our mission statement: "Move fast and don’t break things!"
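To make step 3 above a bit more concrete, here is a minimal, hypothetical sketch in Python (with made-up numbers and fixed thresholds, not Harness’ actual algorithm) of the kind of gate that compares a new build’s metrics from a test run against a known-good baseline and halts the pipeline when things look anomalous:

```python
import statistics

def verification_gate(baseline_ms, candidate_ms, max_ratio=1.3, max_sigma=3.0):
    """Toy gate: flag the new build if its response times look anomalous
    compared with a known-good baseline run."""
    base_mean = statistics.mean(baseline_ms)
    base_std = statistics.pstdev(baseline_ms) or 1e-9    # avoid divide-by-zero
    candidate_mean = statistics.mean(candidate_ms)

    ratio = candidate_mean / base_mean                   # relative slowdown
    sigma = (candidate_mean - base_mean) / base_std      # std-devs from baseline

    healthy = ratio <= max_ratio and sigma <= max_sigma
    return healthy, {"ratio": round(ratio, 2), "sigma": round(sigma, 2)}

# Baseline run vs. the latest QA run of a critical transaction (made-up data).
baseline = [110, 120, 115, 118, 112, 121]
latest = [180, 175, 190, 170, 185, 178]
ok, details = verification_gate(baseline, latest)
print("promote" if ok else f"flag and halt the pipeline: {details}")
```

In Harness itself this comparison is driven by machine learning rather than hard-coded thresholds; the sketch only shows where such a gate sits in the flow.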
But let’s be honest here: unless you're deploying straightforward applications, it is not possible to fully simulate production in QA.
The breadth and depth of Harness' capabilities as a deployment platform simply rule this out, and we can all agree this is a common theme across large enterprises with complicated applications.
So what should our motto truly be? We want to move fast for sure, but we also want to fix things faster. At Harness, we designed our 24/7 Service Guard product to do exactly this.
There are 3 main building blocks to achieve this:

  1. You need Harness, of course :) - How else can you get Continuous Delivery and 24/7 Service Guard capability? If you are thinking, “Wait, I have APM,” you need to read “What Ails APM Today.”
  2. You need to identify and collect the application signals for your production environments. These signals are fed into 24/7 Service Guard to flag anomalies using its machine learning. Here are the signals that work well for most applications:

Transactional metrics: You can see an example of how we catch the response time increase for one of our critical transactions below:


Here is another example, where the overall application response time starts to deteriorate:


If you are an astute observer, you would have asked, “What does the heatmap do?” Great question. It represents risk, color-coded from green to red, for each 15-minute period. The warmer the color, the greater the risk.
One look tells you all you need to know about the health of your response time over the last hour:
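If you want a mental model for how such a heatmap strip could be assembled, here is a small illustrative sketch. The 15-minute windowing matches the description above, while the risk thresholds and the “worst minute wins” aggregation are assumptions for the example, not how Service Guard actually scores risk:

```python
def to_heatmap(per_minute_risk, window_minutes=15):
    """Collapse per-minute risk scores (0.0 = healthy, 1.0 = bad) into
    color-coded 15-minute heatmap cells."""
    cells = []
    for i in range(0, len(per_minute_risk), window_minutes):
        window = per_minute_risk[i:i + window_minutes]
        risk = max(window)            # assume the worst minute dominates a cell
        if risk < 0.3:
            color = "green"
        elif risk < 0.6:
            color = "yellow"
        elif risk < 0.8:
            color = "orange"
        else:
            color = "red"
        cells.append((round(risk, 2), color))
    return cells

# One hour of per-minute scores -> four cells, like the strip in the screenshot.
hour_of_risk = [0.1] * 30 + [0.4] * 15 + [0.9] * 15
print(to_heatmap(hour_of_risk))
```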


JVM metrics:


I am going to stop drawing circles on the pictures. I think by now you can easily see that the thread count started climbing at 7:30 am and nearly doubled in a short period.
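For readers wondering where numbers like thread counts come from, JVM metrics such as these are typically exposed by the service itself. Here is a hedged sketch that assumes a Dropwizard/Codahale-style admin endpoint serving registered metrics as JSON; the URL and the gauge name are illustrative, not Harness’ actual configuration:

```python
import requests

# Illustrative only: assumes the service exposes a Dropwizard/Codahale admin
# endpoint that returns all registered metrics as JSON. The URL and the
# "jvm.threads.count" gauge name are assumptions.
METRICS_URL = "http://localhost:8081/metrics"

resp = requests.get(METRICS_URL, timeout=5)
resp.raise_for_status()
gauges = resp.json().get("gauges", {})

thread_count = gauges.get("jvm.threads.count", {}).get("value")
print(f"current JVM thread count: {thread_count}")
```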
Database metrics: We use Mongo and TimescaleDB. Metrics such as opcounters and document inserts/deletes are useful when monitoring Mongo.
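As an example of collecting those Mongo signals, the serverStatus admin command exposes cumulative opcounters that a monitor can sample and diff over time. A minimal PyMongo sketch (the connection string is a placeholder):

```python
from pymongo import MongoClient

# The connection string is a placeholder; serverStatus is a standard MongoDB
# admin command and "opcounters" is part of its output.
client = MongoClient("mongodb://localhost:27017")
status = client.admin.command("serverStatus")

# opcounters are cumulative, so a monitor would diff successive samples.
opcounters = status["opcounters"]
for op in ("insert", "query", "update", "delete"):
    print(f"{op}: {opcounters[op]}")
```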


I am getting tired of grabbing screenshots. Let me put all the other signals here:

  • Queue sizes
  • Application internal metrics from the Codahale metrics registry
  • Infrastructure metrics: CPU, memory, etc. from Stackdriver (we are a GCP shop)
  • Application logs from Stackdriver (again GCP)
  • Container restarts (a minimal check is sketched after this list)
  • Gateway/load balancer metrics
  • Bugsnag logs for UI
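As referenced in the container-restarts bullet, here is a minimal sketch of that check using the official Kubernetes Python client; the namespace is a placeholder, and this is just one of many ways to surface the signal:

```python
from kubernetes import client, config

# Uses the official Kubernetes Python client; the namespace is a placeholder.
config.load_kube_config()   # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(namespace="harness").items:
    for cs in (pod.status.container_statuses or []):
        if cs.restart_count > 0:
            print(f"{pod.metadata.name}/{cs.name} restarted {cs.restart_count} times")
```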

We integrate with a ton of providers and support a variety of strategies.
Here is a cool view of the 24/7 Service Guard dashboard for a particular service called “Verification Service” across 2 environments named “production” and “freemium."


You can see the heatmap signatures for the different signals along with the deployment events denoted by the red and green circles.
The third building block is alerting: once you have 24/7 Service Guard enabled, you can configure risk-based alerts using your notification systems. We use Slack and PagerDuty.
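As a rough sketch of what such a risk-based alert can look like on the sending side, here is a minimal example that posts to a Slack incoming webhook once a risk score crosses a threshold; the webhook URL and threshold are placeholders, and PagerDuty exposes a similar events API:

```python
import requests

# The webhook URL is a placeholder for a Slack incoming webhook, and the risk
# threshold is an illustrative knob.
SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"

def send_risk_alert(service, env, risk, threshold=0.8):
    if risk < threshold:
        return
    message = (f":rotating_light: 24/7 Service Guard: {service} in {env} "
               f"crossed risk {risk:.2f} (threshold {threshold})")
    requests.post(SLACK_WEBHOOK, json={"text": message}, timeout=5)

send_risk_alert("Verification Service", "production", 0.92)
```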


We got an alert on Slack at 9:31 am that was sent out to PagerDuty as well. The on-call person jumped in, and the situation was resolved by 9:49 am, as denoted by the incident-closed alert. A quick turnaround. Not bad!
There you have it. This is Harness really using Harness to be Harness. For the skeptics out there who think this is an elaborate ploy by me to fake all of this ML prowess, I have a bridge to sell you :)
But seriously, here is a video on “How Build.com detects issues and rolls back in production."  
We use unsupervised techniques to model the various aspects of any time series (read “What Ails APM Today” first). Since this is live data that we don’t own (and therefore cannot label), it has to fit within the class of unsupervised learning. Think predictive modeling with unsupervised deep neural networks. If you want to know all of the machine learning details under the hood, you’ll have to join my team :).
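Harness’ production models are unsupervised deep neural networks, so the following is only a toy stand-in for the general idea: predict the next point of a time series from its recent history and score the prediction error, with no labels involved. All data, window sizes, and thresholds below are illustrative:

```python
import numpy as np

def anomaly_scores(series, window=12, threshold=3.0):
    """Toy unsupervised detector: predict each point from the rolling mean of
    the previous `window` points, then score it by how far the prediction
    error falls outside the historical error distribution (no labels needed)."""
    series = np.asarray(series, dtype=float)
    scores = np.zeros(len(series))
    residuals = []
    for t in range(window, len(series)):
        predicted = series[t - window:t].mean()
        residual = series[t] - predicted
        if len(residuals) > 1:
            sigma = np.std(residuals) or 1e-9
            scores[t] = abs(residual) / sigma
        residuals.append(residual)
    return scores, scores > threshold

# Steady response times with a sudden degradation at the end (synthetic data).
ts = list(100 + np.random.randn(60)) + [180, 190, 200]
scores, flags = anomaly_scores(ts)
print("anomalous points:", np.flatnonzero(flags))
```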
That concludes this lecture. Remember to “Move fast, but fix faster."
Cheers!
Sriram.
