Do Your Production Deployments Actually Succeed or Fail?
You've just deployed a new version of your service/app into production. You open your web browser and hit the live version in production. All looks well - or does it?
Failure Is Well Understood, Success Isn't
The truth is, success in IT is rarely talked about, let alone defined, measured, or promoted. As humans, we expect success - it's called "doing our job." Failure, on the other hand, is constantly talked about, measured, and communicated.
I always ask customers, "How do you verify production deployments today?" The most common response is, "We do it manually, mostly using gut feel. We have a handful of engineers checking logs and monitoring tools." The second most common response is normally laughter, followed by "If the app stays up" or "If no one complains."
The bottom line is that deployment success is not consistently measured or verified. Success is in the eye of the beholder, and therefore massively subjective. We spend millions on engineering software and tools, and yet when it comes to answering this most basic of questions, the majority still struggle and fall short. Applications have changed dramatically over the years, yet organizations still use the same old metrics to convince themselves everything is OK when the big red deploy button is pressed.
Cheap, Old, and Easy Metrics Aren’t Relevant
I'm talking about availability and the good ol' "five nines." Sure, measuring availability is better than not measuring it, but what exactly are you measuring, and how are you calculating it? The availability of servers, the OS, the application runtime, ports, or third-party APIs? Each gives you a completely different number, not to mention the implications of averaging these metrics over time.
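To make that ambiguity concrete, here's a back-of-the-envelope sketch (the component availabilities are made-up illustrative numbers, not measurements) showing why "which layer are you measuring?" changes the answer: components that each look fine in isolation multiply together when a request depends on all of them in series.

```python
# Illustrative only: "five nines" depends entirely on what you measure.
# These component availabilities are hypothetical numbers.
components = {
    "server": 0.9999,
    "os": 0.9999,
    "runtime": 0.999,
    "third_party_api": 0.995,
}

# Each component alone might look acceptable...
for name, availability in components.items():
    print(f"{name}: {availability:.2%}")

# ...but a request that traverses all of them in series multiplies
# their availabilities together.
end_to_end = 1.0
for availability in components.values():
    end_to_end *= availability

print(f"end-to-end: {end_to_end:.4%}")  # lower than any single component
```

So reporting the server's availability alone can overstate what a user actually experiences end to end, which is exactly why the raw number needs a definition behind it.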
Next up, let's consider resource utilization. Did CPU spike? What utilization are the servers running at following the deployment? Is this metric even relevant today with cloud compute, containers, and auto-scaling? Why is resource consumption important in 2018?
As Vanilla Ice once said: “Alright stop, collaborate and listen" - it's about time teams defined what success is for their services, apps and production deployments. Why? Because it would align everyone on what success and failure actually look like.
Measure The Business Impact
Imagine measuring the whole reason and hypothesis behind why someone in the business asked you to build something in the first place... imagine that... sounds crazy, doesn't it? Yet in IT we build and ship things and never go back to see what the real business impact actually was.
It's 2018 and there is really no excuse for Dev, IT Ops or DevOps not having business metrics at their disposal. Application Performance Monitoring (APM) solutions like AppDynamics, New Relic, and Dynatrace all allow you to instrument business metrics from your services and applications running in production.
Measuring revenue from a service, app, or business transaction is as easy as measuring response time these days. So if you own any of these tools, you really should be extracting business KPIs from your monitoring platforms. Yes, you can keep your application and infrastructure metrics, but don't let them mislead you if your business KPIs are all in line.
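The underlying idea is simpler than it sounds: revenue can be aggregated exactly like any other per-transaction metric. Here's a toy sketch (not any APM's actual API; the event shape and names are invented for illustration) of rolling per-transaction revenue up into a revenue-per-minute series:

```python
from collections import defaultdict

# Hypothetical events: (minute, business_transaction, revenue_amount).
# In practice an APM agent would tag these on live transactions.
events = [
    (0, "Checkout", 49.99), (0, "Checkout", 19.99),
    (1, "Checkout", 49.99), (1, "Upgrade", 9.99),
]

# Aggregate revenue per business transaction, per minute -- the same
# shape as a response-time time series, just with dollars as the value.
revenue_per_min = defaultdict(float)
for minute, txn, amount in events:
    revenue_per_min[(txn, minute)] += amount

for (txn, minute), total in sorted(revenue_per_min.items()):
    print(f"minute {minute} | {txn}: ${total:.2f}")
```

Once a business KPI exists as a time series like this, it can sit on the same dashboards, and feed the same alerting, as your latency and error-rate metrics.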
What If Continuous Delivery Could Also Verify Deployments?
Imagine if your deployment solution could magically link to your monitoring tools in production and tell you the exact business impact of every deployment. Better still, what if your deployment solution could automatically roll back when negative business impact is observed? I have good news. You can do exactly that with Harness.
Better still, you can perform canary-style deployments so that only a fraction of your production environment is upgraded and verified during a deployment, without exposing all of your users, traffic, and business.
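The traffic-splitting idea behind a canary can be sketched in a few lines (this is a generic illustration of the concept, not how Harness or any particular load balancer implements it; the 5% figure is arbitrary): bucket users deterministically so a small, stable slice of traffic hits the new version while everyone else stays on the old one.

```python
import hashlib

CANARY_PERCENT = 5  # hypothetical: send ~5% of traffic to the new version

def route(user_id: str) -> str:
    """Deterministically bucket a user so only a fraction hits the canary."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < CANARY_PERCENT else "stable"

# The same user always lands in the same bucket, so any one user's
# experience is consistent across requests.
counts = {"canary": 0, "stable": 0}
for i in range(10_000):
    counts[route(f"user-{i}")] += 1
print(counts)  # roughly a 5/95 split
```

Because the split is hash-based rather than random per request, you can verify the canary slice against the stable slice over time without users flip-flopping between versions.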
For example, below is a screenshot of Harness' Continuous Verification capability analyzing AppDynamics data. We use unsupervised machine learning to analyze the millions of time-series metrics that monitoring tools capture and manage, but we do so in the context of every production deployment. Our deployment workflows deploy your new services/apps and then connect to your monitoring tools to analyze what is happening from a performance, quality, and business perspective.
As you can see in the screenshot, the revenue per minute for one business transaction (Checkout) in the service that was upgraded by Harness actually decreased 77% post-deployment. In this scenario, you would want Harness to roll back the deployment to the last working version... and that is exactly what happened:
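Stripped of the machine learning, the decision in a scenario like that reduces to a simple gate you could reason about yourself. Here's a minimal sketch (the threshold and dollar figures are illustrative assumptions, not how Harness actually decides): compare a business KPI before and after the deploy, and roll back if the drop exceeds your tolerance.

```python
# Hypothetical verification gate: compare a business KPI (e.g. revenue
# per minute for the Checkout transaction) before and after a deploy.
ROLLBACK_THRESHOLD = 0.25  # illustrative: tolerate at most a 25% drop

def verify(baseline_per_min: float, post_deploy_per_min: float) -> str:
    """Return the deployment decision based on relative KPI change."""
    drop = (baseline_per_min - post_deploy_per_min) / baseline_per_min
    return "rollback" if drop > ROLLBACK_THRESHOLD else "promote"

# A 77% revenue drop, as in the screenshot scenario, trips the gate:
print(verify(1000.0, 230.0))  # -> rollback
# A 1% wobble does not:
print(verify(1000.0, 990.0))  # -> promote
```

The hard part in production is choosing the baseline window and threshold sensibly, which is where the statistical analysis earns its keep, but the gate itself is this simple.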
Define What Success Really Is (and Means) To Your Business
I'm sick and tired of hearing about infrastructure monitoring; it's kinda pointless without application or business context.
You end up measuring the bits and bytes of something without the ability to understand what impact that bit or byte is having on your business or customers. Pretty infra dashboards and charts don't tell you whether your customers are happy, or whether that new service you rolled out actually increased revenue by 34%.
Yes, monitoring your entire stack is important and valuable, and monitoring something is better than monitoring nothing. But please think about what success actually means for your business. Take that success definition and start applying it to your deployment pipelines so you can govern what success and failure are for each deployment.
Wouldn't it be awesome if you could measure the business impact of every deployment and start sharing your success as well as failures?