What Is Testing in Production & How to Avoid The Risks
Testing in production used to be a joke. It's not so funny now, because the best tests to learn from are the ones that most match reality.
About a decade ago, “testing in production” was a not-uncommon joke among software engineers. It was shorthand for something you definitely would never do... until eventually, some tools started encouraging you to do exactly that. Run a Google search for testing in production, and you’ll quickly get your fill of “testing in production” memes.
Testing in production, in short, means using live data and live users to figure something out. It can be performance-related, user-behavior-related, or anything else that you want to see, learn from, and react to based on what’s really happening when your apps serve live users.
Should Testing Be Done in Production?
Jokes aside, testing in production is something you should absolutely think about. Consider this: despite doing every manner of pre-production testing, you often find issues that weren’t caught in any test environment or staging environment, because test data isn’t perfect. Performance testing and load testing only get you so far without production data. Integration tests, unit tests, regression tests, and any other software tests you run often miss some case that doesn’t surface until real production data and production traffic run through the system. You’re running tests, but not the ones that reflect reality, which creates rework and slows down future feature releases.
Writing tests, good QA, and having PMs and other folks review changes prior to deployment - this all goes a long way in getting things right. But some things are hard to learn with mock data or at a different scale, such as:
- What does your real load do to your latest changes?
- What do users do that’s different from what you expected?
- Are you seeing the performance change you hoped to see?
- Are users reacting the way you intended them to?
- How does your change react under various stress situations?
You can try to learn some of this on your way to prod, but the most effective way to answer these questions, both technical and user-facing, is to take things for a spin with actual users and actual traffic. You need real production data and real production traffic to be as effective as possible.
What ISN’T Testing in Prod?
Before we dive any further into testing in production, let’s be clear on what testing in production is NOT. This is especially important to clarify when we consider what lives in a production environment and the impact of poor production testing practices.
Testing in production is not constantly doing deployments and rollbacks. And, it is not being cavalier about what you release. When testing in production, you are not using your live production data to make sure things work. That’s still what software testing and QA are for across test environments, staging environments, and any other place you test your code before your production environment.
In your actual production environment, production testing is all about learning, iterating, and helping inform your decisions with the best and most accurate data possible - after code has already gone through the standard deployment process. By serving production traffic and real users, you collect user data, test software variations, and get monitoring signals that you can use to understand whether what you deliver is what you were trying to achieve.
Why Use Feature Flags to Test in Production
Feature flags are critical to effectively testing in production. We say this because feature flags allow you to turn things on and off instantly against any criteria you want without the complexity of a rollback or redeployment. And to make sure we beat the dead horse, yes, this is all in a production environment. Let’s take a look at the capabilities feature flags provide in a production system that make them so critical:
- Kill switch functionality: turning things on and off instantly without rollbacks or redeploys.
- User segmentation: deciding exactly who gets access to features, allowing you to run specific production tests or get specific production data.
- Traffic segmentation: choosing which portion of your production traffic will be directed to a specific experience or change, allowing you to do load testing and performance testing.
- Progressive delivery: testing with a small set of production data (users, traffic, etc.) and progressively rolling out changes to larger cohorts until 100% of users in your production environments receive the change.
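The four capabilities above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular flag product works: the `Flag` class, the `bucket` and `is_on` functions, and the flag and segment names are all hypothetical, and a real flag service would store and evaluate this state remotely.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Flag:
    """Hypothetical in-memory flag; a real flag service would store this remotely."""
    name: str
    enabled: bool = True                                 # kill switch
    allowed_segments: set = field(default_factory=set)   # user segmentation
    rollout_percent: int = 0                             # progressive delivery, 0-100


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) so repeat visits get the same result."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_on(flag: Flag, user_id: str, segment: str = "") -> bool:
    if not flag.enabled:                  # kill switch wins over everything else
        return False
    if segment in flag.allowed_segments:  # targeted users always get the feature
        return True
    # Traffic segmentation: only the configured share of users sees the change
    return bucket(flag.name, user_id) < flag.rollout_percent


flag = Flag("new-checkout", allowed_segments={"beta"}, rollout_percent=10)
print(is_on(flag, "user-1", segment="beta"))  # True: targeted segment
flag.rollout_percent = 100                    # widen the rollout to everyone
print(is_on(flag, "user-2"))                  # True: 100% rollout
flag.enabled = False                          # flip the kill switch
print(is_on(flag, "user-1", segment="beta"))  # False: kill switch overrides targeting
```

Hashing the flag name together with the user ID keeps each user's experience stable between requests, and raising `rollout_percent` only ever adds users to the cohort rather than reshuffling it.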
If you’re not familiar with feature flags, we’ve got a great blog on How To Get Started With Feature Flags, as well as another blog on 5 Feature Flag Use Cases You May Not Have Thought Of.
Without feature flags, you are limited to blue-green deployments, canaries, and other very valuable but slower (and more expensive!) forms of production testing. Feature flags bring the cost of production testing down to almost nothing. They make testing in production simple and easy to coordinate.
Below are some examples of tests in production you might want to run. Typically, these tests relate to tangible business outcomes such as customer satisfaction or revenue gains:
- Of three possible solutions to this customer problem, which best serves customers?
- I have two ways to implement this algorithm. In test environments, they are about equal, but which performs better when large amounts of data come in via my one million users?
- I want to validate whether this new UX will impact user retention and time-in-app as I expect it to by running tests on real data from users.
- I’m expecting this simplified purchasing process to decrease the number of abandoned carts on my ecommerce page. I want to test this against two groups of users.
- I am rolling out a new set of long-lived operational flags we can flip during scheduled maintenance or outages in my production systems. Do they behave the way I expect against my whole user base?
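As a rough illustration of the abandoned-cart example, the sketch below splits users into two stable groups and compares abandonment rates between them. The variant names, the `assign_variant` and `abandonment_rates` helpers, and the sample events are all made up for illustration; in practice the events would come from your production analytics.

```python
import hashlib
from collections import defaultdict

VARIANTS = ("current-checkout", "simplified-checkout")  # hypothetical names


def assign_variant(user_id: str) -> str:
    """Deterministically split users into two groups of roughly equal size."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return VARIANTS[digest[0] % len(VARIANTS)]


def abandonment_rates(events):
    """events: iterable of (user_id, purchased) pairs, one per cart."""
    carts = defaultdict(int)
    abandoned = defaultdict(int)
    for user_id, purchased in events:
        variant = assign_variant(user_id)
        carts[variant] += 1
        if not purchased:
            abandoned[variant] += 1
    return {v: abandoned[v] / carts[v] for v in carts}


# Made-up events purely for illustration; real data would come from production.
sample = [(f"user-{i}", i % 3 == 0) for i in range(1000)]
for variant, rate in abandonment_rates(sample).items():
    print(f"{variant}: {rate:.1%} of carts abandoned")
```

A raw difference in rates is only a starting point. Before acting on it, you would normally also check that each group is large enough for the difference to be meaningful.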
How to Avoid the Risks of Testing in Production
By definition, testing in production can - and should - cover a lot of territory. You can test load, behavior, resiliency, and more. Therefore, there’s no one-size-fits-all answer. It all comes down to what you want to do.
Testing in production, when done properly with feature flags, is by nature already a way to mitigate risk. The concept of testing on live data in a production environment is scary, and feature flags are the most effective tactic for mitigating that risk! If you implement a good feature flagging solution, then your risk factors are no longer in the technology and its capabilities or limitations, but in organizational process.
There are a few things to keep in mind when thinking about feature flags and testing in production for the first time that will help you mitigate risk in the long term:
- A feature flag gives you options you don’t otherwise have. Instead of thinking, “What am I doing that could use a feature flag?” you can think “I’m making a change, let me put it behind a flag just in case.” It’s always better to have the option, and the cost is close to nil.
- Agree up front what you’d like to learn and what your goal in testing will be. Who owns the test and how will you learn from it?
- Determine before starting a test: who the production audience will be, when the test terminates, and what to do depending on the outcome.
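One way to make these agreements concrete is to write them down next to the flag itself. The sketch below assumes a hypothetical `ProductionTestPlan` record. Its field names and example values are illustrative and not part of any specific tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ProductionTestPlan:
    """Hypothetical record of what the team agreed on before the flag is turned on."""
    flag_name: str
    owner: str        # who owns the test
    goal: str         # what you want to learn
    audience: str     # who in production sees the change
    ends_on: date     # when the test terminates
    if_success: str   # what to do if the goal is met
    if_failure: str   # what to do if it is not

    def is_expired(self, today: date) -> bool:
        """True once the agreed end date has passed and a decision is due."""
        return today > self.ends_on


plan = ProductionTestPlan(
    flag_name="simplified-checkout",
    owner="checkout-team",
    goal="Reduce abandoned carts",
    audience="10% of logged-in users",
    ends_on=date(2030, 1, 31),
    if_success="Roll out to 100% and remove the flag",
    if_failure="Turn the flag off and revisit the design",
)
print(plan.is_expired(date(2030, 2, 1)))  # True: the end date has passed
```

Keeping the end date and both outcomes in one place makes it harder for a test to linger in production without anyone deciding what happens next.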
Using Harness Feature Flags to Test in Production
Harness Feature Flags is a complete feature flag management tool that allows users to create and manage flags both in code and through a UI. Harness uses a flexible targeting model that lets you apply your flags any way you want - against users, regions, clusters, accounts on certain billing levels, or anything else you can think of. Take advantage of this to test from a wide variety of angles simultaneously.
In addition to the ability to set up tests, there’s a layer of automation you can build on top. This is especially helpful if you run standardized tests in production, or if you know exactly what you want to happen when certain behavior occurs. For example, you could run three different versions of a new UX and, based on which one keeps users in the app the longest, automatically roll that version out to the whole user base and remove the other two. That delivery, verification, and trigger-based rollout can all be automated using Harness.
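The decision logic behind that three-variant example might look something like the sketch below. It does not use the Harness API: `pick_winner`, `rollout_actions`, the action labels, and the sample numbers are all hypothetical, and a real pipeline would feed in production metrics and apply the resulting actions through your flag tooling.

```python
from statistics import mean


def pick_winner(session_minutes: dict, min_samples: int = 100):
    """Return the variant with the longest mean session, or None if data is too thin.

    session_minutes maps a variant name to a list of observed session lengths.
    """
    if not session_minutes:
        return None
    if any(len(samples) < min_samples for samples in session_minutes.values()):
        return None
    return max(session_minutes, key=lambda v: mean(session_minutes[v]))


def rollout_actions(session_minutes: dict, min_samples: int = 100) -> dict:
    """Translate the result into per-variant actions a pipeline could apply."""
    winner = pick_winner(session_minutes, min_samples)
    if winner is None:
        return {v: "keep-testing" for v in session_minutes}
    return {v: "roll-out-100%" if v == winner else "remove" for v in session_minutes}


# Made-up observations for three UX variants
observed = {
    "ux-a": [4.0, 5.0, 6.0],
    "ux-b": [7.0, 8.0, 9.0],
    "ux-c": [1.0, 2.0, 3.0],
}
print(rollout_actions(observed, min_samples=3))
# {'ux-a': 'remove', 'ux-b': 'roll-out-100%', 'ux-c': 'remove'}
```

The `min_samples` guard is a deliberately simple stand-in for a proper significance check. Comparing means alone can crown a winner on noise, so an automated trigger should only fire once you trust the data behind it.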
Testing in production has gone from a quip among software engineers to a reality with the spread of modern Continuous Delivery and feature flags.
Feature flags specifically help increase velocity, valuable feedback, and responsiveness while lowering risk and cost. This is critical, because the best tests to learn from are the ones that most match reality. And, reality means production.
While it can initially be difficult to take the leap into production testing with feature flags, questions like who owns the test, who the audience is, and when the test ends give you a place to start. It gets much easier to iterate and expand as your team gets more comfortable.
Have you read our eBook on Feature Flags yet? It’s free and doesn’t require an email address! If you’re looking to learn more, it’s a great resource. Download The Basics: Feature Flags 101 today! We'd also love to point you to our piece on the Best Feature Flag Tools so you can find a solution that works for you.