A/B Testing For Feature Flags, What It Is And What It Shouldn’t Be

A/B testing with feature flags can serve engineering verification or behavioral analytics. While feature flags optimize deployment and quality, A/B tests focus on user experience and business goals. Use specialized tools for each to maximize effectiveness.

Intro

We talk daily to teams that list A/B testing among the things they need from a feature flag tool. However, what A/B testing is and where you may want to get it from usually becomes a more nuanced conversation than just “included” or “not included.”

In this post, we want to look at what people mean when they talk about A/B testing and share how we think about it in relation to feature flags.

Two Meanings of A/B Testing

There are two pretty different things people can mean when they talk about A/B testing in the context of feature flags:

Engineering verification. This is when teams use their production environment as a safe test environment: they measure the performance or cost impact of a change against their baseline version before progressing it to a wider release. This is a great use case for feature flags.
Behavioral analytics. This is when a team is trying to find out whether a blue button converts better than a red button, or whether a different layout keeps people on the page longer. These tests are implemented via a feature flag, but beyond that we think coupling them to a feature flag vendor is the wrong way to go.

For the rest of this post, we’ll focus on the confusion around the second point. In a future post, we’ll talk more about the engineering side of A/B testing and the process of verifying the impact of your code changes on your overall system’s performance and cost.

High-Value Feature Flags And High-Value A/B Tests

Let’s start from an assumption: you want to get the absolute most value possible out of both your feature flags and your A/B tests.

To do that, it helps to understand that they serve two different purposes, and often two different groups of people in the organization.

Feature flags are a product and engineering process: they separate release from deployment in order to accelerate velocity, improve customer experience, and empower organizations to release more often and run higher-quality software.
A/B testing, in the behavioral analytics sense, is about learning and optimizing the user experience for business goals. It’s usually for product management and marketing, and the KPIs are usually growth or revenue related, not velocity, performance, and resiliency related.

A/B tests are implemented much like a feature flag (it’s essentially just a conditional in the code that serves one path or another), but the concerns before and after the implementation are pretty different for the two.
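To make the "one path vs. another" point concrete, here is a minimal sketch of what that conditional can look like. It assumes a simple hash-based bucketing scheme; the names `assign_variant`, `button_color`, and the `checkout-button-color` experiment key are hypothetical and do not come from any particular SDK.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically bucket a user so they see the same variant on every call.

    Hypothetical helper for illustration, not a vendor API.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "treatment" if bucket < treatment_share else "control"


def button_color(user_id: str) -> str:
    # The "diff in the code": a single conditional that picks one path or the other.
    if assign_variant(user_id, "checkout-button-color") == "treatment":
        return "blue"
    return "red"
```

The code itself is the easy part. Everything the rest of this post discusses (governance on one side, statistics on the other) happens around this conditional, not inside it.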

With feature flags, you want governance, developer experience, release automation, reporting, and lifecycle management. With behavioral analytics A/B testing, you want data science and growth/revenue-based correlations. This often involves delving into topics like true positive rates and false discovery rates, and telling a type II error (a real effect that the test missed) apart from an accurate result.
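For readers less familiar with those terms, the sketch below shows how they relate to one another. It assumes you could somehow label every past experiment by whether a real effect existed and whether the test reported one; the class name and the counts in the usage note are made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExperimentOutcomes:
    """Counts of experiment results, split by what was true and what the test said."""

    true_positives: int   # real effect, and the test detected it
    false_positives: int  # no real effect, but the test reported one (type I error)
    false_negatives: int  # real effect, but the test missed it (type II error)
    true_negatives: int   # no real effect, and the test reported none

    def true_positive_rate(self) -> float:
        # Share of real effects that the tests actually detected.
        return self.true_positives / (self.true_positives + self.false_negatives)

    def false_discovery_rate(self) -> float:
        # Share of reported "wins" that were not real.
        return self.false_positives / (self.true_positives + self.false_positives)

    def type_two_error_rate(self) -> float:
        # Share of real effects that the tests missed.
        return self.false_negatives / (self.true_positives + self.false_negatives)
```

With hypothetical counts such as `ExperimentOutcomes(16, 4, 4, 76)`, the true positive rate is 16 / 20 and the false discovery rate is 4 / 20. Getting numbers like these right is the job of a dedicated analytics tool, which is the point of the paragraph above.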

From a user experience standpoint, it’s very unlikely that a tool that provides a great experience for engineering also provides a great experience for marketing. These are significantly different audiences with different lifecycle concerns and different optimizations needed.

However, because the implementation is so similar, you do see companies on the market that bolt the two together. This results in great A/B testing companies with very limited feature flag offerings, or feature flag companies with very minimal, hard-to-use A/B testing offerings. We don’t like either approach.

Our Strategy

On top of the overall differences between feature flags as an engineering process and A/B testing as a growth and revenue process, we also find that, increasingly, most teams already have world-class analytics tools in place and don’t need more data siloing and more tools to log into.

So, here’s how we see it: you should use the best data analytics tools in the world for A/B testing, and you should be able to implement those tests via Feature Flags that are focused on absolutely maximizing the value of feature flags in your software delivery process.

Increasingly, best-of-breed analytics tools with A/B testing (such as Amplitude Experiment, Statsig, and GrowthBook) allow for Segment implementations of the data payload needed to run their experiments. We are connecting our Feature Flags to Segment to automate the process of using Harness Feature Flags with any of the best A/B testing analytics vendors on the market, allowing you to keep your data all in one place and letting you use the best tool for each job. Until that is fully automated, adding a Segment call alongside your flags is simple and adds minimal overhead for the value it unlocks.
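As a rough sketch of what "adding a Segment call to your flags" could look like, the wrapper below evaluates a flag and reports the served variation in one step. Everything here is an assumption for illustration: `evaluate_flag` stands in for whatever your feature flag SDK exposes, `track` is modeled on the general shape of a Segment track call (user id, event name, properties), and the event and property names are placeholders that your analytics vendor may expect to be different.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical signatures, injected so the sketch does not depend on a real SDK.
EvaluateFn = Callable[[str, str], str]             # (flag_key, user_id) -> variation
TrackFn = Callable[[str, str, Dict[str, Any]], None]  # (user_id, event, properties)


def evaluate_and_track(
    user_id: str,
    flag_key: str,
    evaluate_flag: EvaluateFn,
    track: TrackFn,
) -> str:
    """Evaluate a flag and send the served variation to the analytics pipeline."""
    variation = evaluate_flag(flag_key, user_id)
    # Event and property names below are assumptions, not a required schema.
    track(
        user_id,
        "Experiment Viewed",
        {"experiment_name": flag_key, "variation_name": variation},
    )
    return variation


# Usage with in-memory stand-ins for the flag SDK and the Segment client.
events: List[Tuple[str, str, Dict[str, Any]]] = []
variation = evaluate_and_track(
    "user-42",
    "new-checkout",
    evaluate_flag=lambda flag_key, user_id: "on",
    track=lambda user_id, event, props: events.append((user_id, event, props)),
)
```

Injecting both functions keeps the flag tool and the analytics tool decoupled, which matches the split argued for above: the flag decides what the user sees, and the analytics vendor decides what it means.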

At Harness, we are laser-focused on solving the problems associated with software delivery. It can be tempting to drift into adjacent areas, but when the problems are far apart, you deliver weak offerings that don’t satisfy the end users – and that’s just not our approach.