Holdback experiments offer a strategic way to confirm long-term effects by exposing a small user segment to the original experience while rolling out improvements to the majority. This method enables safer decision-making for compounding metrics like retention, without delaying immediate business gains.
Some experiments require a long time to reach significant results. If one option is preferable in the short term, you can split traffic unevenly for a long period: expose 95% of traffic to the preferred option and 5% to the other. Both the pattern and the small remaining variant are known as a holdback.
Some changes to your web service or product, like making the purchase flow easier to navigate, are meant to raise business-critical metrics immediately. Others, like a new channel for customer service, might improve customer satisfaction rapidly but will only have a measurable, compounding effect on retention and other business-critical metrics in the long run. You can confirm that customers like the new option by looking at the Net Promoter Score (NPS). However, should you expose half of your users to a worse experience for months to measure the impact on churn?
There are many cases where an experiment should not last more than a few weeks, three months at most, to keep the product cycle manageable. However, some effects, like customer churn, can take longer to measure. Say you want to measure the impact of your change on churn, but your customers book a holiday or review their retirement plan only once a year. In either of those cases, a ten-week experiment is too short to expect customers to return, let alone to gather enough data to measure churn.
There are several options: you can run the full-length, balanced experiment and accept exposing half of your users to the worse experience for months; you can roll the change out to everyone once short-term metrics look positive and rely on a before-and-after comparison; or you can roll it out to most users and keep a small holdback as a long-term control group.
An approach we could recommend is the second one: run the experiment as planned, set a short-term goal, like a customer satisfaction survey, as your objective criterion, and roll the change out to all customers if the impact after a few weeks is significantly positive. Months later, you can check whether overall retention has indeed improved compared to before the experiment. That, however, comes with the limits of a before-and-after comparison: seasonality or unrelated changes can confound the result.
With the third approach, the holdback, you can still measure what it is like to have better customer service over a couple of purchase cycles; beyond that, you can also measure the impact of customers coming to expect excellent service, time after time, over an extended period. For example, it might increase customers' sense of entitlement, it could affect the brand positively, or it could drive stories about exceptional situations where better service was helpful.
The first question from your statisticians or analysts will likely be: “Can we even measure the impact over only 5% of the audience? Wouldn’t that mean three times less power?” It would, roughly: 5% is ten times fewer units than 50%, and since the standard error of an estimate shrinks with the square root of the sample size, the noise on the holdback arm is about √10 ≈ 3 times larger. But a longer test setup is also more sensitive: more visitors can enroll in the experiment, and some effects compound.
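Here is a minimal sketch of that back-of-the-envelope argument, assuming an illustrative enrollment of 100,000 visitors and a 30% baseline retention rate (both numbers are made up for the example):

```python
import math

N = 100_000   # total visitors enrolled (assumed for illustration)
p = 0.30      # baseline retention rate (assumed for illustration)

def standard_error(n: int) -> float:
    """Standard error of an observed rate p measured over n units."""
    return math.sqrt(p * (1 - p) / n)

se_balanced = standard_error(int(N * 0.50))  # one arm of a 50/50 test
se_holdback = standard_error(int(N * 0.05))  # the 5% holdback arm

print(f"SE of a 50% arm: {se_balanced:.4f}")
print(f"SE of the 5% arm: {se_holdback:.4f}")
print(f"ratio: {se_holdback / se_balanced:.2f}")  # √10 ≈ 3.16
```

The ratio depends only on the relative split, not on the assumed traffic or baseline rate, so the √10 factor holds whatever numbers you plug in.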
More importantly, with customers exposed to the better customer service multiple times, their retention should not just improve but compound. If your retention improves by 10% over one month, it is 21% better after two months and 77% better after six. Those much larger effects are easier to detect.
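The arithmetic is simple monthly compounding, as a quick sketch shows:

```python
# The text's example: a 10% monthly retention lift, compounded over time.
lift = 0.10
for months in (1, 2, 6):
    print(f"after {months} month(s): {(1 + lift) ** months - 1:.0%} better")
# after 1 month(s): 10% better
# after 2 month(s): 21% better
# after 6 month(s): 77% better
```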
If you first run a balanced 50/50 test, you know which variant offers the most short-term value, or which one is the most promising overall. To minimize the negative impact of testing on the business, roll out the variant with the most promising outcome, especially on leading indicators such as the best customer satisfaction or the most items marked as favorites, to 90% or 95% of the user population.
You can also decide to pick the option that will be easiest to deactivate, in case the holdback experiment gives surprising results. Introducing new interactions means that removing them later will come at a cost. Keep in mind, however, that a holdback is there to confirm a previous result and possibly measure its impact more accurately; it rarely flips the overall outcome.
Another way to decide which option to prioritize is to think about the possibilities it opens up. Allowing customers to mark items as favorites (without buying) lets you reactivate those customers with more purchase opportunities, and it lets your machine learning team train better recommendations. Those follow-on improvements can justify assigning more value to your preferred option.
Of course, if your users talk to each other, those left behind might resent not getting the better experience, and you might get bad press from the discrepancy. Exercise discretion, and end the holdback early when it costs more than the insight is worth. Still, the effort will pay off in the long run by helping convince your executive stakeholders to invest in better service for long-term objectives.
When running this process, one of the most common issues is having to maintain the old code or the previous operational processes for longer. That is a legitimate concern for the software engineers and operational managers who will want to move on. Getting them engaged with the process is critical: explain the value of experimentation and why a holdback is useful when dealing with long-term effects. They will generally understand that it aligns with their own objectives of a more streamlined experience and of investing to resolve technical and operational debt over the long term.
When you run an experiment that proves beneficial over the short term, you will want to roll it out as soon as you have significant results. However, if you still want to investigate its long-term effects, you need to keep the rigor of a controlled experiment. To let as many users as possible benefit from the improved experience, roll it out to the majority of users, say 95%, and keep the remaining minority in a long-term control group. This is known as a holdback.
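Feature-flagging platforms handle this assignment for you, but the underlying mechanics are easy to picture. Here is a minimal, self-contained sketch of deterministic hash-based bucketing; the salt, threshold, and function names are illustrative assumptions, not any particular platform’s API:

```python
import hashlib

HOLDBACK_PCT = 5  # share of users kept on the old experience (assumed)
SALT = "better-support-holdback"  # per-experiment salt so buckets differ across experiments

def bucket(user_id: str) -> int:
    """Deterministically map a user to a bucket in [0, 100)."""
    digest = hashlib.sha256(f"{SALT}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def treatment(user_id: str) -> str:
    """The same user always gets the same treatment, session after session."""
    return "holdback" if bucket(user_id) < HOLDBACK_PCT else "new_experience"

print(treatment("user-42"))  # stable for "user-42" for as long as the holdback runs
```

Because assignment depends only on the user ID and the salt, the 5% holdback stays stable for months, which is exactly what a long-term control group requires.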
After several months, you should have a strong signal about the long-term impact on key metrics, notably those that compound. Remember to switch the holdback to the new experience when your experiment is over.