From ‘if’ to automatic A/B testing

In 4 easy phases

Feb 2, 2022

Many teams use the words ‘deploy’ and ‘release’ interchangeably, which signals that they are not using A/B testing. Moreover, they are not using releases as a strategic tool. That is a shame, since feature toggling and A/B testing were used in some of the great successes in tech history:

When Facebook launched the most extensive messaging service to the whole world at once,
when Steve Jobs launched the new app store while he was on stage, and
when Obama’s campaign needed to choose a photo for their website.

Like the invention of fire, feature toggling (and its natural extension A/B testing) can be a game-changer, but also, like fire, it can be destructive if used carelessly. This article describes my journey to help clients move to feature toggling and eventually A/B testing safely.

Why?

Before we dive into the technical stuff, let’s examine the ‘why.’ A/B testing extends feature toggling, so the arguments for toggling are amplified. I will therefore argue for feature toggling, but keep in mind that these arguments apply even more strongly to A/B testing.

Business releasing

As mentioned above, a carefully scheduled release time can be a powerful tool. During a public launch, it shows professionalism. It can also be helpful to give users access to a new feature only after they have completed training, or to give a certain percentage of users access to a new feature to validate its value before a wider launch.

These effects are impossible if we require a developer or release team to deploy whenever we want to release something. They require decoupling ‘release’ from ‘deploy,’ which is exactly what feature toggling does.

Instant rollbacks

If we can release without deploying, it follows naturally that we can also un-release, or roll back, without deploying. Indeed, the security and stability gained from disabling faulty code without relying on deploys are among the most significant advantages of feature toggling. When an error is discovered in a new feature in production at 3 am, there is no need to call and wake up developers; the stakeholder can roll back the feature, and the developers can fix it in the morning.

Fewer environments

Expanding further on this point, with feature toggling, we can have different features enabled for different users, allowing us to simulate a testing environment inside production for our QA people. This capability, in turn, allows us to eliminate the need to have a complete duplicate of all our infrastructure and only use one environment — in addition to the developers’ local environments.

Smaller batch sizes

Without a doubt, the most significant advantage of feature toggling is that it allows us to reduce our batch sizes in two critical dimensions.

First, because our code is toggled off while we develop it, we can safely integrate our changes into the main branch at any point where the code compiles. We want to integrate small changes often because it dramatically reduces the risk of merge conflicts, which are both exhausting and error-prone.

The second dimension is that since the code is toggled off, we can safely deploy at any point. We know from the State of DevOps report that deployment frequency is one of the most significant predictors of software delivery performance.

Why build it ourselves?

So, you’re convinced feature toggling is the way to go, but you wonder: why build it ourselves? Why not just adopt one of the various solutions out there?

I always recommend that teams start by implementing their own simple version, since the most challenging thing about feature toggles is getting rid of them again. Until I have seen a team consistently rotate toggles in and out, I cannot guarantee the toggles will be helpful.

ThoughtWorks made another vital point in their tech radar, where they have long recommended the simplest possible feature toggle, since they saw teams struggle with the complexity when going straight to full-blown ‘unnecessarily complex feature toggle frameworks.’

Phase 1: Feature-ifs

Now, let’s get to the building. I have already hinted at the first stage: learn to get rid of toggles. Of course, to get rid of them, we first have to introduce them. To make them as easy as possible to remove, we use the most straightforward construct possible: ifs.

Let’s go through the whole lifecycle of a feature flag with an example. Before we start, we have some code we want to modify:

void foo() {
  ...
  [old code]
  ...
}

The first thing we do is to introduce if-false-else and duplicate the current code in both branches.

void foo() {
  ...
  if (false) {
    [old code]
  } else {
    [old code]
  }
  ...
}

To make the toggle easier to find later, I recommend extracting the boolean into a named function that we can search for, or that our compiler can help us find.

class FeatureToggles {
  static boolean featureName() { return false; }
}
void foo() {
  ...
  if (FeatureToggles.featureName()) {
    [old code]
  } else {
    [old code]
  }
  ...
}

We can now modify the top version without fear of affecting behavior.

class FeatureToggles {
  static boolean featureName() { return false; }
}
void foo() {
  ...
  if (FeatureToggles.featureName()) {
    [new code]
  } else {
    [old code]
  }
  ...
}

When we want to release our modification, we replace false with true and deploy normally. At the same time, we create a ticket in our ticket system to remove this feature-if. The ticket should be due in six weeks at the latest.

class FeatureToggles {
  static boolean featureName() { return true; }
}
void foo() {
  ...
  if (FeatureToggles.featureName()) {
    [new code]
  } else {
    [old code]
  }
  ...
}

Six weeks later, we return to the if, see whether it is true or false, and remove the dead branch without question, even if it means removing the new code. If there was a problem with the new code that we could not fix within the six weeks, then either it is too low a priority, and we should postpone it and redo it later, or we do not yet understand it sufficiently, and we should consider the first attempt a spike, delete it, and try again. Usually, though, the toggle is turned on.

void foo() {
  ...
  [new code]
  ...
}

We still need deployment to release functionality during this phase, so we have not yet decoupled the two. Therefore it is also still a developer task to release. We do, however, already get the advantages of smaller batches.

Phase 2: Environment variables

Once we have validated that we can get rid of the ifs, we can start using them. It is time to decouple release and deploy. The easiest way to achieve this is to replace the hardcoded boolean with an environment variable when the functionality is ready for release. Notice that we still use if-false while we are developing.

class FeatureToggles {
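  // A missing variable parses as false, so the feature stays off by default.
  static boolean featureName() { return Boolean.parseBoolean(System.getenv("featureName")); }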
}
void foo() {
  ...
  if (FeatureToggles.featureName()) {
    [new code]
  } else {
    [old code]
  }
  ...
}

With environment variables, we can release and roll back without deploying, which is fantastic. This implementation does come with a few pitfalls, though. The first is accidentally reusing an old flag name. Doing so can cause us to release something unintentionally, with possibly catastrophic effects. To mitigate this risk, we should maintain a list of all previously used names and never reuse any of them. Another pitfall arises when our app runs on multiple machines with separate environment variables; in this case, we need to be vigilant about updating the values on every instance.
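To make the first pitfall concrete, here is a minimal sketch of an environment-backed toggle reader that refuses previously used names. The class `EnvToggles`, the method `isEnabled`, the set `RETIRED_NAMES`, and the flag names in it are all hypothetical, and treating only the literal value "true" as on is an assumption.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper: reads feature toggles from environment variables.
final class EnvToggles {
    // Assumption: the list of flag names that were used before and must never be reused.
    static final Set<String> RETIRED_NAMES = Set.of("oldCheckout", "betaSearch");

    private final Map<String, String> env;

    // Pass System.getenv() in production; pass a plain map in tests.
    EnvToggles(Map<String, String> env) {
        this.env = env;
    }

    boolean isEnabled(String flagName) {
        if (RETIRED_NAMES.contains(flagName)) {
            throw new IllegalArgumentException("Flag name was used before: " + flagName);
        }
        // Boolean.parseBoolean returns false for null, so a missing variable means "off".
        return Boolean.parseBoolean(env.get(flagName));
    }
}
```

In production code this could be constructed as `new EnvToggles(System.getenv())`. Failing loudly on a retired name is one possible design; a build-time check against the same list would catch the mistake even earlier.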

We have decoupled release and deployment; however, because environment variables control the release, developers, not business people, still have to do it.

Phase 3: UI

The next phase is all about moving the responsibility to the stakeholders. To do this, we move the toggles to a database and expose them through a straightforward UI. Implementing these is especially easy if our software already uses a database and has some UI we can plug into.

As toggles are now in a database, we can decorate them with additional attributes. Here is when feature toggling starts to shine because we can make two excellent extensions.

We can add segments and make many-to-many relations between users and segments, and between segments and features. We can now control which types of users see which features. Popular segments include testers, super users, early adopters, regular users, and late adopters.
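A minimal sketch of the segment lookup could look like this. Two in-memory maps stand in for the two many-to-many tables, and the class `SegmentToggles` and its method names are hypothetical. The rule that a feature is on when the user shares at least one segment with it is an assumption about how the relations are evaluated.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the two many-to-many tables:
// user <-> segment and segment <-> feature.
final class SegmentToggles {
    private final Map<String, Set<String>> segmentsByUser;
    private final Map<String, Set<String>> segmentsByFeature;

    SegmentToggles(Map<String, Set<String>> segmentsByUser,
                   Map<String, Set<String>> segmentsByFeature) {
        this.segmentsByUser = segmentsByUser;
        this.segmentsByFeature = segmentsByFeature;
    }

    // Assumption: a feature is on for a user if they share at least one segment.
    boolean isEnabled(String feature, String user) {
        Set<String> userSegments = segmentsByUser.getOrDefault(user, Set.of());
        Set<String> featureSegments = segmentsByFeature.getOrDefault(feature, Set.of());
        return userSegments.stream().anyMatch(featureSegments::contains);
    }
}
```

Unknown users and unknown features fall back to an empty segment set, so the feature stays off for them.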

Instead of a toggle being either on or off, we can use probability to decide how many users see the new feature. Doing this allows us to do a slow rollout, ensuring we don’t see a rise in errors or other adverse effects from the minority of users seeing the new feature. We can achieve this by adding a probability column and then comparing it to a random number when checking it.
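The probability check can be sketched as follows. A map stands in for the probability column, the class `ProbabilityToggles` is hypothetical, and the random source is passed in so that the check can be tested deterministically.

```java
import java.util.Map;
import java.util.function.DoubleSupplier;

// Hypothetical sketch: a toggle is on when a random draw falls below its probability.
final class ProbabilityToggles {
    // Stand-in for the probability column: feature name -> value between 0.0 and 1.0.
    private final Map<String, Double> probabilityByFeature;
    // Source of random numbers in [0, 1); Math::random works in production.
    private final DoubleSupplier random;

    ProbabilityToggles(Map<String, Double> probabilityByFeature, DoubleSupplier random) {
        this.probabilityByFeature = probabilityByFeature;
        this.random = random;
    }

    boolean isEnabled(String feature) {
        // Unknown features default to probability 0, which means always off.
        double probability = probabilityByFeature.getOrDefault(feature, 0.0);
        return random.getAsDouble() < probability;
    }
}
```

This sketch draws a new random number on every check, so the same user can see the feature flip between requests. Drawing once per session and remembering the result would avoid that.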

Phase 4: Automatic

The final phase has a bit of a prerequisite because we need to detect either the success or the failure of an interaction or session. This is only possible if we have some monitoring or error handling facilities in place. However, if we do have that, this phase is no more challenging than the preceding ones. The goal is to make the rollout automatic, so we don’t have to monitor whether the new feature is successful or not. It sounds scary and complicated, but it is surprisingly easy to build a basic version of this.

We begin by splitting the probability into two attributes, one counting successes and one counting failures. Each time we read a toggle, we keep track of whether it was on or off. If the session ends in failure, every toggle that was on in this session gets one added to its failure column, and every toggle that was off gets one added to its success column. A successful session is handled the same way, with the columns swapped. We also update our selection function to the formula: random() < succ/(succ+fail). This explanation can sound a bit convoluted, so let’s look at an example.

User A enters our website. While browsing, he triggers two feature toggles, T1 and T2. For this session, T1 is enabled, and T2 is disabled. Unfortunately, he encounters an error a few minutes later. The system detects this. Because T1 was enabled, it adds one to T1’s failure column; because T2 was disabled, it adds one to T2’s success column. The session resets, and the user can try again, getting a new combination of toggles.
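The bookkeeping from this example can be sketched in a few lines. The classes `AutoToggle` and `ToggleSession` are hypothetical, and the counters live in memory here rather than in a database. The handling of a successful session (enabled toggles gain a success, disabled toggles gain a failure) is my reading of the rule and should be treated as an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.DoubleSupplier;

// Hypothetical toggle whose rollout probability follows its success and failure counts.
final class AutoToggle {
    private long successes;
    private long failures;

    AutoToggle(long successes, long failures) {
        if (successes <= 0 || failures <= 0) {
            throw new IllegalArgumentException("both counters must start positive");
        }
        this.successes = successes;
        this.failures = failures;
    }

    double probability() {
        return (double) successes / (successes + failures);
    }

    // The selection function: random() < succ / (succ + fail).
    boolean decide(DoubleSupplier random) {
        return random.getAsDouble() < probability();
    }

    // The toggle earns a success when being on went well or being off went badly.
    void record(boolean wasEnabled, boolean sessionSucceeded) {
        if (wasEnabled == sessionSucceeded) {
            successes++;
        } else {
            failures++;
        }
    }
}

// Remembers which toggles were read in one session and how each was decided.
final class ToggleSession {
    private final Map<AutoToggle, Boolean> decisions = new HashMap<>();

    boolean isEnabled(AutoToggle toggle, DoubleSupplier random) {
        // Decide once per session so the user sees a consistent state.
        return decisions.computeIfAbsent(toggle, t -> t.decide(random));
    }

    void finish(boolean succeeded) {
        decisions.forEach((toggle, wasEnabled) -> toggle.record(wasEnabled, succeeded));
    }
}
```

Replaying the example with both toggles starting at one success and one failure: after the failed session, T1 drops from a probability of 1/2 to 1/3, and T2 rises from 1/2 to 2/3.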

This basic implementation is strong enough to detect faulty features and automatically disable them, even when errors arise from complex interactions of multiple toggles. Thus, although it seems uncomfortable to relinquish control of ‘what is running in production,’ it is, in fact, much safer.

We only have to set both failure and success to some positive number to start the process. If we want a fast rollout, we can set both to one. If we wish to roll out slowly, we can set failure to 9 and success to 1, which will show the new feature to 10% of users — in the beginning. Remember, we still need to examine the state of the toggle after at most six weeks and remove the if; only now we remove the path with the least probability.

That’s it! You have built your own automatic A/B testing framework with machine learning and everything. Well done!

Why not?

For completeness, though, let’s look at some of the reasons you might not want to use feature toggling — some of the drawbacks.

Double maintenance

As we saw earlier, the first step is to duplicate code. Code duplication goes against our instincts as developers because if we have to adjust something in one version, we now have to do it in two. We have double maintenance for the duration of the toggle, but it gets even worse when you realize that we probably want to feature toggle our bugfix as well, meaning that instead of one if, we now have three, resulting in four copies of the code. This state explosion can quickly get out of hand. The remedy is to make the toggles as short-lived as possible. As with smaller batches, shorter toggles carry less risk.
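To see where the three ifs and four copies come from, here is a small sketch. The names `featureName` and `bugfixName` are hypothetical toggles passed in as parameters, and the strings stand in for the four copies of the code.

```java
// Hypothetical example: one feature toggle plus a toggled bugfix that touches both paths.
final class NestedToggleExample {
    static String foo(boolean featureName, boolean bugfixName) {
        if (featureName) {                      // if number 1: the feature toggle
            if (bugfixName) {                   // if number 2: the bugfix in the new path
                return "new code, with bugfix";
            } else {
                return "new code";
            }
        } else {
            if (bugfixName) {                   // if number 3: the bugfix in the old path
                return "old code, with bugfix";
            } else {
                return "old code";
            }
        }
    }
}
```

Each additional toggle that touches the same code can double the number of copies again, which is why short-lived toggles matter.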

Complexity

We spend a lot of our time looking at and trying to understand code. Each branch adds to what we have to keep track of while working with the code. Therefore each toggle has an actual price in that the code gets more complicated with each if. This again underlines why we want toggles to be short-lived.

Testing is harder

Some of us like to test our systems before letting users interact with them. The problem is that if we have five toggles, which of the 32 possible configurations of toggles should we test? All of them? Just one? This question leads us to another instance of the state explosion mentioned above, and it is a difficult question to answer. I recommend that people test with the configuration they expect the system to run; typically, this means all features on. However, since rollback is free and sometimes even automatic, failures are usually a lot less severe for the users.
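The count of 32 comes from each of the five toggles being either on or off, which gives 2 to the power of 5 combinations. A small sketch that enumerates the configurations makes the growth visible; the class `ToggleConfigurations` and the toggle names in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that lists toggle configurations for testing.
final class ToggleConfigurations {
    // Every on/off combination: 2^n maps for n toggles (keep n well below 31).
    static List<Map<String, Boolean>> all(List<String> toggleNames) {
        int n = toggleNames.size();
        List<Map<String, Boolean>> result = new ArrayList<>();
        for (int bits = 0; bits < (1 << n); bits++) {
            Map<String, Boolean> config = new LinkedHashMap<>();
            for (int i = 0; i < n; i++) {
                // Bit i of the counter decides whether toggle i is on.
                config.put(toggleNames.get(i), (bits & (1 << i)) != 0);
            }
            result.add(config);
        }
        return result;
    }

    // The single configuration recommended above: all features on.
    static Map<String, Boolean> allOn(List<String> toggleNames) {
        Map<String, Boolean> config = new LinkedHashMap<>();
        for (String name : toggleNames) {
            config.put(name, true);
        }
        return config;
    }
}
```

Calling `all` with five names such as `List.of("a", "b", "c", "d", "e")` yields 32 configurations, and every added toggle doubles that number, while `allOn` always yields exactly one.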

Conclusion

In this article, I have described with concrete examples how to evolve first the process, then the data, and then the system to ultimately support the advantages of feature toggling and A/B testing. We have discussed reasons both for and against feature toggling, and I hope I have shared a bit of my enthusiasm for the topic.

I want to mention that there are even more details about feature toggling and a bunch of other technical excellence stuff in my book. With plenty of concrete examples and helpful advice, it is a great companion for developers of any seniority: