What Does the A/B Testing Process Look Like?
There’s much more to A/B testing than creating a variation of your original page and driving traffic to it. It involves careful planning, patience, and a little bit of statistics.
Let’s break it down step-by-step.
Step 1: Set conversion goals
Before you start testing just anything, it’s important to dive deep into your data. Why do you want to A/B test?
The answer to that question will come after you figure out how visitors are interacting with your page.
Does Google Analytics show they’re abandoning it immediately? You might want to try testing better message match.
Has heat mapping software indicated they’re not noticing your call-to-action button? You might want to test a bigger size, different location, or more attention-grabbing button color.
The following are just a few examples of tools you can use to gain better customer insight for your A/B tests:
- Google Analytics
- Instapage Analytics
- Heat maps
- User recordings
With an idea of how your visitors behave on your page, you can then formulate a pre-test hypothesis.
Step 2: Develop a hypothesis
Now it’s time to ask yourself, “What am I trying to improve with this A/B test?”
Let’s use the example above since it’s conceptually easy for beginners to grasp. If you’re changing your button color because you noticed that it’s not getting noticed, your hypothesis would look something like this:
“After viewing numerous user heat maps, I discovered that visitors aren’t noticing the call-to-action button. Therefore, I believe adjusting the color will draw more attention to it.”
Without a clear hypothesis, there is no clear goal of your test. But, how are you going to do that, and why are you making the adjustments you are?
To a beginner, “what is my goal?” seems so straightforward that it’s not worth asking. Most assume that the ultimate goal of any A/B test is to boost conversions, but that’s not always the case. Take this test from Server Density for example.
Instead of boosting conversions, their goal was to increase revenue. So, they A/B tested a new packaged pricing structure against their old one.
Here’s the old pricing structure, based on the number of servers a customer needed monitored by the company:
Here’s the variation — packaged pricing that started at $99 a month.
After testing one structure against the other, they found the original structure produced more conversions, but that the variation generated more revenue. Here’s the data to prove it:
While conversion rate is, many times, the most important metric to track in an A/B test, it’s not always. Keep that in mind when you’re determining your hypothesis and goal.
Step 3: Make your adjustments
After you’ve determined what you’re changing and why you’re changing it, it’s time to build your variation.
If it’s the layout of your original page (aka the “control” page) you’re adjusting, adjust it in your variation page. If it’s the effect of a new button color that you want to see, alter the color.
If you’re using a professional WYSIWYG (what you see is what you get) editor, you’ll be able to click to edit elements on your page without IT staff. If you’re not, you’ll need the help of a web designer with coding experience to set up your test.
Different types of A/B tests
Some marketers claim that to run a “true” A/B test, you can only test one element at a time — be it a headline, a button color, or a featured image, etc. That’s not entirely correct, though.
Testing one element at a time
Many conversion rate optimizers prefer to test only one element at a time for reasons of accuracy and simplicity, since it can be difficult to test multiple elements at once. When your variation page varies in only one way from your control page, then when one outperforms the other, the reason is clear why.
For example, in the hypothetical test above, if your control page and your variation page are identical except for a different colored button, then you know at the end of the test why one page beat the other. The reason was button color, like in this test from HubSpot:
Both pages are identical except for button color. The one with the red button converted better at the end of the test, so that mean the red button is likely the reason for the increase.
If you have the time, resources, and vast amount of traffic needed to conduct tests like this one, then go for it. If you’re like most businesses, though, you’ll want to alter more than one element per test.
Testing multiple elements at once
Unfortunately, it can take months to conclude just one A/B test. That being the case, many smaller businesses without thousands of visitors readily available will choose to change more than one page element at a time, then test the pages against each other.
Contrary to popular belief, this is not a multivariate test, because you’re still testing your control page (A) vs. your variation page (B). The difference between this method and multivariate testing is that once you do declare a winning page, you won’t know why it won. Some call it a “variable cluster” test. This one, conducted by the MarketingExperiments team, tested a control page against a variation with a different headline, more copy, and social proof.
The result was an 89% boost in conversions. However, because many elements were changed in this test from the original, it’s impossible for the MarketingExperiments team to know which new elements were most responsible for the change.
A multivariate test, on the other hand, can tell you how different elements interact with each other when grouped into one test. But it does require some extra steps that make the method a little more complicated than A/B testing. It also requires a lot more traffic.
If the idea of not knowing why one page performs better than the other bothers you, then you’ll want to test one element at a time or learn how to conduct a multivariate test.
But if you don’t have the time or resources to change just one element per test, then altering multiple elements per test may be the way to go.
Many digital marketing influencers, like Moz’s Rand Fishkin, recommend using this method to evaluate radically different page designs in order to find which is closest to the global maximum (the best possible landing page design). Testing singular page elements — trying to find which button color or headline converts best, for example — can help you improve your current page. However, by simply improving your current page, you assume that you were on the right track to begin with. Maybe there’s a completely different design that would perform better. In a blog post titled “Don’t Fall Into the Trap of A/B Testing Minutiae,” Fishkin elaborates:
“Let’s say you find a page/concept you’re relatively happy with and start testing the little things – optimizing around the local minimum. You might run tests for 4-6 months, eek out a 5% improvement in your overall conversion rate and feel pretty good. Until…
You run another big, new idea in a test and improve further. Now you know you’ve been wasting your time optimizing and perfecting a page whose overall concept isn’t as good as the new, rough, unoptimized page you’ve just tested for the first time.”
As the image above shows, testing a singular element usually yields a minimal lift, if any. It’s a lot of time for a little return. On the other hand, testing a radically different design centered around a new approach can yield a big conversion boost.
Take this Moz landing page for example, tested against a variation created by the team at Conversion Rate Experts:
The variation was six times longer and generated 52% more conversions! Here’s what the testers had to say about it:
“In our analysis of Rand’s effective face-to-face presentation, we noticed that he needed at least five minutes to make the case for Moz’s paid product. The existing page was more like a one-minute summary. Once we added the key elements of Rand’s presentation, the page became much longer.”
If they had simply tried to improve the existing elements on that original page with tests like button color vs. button color, or call-to-action vs. call-to-action, etc., then they would never have built a longer, more comprehensive, and ultimately higher-converting landing page.
Keep that in mind when you’re designing your test.
Step 4: Calculate your sample size
Before you can conclude your test, you need to be as close to certain that your results are reliable — aka, that your new page is actually performing better than the old one. And the only way to do that is by reaching statistical significance.
What is statistical significance?
Statistical significance is a measure of, to a degree of confidence, how sure you can be that your A/B test results aren’t a fluke. At an 80% level of significance (also referred to as “confidence level”) you can be 80% sure that your results are due to your change and not chance.
To reach statistical significance, you’ll need an adequate sample size — aka, number of visitors to your page — before you can conclude your test. And that sample size will be based on the level of significance you want to reach.
The more certain you want to be about the results of your test, the more visitors you’ll need to generate. To be 90% sure you’ll need to generate more visitors than you would need to be 80% sure.
Keep in mind, the industry-standard level of statistical significance is 95%. But many times professionals will run tests until they reach 96, 97, and even 99%.
While you can never be completely certain your new variation is better than your original, you can get as close as possible by running your test for as long as time allows. Even experts have noted many times in which their variation page was beating the original at a 99% level of significance, then lost to the original after the test was run for longer. Peep Laja describes one:
“The variation I built was losing bad — by more than 89% (and no overlap in the margin of error). Some tools would already call it and say statistical significance was 100%. The software I used said Variation 1 has 0% chance to beat Control. My client was ready to call it quits. However since the sample size here was too small (only a little over 100 visits per variation) I persisted and this is what it looked like 10 days later.
The variation that had 0% chance of beating control was now winning with 95% confidence.”
The longer you run your test, the bigger your sample size. The bigger your sample size, the more confident you can be about the results. Otherwise, Benny Blum notes with a simple example, a small sample size can deliver misleading data:
Consider the null hypothesis: dogs are bigger than cats. If I use a sample of one dog and one cat – for example, a Havanese and a Lion – I would conclude that my hypothesis is incorrect and that cats are bigger than dogs. But, if I used a larger sample size with a wide variety of cats and dogs, the distribution of sizes would normalize, and I’d conclude that, on average, dogs are bigger than cats.
Either way, to figure out how many visitors you’ll need to generate before you can conclude your test, you’ll need to know a few more things — like baseline conversion rate and minimum detectable effect.
What is your baseline conversion rate?
What’s the current conversion rate of your original page? How many visitors click your CTA button compared to how many land on your page?
To figure it out, divide the number of people who have converted on your landing page by the total number of people who have visited.
What is your desired minimum detectable effect?
Your minimum detectable effect is the minimum change in conversion rate you want to be able to detect.
Do you want to be able to detect when your page is converting 10% higher or lower than the original? 5%? The smaller the number, the more precise your results will be, but the bigger sample size you’ll need before you can conclude your test.
For example, if you set your minimum detectable effect to 5% before the test, and at its conclusion your testing software says that your variation page is converting 3% better than your original, you can’t be confident that’s true. You can only be confident of any increase or decrease in conversion rate that exceeds 5%.
Now you can punch in both these numbers, along with your desired level of confidence (we recommend at least 95%), and the calculator you use will generate a number of visitors you need to reach before you can conclude your test.
However, reaching that number of visitors doesn’t always mean it’s safe to conclude your test. More on that next.
Step 5: Eliminate confounding variables
It’s important to remember that you’re not running your test in a lab. And because of that, there are outside factors that threaten to poison your data and ultimately the validity of your test. Failing to control for these things to the best of your ability can end in a false positive or negative test result.
Things like holidays, different traffic sources, and even phenomenons like “regression to the mean” can make you see gains or losses where there really are none.
This is a step that spans the entirety of your test. Validity threats will inevitably pop up before you reach statistical significance, and some of them you won’t even notice. To learn about a few of the most common to watch for, see Chapter 4.
The difference between validity and reliability
If you’re reading along wondering, “Reliability? Validity? Aren’t those the same thing?”, allow us to clarify.
Reliability comes from reaching statistical significance. Once you’ve generated enough traffic to form a representative sample, you’ll have attained a reliable result.
In other words, if you were to run your A/B test again would you get the same result? Would your variation still beat your original? Or would that original beat the variation?
A test that reaches 95% significance or higher is said to be reliable. The outcome will likely be the same even if you run the test again.
Validity, on the other hand, has to do with the setup of your test. Does your experiment measure what it claims to measure.
In our earlier example — when you set up a test that aims to measure the impact of button color, does it actually? Or could your test results be attributed to a factor outside of button color?
Even if your variation page is beating your control at a 99% level of significance, it doesn’t mean the reason it’s winning is your button color.
Step 6: QA everything
The test before the test is one of the most crucial steps in this process. Make sure all the links in your ads drive traffic to the right place, that your landing pages look the same on every browser and that they’re displaying properly on mobile devices.
Now test the back end of your page. Fill out your form fields and make sure the lead data is being passed through to your CRM. Are pixels from other technologies firing? Are the other tools you’re integrating with registering the conversion?
Forgetting to check these and realizing later that your links are dead or your information isn’t being properly sorted can completely ruin your test. Don’t let it happen to you.
Step 7: Drive traffic to begin your test
Now it’s time to begin your test by sending visitors to your page. Remember, both your pages’ traffic sources should be the same (unless you’re A/B testing traffic sources, of course). An email subscriber will behave differently on your page than someone who’s unfamiliar with your brand. Don’t drive email traffic to one and PPC traffic to another. Make sure you’re driving the same type of traffic to both your control and variation, otherwise you risk threatening your test’s validity.
Step 8: Let your test run
Be patient. Even if you have the massive amount of traffic required to reach statistical significance in a matter of days, don’t conclude your test just yet.
Days of the week can impact your conversion rate. People visiting your landing page on Monday and Tuesday can behave differently than ones visiting it on Saturday and Sunday.
Let it run for at least three weeks, and test in week-long cycles. If you start your test on Monday and you reach statistical significance three weeks later on a Wednesday, continue driving traffic to your pages until the following Monday.
And remember, the longer you run your test, the more reliable your results will be.
Step 9: Analyze the results
Once you’ve reached statistical significance, you’ve let your test run for at least three weeks, and you’ve reached the same day of the week you started on, it’s safe to call your experiment.
Remember, though, the longer you run it, the more accurate the results will be.
What happened? Did your variation dethrone your control? Did your original reign supreme? Did you eliminate major confounding variables, or could your results be attributed to something other than what you set out to test?
Regardless of those answers, the next step is the same.
Step 10: Keep testing
Just because you improved your conversion rate doesn’t mean you found the best landing page. More tests have the potential to produce bigger gains.
If your variation lost to your control, the same thing goes. You didn’t fail, you just found a design that doesn’t resonate with your audience. Figure out why, and use that knowledge to inform your tests going forward.