- You don’t know about false positives
- You’re not checking your segments
- You’re testing too many variables at once
- You’re giving up on a test after it fails the first time
You don’t know about false positives
Are you aware that there are actually 4 outcomes to an A/B test? "What do you mean? It’s either a win or a loss, no?" Nope, it can be:
- False positive (you detect a winner when there are none)
- False negative (you don’t detect a winner when there is one)
- No difference between A & B (inconclusive)
- Win (either A or B converts more)
- You do “cascade testing”, i.e. A vs B, then B vs C, then C vs D, … THIS IS BAD, DON’T DO IT. We’ll see why in a second.
- You do A/B/n testing, meaning you test all variations in parallel.
Imagine you want to test a different headline for a product page. You have your current one (A) against the new one (B). B wins, but your boss doesn’t like the wording and wants you to try a slightly different one. Then you feel like you could do better and change it again. And again. You end up testing 10 different variations of this headline. How is that a problem? Let’s take a look: A vs B gave B as the winner with 95% statistical significance.
As we saw in a previous article, it means that there is a 5% chance this result is a complete fluke, or a “false positive”. Then you tested a third headline, B vs C. C also won with 95% significance. The problem is that the chance of a false positive compounds with the previous test. Your second test winner, C, actually has about a 10% chance of being a false positive (1 - 0.95^2 = 9.75%).
After 10 tests on your headline (C vs D, D vs E, …), even with 95% significance on your tenth test, you actually have a 40% chance of your winner being a false positive! (For 41 variations it becomes 88%!!!) You’d be flipping a coin. Or deliberately shooting yourself in the foot, depending on how many times you repeat this.
Don’t do cascade testing. Just don’t. Okay? Kittens will die if you do. Look at him, we don’t want that, do we?
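If you want to check the compounding yourself, here is a minimal sketch. It assumes every test in the cascade is independent and is called at the same significance level; the function name is made up for this example.

```python
def chance_of_false_positive(num_tests: int, significance: float = 0.95) -> float:
    """Chance that at least one test in a cascade of `num_tests` was a fluke.

    Assumes independent tests, each called at the same significance level.
    """
    return 1 - significance ** num_tests


for n in (1, 2, 10, 41):
    print(f"{n:>2} tests in a row: {chance_of_false_positive(n):.0%}")
# Prints 5%, 10%, 40% and 88% respectively.
```

Each extra test multiplies the chance that everything so far was a real result by 0.95, which is why the risk climbs so quickly.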
A/B/n Testing is when you test n number of variations instead of just one (B) against your control (A). Meaning you have your control A, against variation B, C, D, E, F, etc. at the same time, in the same conditions. This is absolutely fine.
BUT, as we saw in our article on when to stop your A/B tests, you need at least 300 conversions PER variation before you can call the test. In our Google example, you would need 41 x 300 = 12,300 conversions. That’s a lot. If you have Google-like traffic, it’s okay. If you’re like us mere mortals though, this is a big fat waste of time. You could even end up testing for too long and get skewed results. This kind of test is rarely needed and can often be avoided completely by having a better hypothesis.
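To get a feel for what that means for your own site, here is a rough sketch. The 300-conversion floor comes from the rule of thumb above; the 50 conversions per day is a made-up figure, and the function names are only illustrative.

```python
import math

MIN_CONVERSIONS_PER_VARIATION = 300  # rule of thumb quoted above


def conversions_needed(num_variations: int) -> int:
    """Total conversions required before an A/B/n test can be called."""
    return num_variations * MIN_CONVERSIONS_PER_VARIATION


def days_needed(num_variations: int, conversions_per_day: float) -> int:
    """Days of testing required at a given (assumed constant) conversion volume."""
    return math.ceil(conversions_needed(num_variations) / conversions_per_day)


print(conversions_needed(41))                    # 12300
print(days_needed(41, conversions_per_day=50))   # 246 (hypothetical traffic)
print(days_needed(2, conversions_per_day=50))    # 12
```

With those assumed numbers, a plain A vs B test is done in under two weeks, while the 41-variation test runs for most of a year.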
You’re not checking your segments
Don’t make Avinash Kaushik sad (he’s one of the fathers of web analytics, if you’re wondering). He has a rule: “Never report a metric without segmenting it to give deep insights into what that metric is really hiding behind it.” Most data you get from your analytics tool is aggregated data. It takes all your traffic and mashes it into pretty but absolutely not actionable graphs. Your website has a number of functions, and your visitors come with different objectives in mind. And even when they come for the same reason, they probably don’t need the same content.
If you want an effective website, you can’t treat your traffic as a faceless blob: you need to segment. The same applies to your test results. If you don’t segment them, you could be wrongly dismissing tests. Your variation could be losing overall but winning on a particular segment. Be sure to check your segments before closing the book on a test! Important side note: when checking segments in an experiment result, don’t forget that the same rules of statistical validity apply. Before declaring that your variation won on a particular segment, check that you have enough conversions and a large enough sample size on that segment. Here are three ways you can segment your data:
1. By source: Where do your visitors come from (ads, social networks, search engines, newsletter, …)? Then you can look at things like: what pages they go to depending on where they come from, their bounce rate, differences in loyalty, whether they come back…
2. By behavior: What do they do on your website? People behave differently depending on their intent and needs. You can ask: what content do people visiting your website 10+ times a month read vs those only coming twice? What page did people looking at 5+ pages in a visit arrive on vs people who just looked at one? Do they look at the same products / price range?
3. By outcome: Segment by the actions people took on your website: bought a product, subscribed to a newsletter, downloaded a premium resource, applied for a loyalty card, … Make groups of visitors with similar outcomes and ask the same type of questions we asked above. You’ll see what campaigns worked, what products to kill, etc. By segmenting you get actionable data and accurate results. With actionable data and accurate results you can make informed decisions, and with informed decisions … $$$$!
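Here is a small sketch of what checking segments on a test result could look like. All the numbers are made up, the segment names are just examples, and the check only covers the 300-conversion floor mentioned earlier. It does not replace a proper significance test on each segment.

```python
MIN_CONVERSIONS = 300  # per variation, per segment (rule of thumb, assumed here)

# Hypothetical results: segment -> variation -> (visitors, conversions)
results = {
    "search":     {"A": (12000, 480), "B": (12100, 380)},
    "newsletter": {"A": (6000, 310),  "B": (5900, 390)},
    "social":     {"A": (900, 20),    "B": (950, 35)},
}


def summarize(segment_results):
    """For each segment: conversion rates, current leader, and whether
    every variation has reached the minimum number of conversions."""
    summary = {}
    for segment, variations in segment_results.items():
        rates = {name: conv / visitors
                 for name, (visitors, conv) in variations.items()}
        enough = all(conv >= MIN_CONVERSIONS
                     for _, conv in variations.values())
        summary[segment] = {
            "rates": rates,
            "leader": max(rates, key=rates.get),
            "enough_data": enough,
        }
    return summary


summary = summarize(results)
for segment, row in summary.items():
    note = "" if row["enough_data"] else " (not enough conversions yet)"
    print(f"{segment}: {row['leader']} leads{note}")
```

In this invented data set, B leads on the newsletter segment with enough conversions to look into it further, while its lead on social rests on far too few conversions to mean anything yet.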
You’re testing too many variables at once
You got the message: you need to test high-impact changes. So you change the CTA and the headline, add a video and a testimonial, and change the text. Then you test it against your current page. And it wins. Good, right? Well… Not really. How will you know which of your changes improved conversions on your page and which dragged them down? This is where the question “How will I measure success?” takes on its full meaning. Testing is awesome, but if you can’t really measure what happened, what moved the needle, it’s not so useful. What have you learned?
That some combination of your changes improved conversions? What if one of those changes positively impacted conversions and the others dragged them down? You could just as easily have counted the test as a failure when it wasn’t one. Make sure to clearly specify what success looks like and that you’re set up to measure it. If you can’t measure it, you won’t learn. If you don’t learn, you can neither reproduce nor improve it. Don’t test multiple variables at once. Unless you know how to do multivariate testing, then it’s fine. But as it requires a gigantic amount of traffic, we rarely see it used.
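To see why multivariate testing is so hungry for traffic, here is a back-of-the-envelope sketch. It assumes a full-factorial test (every combination of changes gets its own variation) and reuses the 300-conversions-per-variation rule of thumb, which is an assumption on our part for this kind of test. The element names are only illustrative.

```python
import math

MIN_CONVERSIONS_PER_VARIATION = 300  # rule of thumb, assumed to carry over


def multivariate_cost(options_per_element):
    """Return (number of combinations, total conversions needed)
    for a full-factorial multivariate test."""
    combinations = math.prod(options_per_element.values())
    return combinations, combinations * MIN_CONVERSIONS_PER_VARIATION


# Five elements, each with two versions (current vs changed).
elements = {"cta": 2, "headline": 2, "video": 2, "testimonial": 2, "body_text": 2}

combos, conversions = multivariate_cost(elements)
print(combos, conversions)  # 32 9600
```

Five simple on/off changes already mean 32 page versions to fill with conversions, and every extra element doubles that number.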
You’re giving up on a test after it fails the first time
If you followed our guidelines on how to craft an informed hypothesis, each of your tests should be derived from one of the following (ideally a combination of several):
- Web Analytics
- Usability tests
- User interviews
- Heuristic analysis
- You could add testimonials
- You could remove information not relevant to the product
- You could add a video
As you now know not to do cascade testing (which is completely different from iterative testing, because in iterative testing you don’t test X versions of the same headline/picture against the winner of a previous test), or test everything at once, you can embrace iterative testing. There isn’t just ONE solution to a given problem. There are an infinite number of them, and the answer could very well be a combination of several solutions.
Let’s be a tad extreme to illustrate this. Your internet cuts off. What do you do? If you’re plugged in through an ethernet cable, maybe you try unplugging and re-plugging it. If it doesn’t change anything, do you then conclude that your cable is dead and go buy a new one? Or do you rather try plugging it into another computer, check your router, restart your computer, check your drivers, …? Same thing with your A/B tests. Don’t give up or jump to conclusions as soon as something doesn’t work.
Look for other solutions and test again, and again. Okay, you are now aware of several ways you could have been misinterpreting your A/B test results. We’re making progress! Next time, we’ll take a look inside our brains, how they could be playing tricks on us and jeopardizing our A/B tests *cue spooky music*.
PS: Before you go, a couple of things I’d like you to do:
- If this article was helpful in any way, please let me know. Either leave a comment, or hit me up on Twitter @kameleoonrocks.
- If you’d like me to cover a topic in particular or have a question, same thing: reach out.