1. Not calculating sample sizes up front
An oldie but a goodie. If you don’t pre-calculate the amount of traffic and conversions you need before you can call a test, there will always be the temptation to stop it early.
Stopping tests early means your statistical significance calculations are worthless. Why, you might ask? See below.
If you need a good sample size calculator, you can use this one: https://www.evanmiller.org/ab-testing/sample-size.html
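If you would rather see the sums than trust a black box, here is a minimal Python sketch of the textbook two-proportion sample size formula. The function name and the example inputs (5% baseline, 20% relative uplift, 80% power) are illustrative assumptions, and the result may not match the calculator linked above to the visitor, so treat it as a sanity check rather than the definitive method.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_mde, alpha=0.05, power=0.8):
    """Visitors needed in EACH variant for a two-sided two-proportion test.

    baseline_rate: current conversion rate, e.g. 0.05 for 5%
    relative_mde:  smallest relative uplift you care about, e.g. 0.2 for +20%
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # statistical power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Hypothetical inputs: 5% baseline conversion, +20% relative uplift.
    print(sample_size_per_variant(0.05, 0.20))
```

Divide the result by the traffic you typically see per variant and you have a planned run time before the test even starts.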
2. Stopping tests early
The cardinal sin of AB testing. If you stop a test early, either because it’s doing great or because it’s tanking, you increase the risk of false positives and false negatives.
Let’s say you have an A and a B variant. On day one you get 1,000 visitors in each variant. A has 100 conversions and B has 200. B has a statistically significant chance of winning at greater than 99%, so you can stop now, right? Wrong. The next day those figures could easily be reversed. That’s why we plan how long we need to run tests for, based on the amount of traffic we typically observe.
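To see why peeking hurts, here is a rough Python simulation of A/A tests, where both variants share the same true conversion rate and every "winner" is a fluke by definition. It compares checking significance once at the planned end date with checking every day and stopping at the first significant result. The traffic figures, conversion rate, seed and function names are all made-up assumptions for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(conv_a, conv_b, n, alpha=0.05):
    """Two-sided two-proportion z-test with n visitors in each variant."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False
    z = (conv_b - conv_a) / n / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha


def simulate_peeking(experiments=300, days=14, daily_visitors=200,
                     rate=0.10, seed=1):
    """Return (fixed-horizon, daily-peeking) false positive rates for A/A tests."""
    rng = random.Random(seed)
    fixed_horizon = 0  # significant on the planned final day
    stopped_early = 0  # significant on ANY day, i.e. you would have stopped
    for _ in range(experiments):
        conv_a = conv_b = 0
        flagged = False
        significant_now = False
        for day in range(1, days + 1):
            conv_a += sum(rng.random() < rate for _ in range(daily_visitors))
            conv_b += sum(rng.random() < rate for _ in range(daily_visitors))
            significant_now = is_significant(conv_a, conv_b, day * daily_visitors)
            flagged = flagged or significant_now
        fixed_horizon += significant_now
        stopped_early += flagged
    return fixed_horizon / experiments, stopped_early / experiments


if __name__ == "__main__":
    fixed, peeking = simulate_peeking()
    print(f"fixed horizon: {fixed:.1%}  daily peeking: {peeking:.1%}")
```

The fixed-horizon rate should sit around the nominal 5%, while the peeking rate can only ever be equal or higher, because every test that is significant on the last day also counts as a peek.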
3. Stopping tests too late
So as with the above, this increases your chances of false positives. Typically what we see is people saying they are “close” to significance and running a test longer than they need to in order to reach the 95% threshold.
It would be better to see the 95% threshold as less of a goal and more of an indicator. For example, you wouldn’t set bounce rate as a goal, but it can be indicative when used in the right context.
4. Not checking sample ratio mismatch
Sample ratio mismatch (SRM) is another indicative measure. If you have it, your results are unreliable. Essentially, it is when you designed the experiment with a 50/50 split (or some other fixed split) and the split you actually observe does not match the experiment design.
So let’s say we run a strict 50/50 A/B test and we see that A has 10,560 visits and B has 10,002. It looks roughly equal, but the mismatch is actually large enough to invalidate your results.
What can cause this? Randomisation failure, targeting issues, bots, etc. There are many reasons, but if you see this in a test result it means you can’t trust the data and that you have an underlying issue.
Below is a way to check your results for SRM.
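One common approach is a chi-square goodness-of-fit test on the visit counts. The sketch below is a minimal Python version for two variants. The function names are my own, and the 0.001 cut-off is an assumption you should adjust to your own tolerance.

```python
from math import erfc, sqrt


def srm_p_value(visits_a, visits_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value for a two-variant traffic split."""
    total = visits_a + visits_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visits_a - expected_a) ** 2 / expected_a
            + (visits_b - expected_b) ** 2 / expected_b)
    # With two variants there is 1 degree of freedom, where the chi-square
    # survival function reduces to erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


def has_srm(visits_a, visits_b, expected_share_a=0.5, threshold=0.001):
    """Flag a mismatch when the observed split is very unlikely by chance."""
    return srm_p_value(visits_a, visits_b, expected_share_a) < threshold


if __name__ == "__main__":
    # The 50/50 example from the text: 10,560 vs 10,002 visits.
    print(srm_p_value(10560, 10002), has_srm(10560, 10002))
```

A very small p-value means a split this lopsided is very unlikely under the designed ratio, which is your cue to investigate before reading anything into the conversion numbers.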
5. You don’t QA tests before they go live
One I have seen time and time again. When I say QA, I don’t mean you check your phone and laptop and call it a day. I mean full device and browser testing that covers the majority of your traffic.
I recently encountered an issue where a client’s whole site was broken in the latest version of Chrome for Android. If this can happen to whole sites, it can certainly happen to AB tests.
6. Not checking for outliers in test results
This one can be especially egregious when you have AOV, revenue or any non-binomial metric. A large outlier can massively throw off a result.
I once had a situation where a test lost on first pass over the data, but on closer inspection the control had a £2,000+ order in it that massively skewed the results, as the AOV was less than £200.
The same thing can happen if you are using items ordered, items added to basket, etc. If you have one user that adds 200 items to their basket, you’re in trouble.
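One way to handle this is to cap (winsorize) extreme values before comparing averages. Here is a minimal Python sketch. The order values are invented to mirror the situation above, and capping at the 99th percentile is an assumption rather than a rule, so pick a cut-off that suits your data.

```python
from math import ceil
from statistics import mean


def winsorize_upper(values, percentile=99):
    """Cap every value above the given percentile at that percentile's value."""
    if not values:
        return []
    ordered = sorted(values)
    cap_index = max(0, ceil(len(ordered) * percentile / 100) - 1)
    cap = ordered[cap_index]
    return [min(v, cap) for v in values]


# Hypothetical order values in pounds: 99 typical orders plus one huge one.
control_orders = [150] * 99 + [2000]
variant_orders = [160] * 100

if __name__ == "__main__":
    print("raw AOV:   ", mean(control_orders), "vs", mean(variant_orders))
    print("capped AOV:", mean(winsorize_upper(control_orders)),
          "vs", mean(winsorize_upper(variant_orders)))
```

With these made-up numbers the control "wins" on raw AOV (168.5 vs 160) purely because of the single £2,000 order, and loses once that order is capped (150 vs 160). Whatever rule you use, decide it before you look at the results.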
7. Not running for full business cycles
This rule is sometimes hard to implement. For some clients, purchasing cycles can be months rather than days, in which case the issue is the choice of KPI.
Let’s say I’m optimising a car leasing business. Typically car leases are 2-3 years long and, as it is a large purchase, the lead time from awareness to conversion might be months.
However, micro goals such as performing a search, viewing a product detail page or running benefit-in-kind calculations are all key steps in the purchasing cycle, and I can use them instead.
These can then be mapped out to show how users typically behave and can be tested for.
8. Not validating tests after they have been implemented
This one is a little controversial. So let’s say you did some testing on the homepage, surfacing some key content. The test performed well, so you implement it and then forget about it.
However, homepage performance remains static. What gives? There could be a number of reasons: the marketing mix has changed, user needs have changed, the way it has been implemented is suboptimal, or the fact that it’s hardcoded and there is no variant flash means people don’t notice the change anymore.
Unless you are constantly testing you will never know.
9. Using too low a significance level
So you struggle to get 95% significance, so you drop it to 90%. It’s only a 5% difference, right? No biggie. The only issue is that you have doubled your error rate. You have taken your false positive rate from 1/20 to 1/10, so yes, you have more winners, but you also have more flukes. Not good.
10. Annualising revenue uplifts
So this is a combination of all of the above. You had 20 winning tests, each making an average of 50k a month extra, so that’s an extra million a month, or 12 million a year. If you believe that, then I have a bridge to sell you.
In essence, what you are doing is compounding uplifts while ignoring statistical flukes, seasonality, poor test implementations, etc.
This will lead to you massively overestimating uplifts in revenue. If you need to prove what your optimisation efforts delivered, then give a range with a high and a low figure, and the truth will be somewhere in the middle.
11. Not taking into account the test flash when using client side tools
So all client side tools will cause a flash. Depending on the test and the tool, this is either super noticeable or not noticeable at all.
However, as we are programmed from an early age to spot the difference between two images, imagine the effect this has on your tests.
For example, you are testing special offer labels on a category page: the variant has them and the control has none. The page loads and for a split second you see no labels, then the page flashes and the labels load. Suddenly you’re drawn to the special offer products and you buy one. Without the flash you may not have bought, or you may have bought something else completely.
So be careful of the flash and take that into the equation when looking at the results.
12. Too many KPIs
So you run a test and have one KPI with 95% confidence. You run another test that has 3 KPIs at 95% confidence. The latter test will have a false positive rate of roughly 3/20 (about 14% if the KPIs are independent), unless you do something to correct for it.
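The arithmetic behind this, plus the simplest correction (Bonferroni), can be sketched in a few lines of Python. The function names are my own, and the first calculation assumes the KPIs are independent, which real KPIs rarely are, so read it as a rough guide.

```python
def family_wise_error_rate(alpha, n_kpis):
    """Chance of at least one false positive across independent KPIs."""
    return 1 - (1 - alpha) ** n_kpis


def bonferroni_alpha(alpha, n_kpis):
    """Per-KPI threshold that keeps the overall error rate at or below alpha."""
    return alpha / n_kpis


if __name__ == "__main__":
    print(family_wise_error_rate(0.05, 3))  # uncorrected, 3 KPIs
    per_kpi = bonferroni_alpha(0.05, 3)
    print(per_kpi, family_wise_error_rate(per_kpi, 3))  # corrected
```

Bonferroni is conservative, so it costs you some power. The cheaper fix is to have fewer KPIs in the first place.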
13. Optimising the wrong KPIs
This is one that everyone does, myself included. Your primary KPI should be what you have the most impact on during the test. If you are testing a product page, you can impact add to basket in a strong way but have only a weak impact on overall sales.
In this case add to basket might be my primary and sales might be my secondary. Unless of course what I’m testing is a promotional price or something very salesy.
14. Not using solid evidence for tests
You know what this is: copying the competitor, HiPPO ideas, or just testing something because you’re the expert and you know what works. It’s a crapshoot. Sometimes you will get lucky, sometimes you won’t.
15. Testing small stuff
I see this fairly often with marketing people and first-time testers. You change button colours, images and headings but forget about user problems, UX issues, etc.
Unless there’s a huge problem, like a blue button on a light blue background, odds are it’s not going to shift the needle enough to detect.
Remember your sample size calculations are tied to your minimum detectable effect. The smaller the effect, the larger the sample size and the longer the test.
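To put rough numbers on that, the Python sketch below uses Lehr’s rule of thumb (about 16 times the variance divided by the squared difference, for 80% power at 5% two-sided significance) and turns the result into a run time. The baseline rate, weekly traffic and function names are hypothetical, and a rule of thumb is no substitute for a proper calculation.

```python
from math import ceil


def approx_sample_size(baseline_rate, relative_mde):
    """Lehr's rule of thumb: visitors per variant (80% power, 5% two-sided)."""
    delta = baseline_rate * relative_mde           # absolute difference to detect
    variance = baseline_rate * (1 - baseline_rate)
    return ceil(16 * variance / delta ** 2)


def weeks_needed(sample_per_variant, weekly_visitors, variants=2):
    """Whole weeks of traffic needed to fill every variant."""
    return ceil(sample_per_variant * variants / weekly_visitors)


if __name__ == "__main__":
    # Hypothetical site: 5% baseline conversion, 20,000 test visitors a week.
    for mde in (0.20, 0.10, 0.05):
        n = approx_sample_size(0.05, mde)
        print(f"MDE {mde:.0%}: {n} per variant, {weeks_needed(n, 20000)} weeks")
```

Halving the effect you want to detect roughly quadruples the sample you need, which is why button-colour tests so often run for months and still end flat.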
16. Not analysing your test results in analytics
Hands up who still uses their testing tool of choice for analysis. If you do, then put your hand down and listen up.
STOP IT. NOW.
It’s like using your phone’s GPS to navigate at sea. Sure, it does roughly the same job as a proper maritime navigation system.
The difference could be between ending up on a sandbank waiting for high tide and cruising to a deserted island for a picnic.
In your testing tool you can see everyone who was part of the test. So let’s say your test targeted a below-the-fold feature on the homepage. You targeted the test on the homepage and see that the number of users who clicked the CTA in the feature was flat.
However, you check in analytics, segment for users that actually saw the element, and see there was a massive spike for your variant.
That’s the difference.
17. Not taking into account ITP
If you don’t know what this is, go read this. I’ll wait.
Right, so now you know. And if you still don’t: it’s bad news for AB testing, because client side testing relies on cookies.
If your cookies are getting periodically wiped every few days then users could see multiple different versions of an experiment during the testing period.
In which case we need to think about session-based conversions.
18. Extrapolating test results from one market to another
So you have a multinational Ecom site. You run a test on the checkout to change, let’s say, the default payment options, and it’s a resounding success.
Roll out worldwide, right? Not so easy. Each market has very different customer expectations. For example, in India the default payment tends to be payment on delivery of goods or services. What you need to do is find what’s optimal in each market.
Now, not all markets will have enough traffic to test a fairly small change like this, so you might have to run three or four checkout tests and then roll them up together.
My top tip would be to look at the biggest Ecom provider in each country you sell in and use them as a benchmark. Now, not all competitors will be optimising, but I would be surprised if an Amazon or other large market player is not.
19. Not having a documented process
So you start a new role and find dozens of tests live with no documentation as to what each test is, what it is meant to be doing and when it’s meant to end. Fun times.
This usually happens for one of two reasons: there is no documented process, or people are not following the process.
By documenting a process and having it somewhere everyone can see it, everyone can be made to follow it. No process, no test. Simple.
20. Having the person who suggested the test analyse the test
This falls under cognitive bias. We all suffer from confirmation bias: we all want to find evidence that proves we are right, and we can easily ignore what proves we are wrong.
So when you can segment a test result a dozen different ways, it becomes very easy to find a segment that is statistically significant.