Everyone loves the story about how a scrappy upstart changed the color of a button on its landing page and suddenly increased its conversion rate by 200%. Stories like that make for great headlines. They also spawn copycats naïvely trying to replicate that anomaly in completely different contexts.
Do you remember when Dustin Curtis ran an experiment trying different ways of wording a link to his twitter account? He noted that writing “You should follow me on twitter” had a greater conversion rate than just “I’m on twitter.” In response, blogs and websites all over the world began demanding I follow them on twitter in increasingly imperative tones. Did that trick actually increase anyone’s twitter following? I have no idea, and neither do most of the people who tried it. It seems like a “tip” you could easily port to another context and expect similar results. In reality, there is no way to predict what effect any change will have, positive or negative.
Incremental changes that lead to huge jumps in performance are rare. To be more specific, incremental changes that lead to statistically valid jumps in performance are rare. Most websites do not get anywhere near enough traffic for the data collected in such an experiment to mean anything. It is easy to claim that no two snowflakes are identical when you’ve only looked at a few. It will take a couple of blizzards before you can really say so with confidence.
Even sites that do have significant traffic have difficulty conducting valid experiments. If you don’t come from a math or statistics background, one of the first things you’ll notice about the literature around optimization and testing is how fussy everything is. If you are going to run successful experiments you have to acquire a fussy sensibility.
By knowing what kinds of mistakes and weaknesses have caused other experiments to fail, you can develop a sixth sense for the traps and how to avoid them. Below I’ve listed several common traps that can obscure your real results and leave you spinning like a cat chasing a red laser dot across the floor.
Improperly Segmenting Traffic
To run a valid experiment you have to isolate the variable(s) you are measuring. Differences in the time of day, the referral source, browser characteristics, bandwidth and other variables can skew your results in unpredictable ways.
To counter that noise it is important to first “test the null hypothesis.” To do that, you construct an experiment that divides your web traffic into segments but provides the same experience to every visitor in either segment. The null hypothesis states that if you don’t change anything, then nothing will be different.
If you divide your traffic this way and still see significant differences in your metrics, you have rejected the null hypothesis. As the Wikipedia article and an astute commenter on Hacker News point out, you will never “prove” the null hypothesis. You can either reject it or fail to reject it.
“It is important to understand that the null hypothesis can never be proven. A set of data can only reject a null hypothesis or fail to reject it. For example, if comparison of two groups (e.g.: treatment, no treatment) reveals no statistically significant difference between the two, it does not mean that there is no difference in reality. It only means that there is not enough evidence to reject the null hypothesis (in other words, the experiment fails to reject the null hypothesis).” (Wikipedia)
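If you want to see what checking a null-hypothesis (A/A) split looks like in code, here is a minimal sketch in Python. It assumes you have conversion counts for two groups that received the identical page and that scipy is available; the counts themselves are made up for illustration.

```python
# Minimal A/A sanity check: both groups saw the exact same page, so a
# "significant" difference points at a flaw in bucketing or logging,
# not at a real effect. Counts below are hypothetical.
from scipy.stats import chi2_contingency

group_a = {"visitors": 5120, "conversions": 214}
group_b = {"visitors": 5087, "conversions": 236}

table = [
    [group_a["conversions"], group_a["visitors"] - group_a["conversions"]],
    [group_b["conversions"], group_b["visitors"] - group_b["conversions"]],
]

chi2, p_value, _, _ = chi2_contingency(table)

if p_value < 0.05:
    # Rejecting the null hypothesis in an A/A test means something is leaking.
    print(f"p = {p_value:.3f}: check your segmentation, something is skewing the split.")
else:
    print(f"p = {p_value:.3f}: failed to reject the null hypothesis, as expected.")
```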
If you have rejected the null hypothesis in such a test, you may suppose that you have either somehow up-ended the laws of causality or screwed something up. No offense, but my money is on the screw-up. Finding the leak is a matter of combing through the observations for unusual results and combing through your code for a logical failing or hidden dependency that favors one result over another.
If your logic is sound yet there are still unexplained differences between the control and treatment, you may want to look at the outliers and see if there is a common thread.
For a web application, browser differences are common culprits. From rendering issues to JavaScript execution quirks, the mix of browsers accessing your site is a mass of conditions you have to control for. Services like browsershots.org are essential for spotting obvious visual differences and errors. Use your own traffic analysis to see which browsers are worth optimizing for.
Another hotspot is bandwidth. The speed of a page load is a more significant variable than most people assume. Google and Amazon have both published reports showing how even sub-second delays can result in outsized conversion losses. The best way to address this issue is to make your site run as fast as possible.
Use caching, CDNs, asynchronous scripts and the other tools available to mitigate this factor. You may still wish to collect page load times to measure their impact and to control for time-bound anomalies like a slow-loading third-party script or slow, unreliable user connections.
Misunderstanding Randomness
“That’s so random” is one of the more annoying clichés in recent circulation. In popular parlance the word “random” often means something between “unexpected” and “unusual.” For our purposes we will stick with the more rigorous definition: selected without aim or reason. Randomness is hard to understand but easy to fake. Our brains are very good at picking up short patterns; recognizing long patterns, however, is where we fail. Our stone age brains tap out a few digits into any jumbled-up looking string.
For a small experiment, fake randomness is likely impossible to detect simply because its effects are smaller than the margin of error. In the same way, a ship with a broken compass can get across a harbor without going far off course, but it could end up in the wrong hemisphere if it sets out on a trans-oceanic journey.
Pseudo-randomness often creeps in when the assignment algorithm uses a value like the visitor’s id to seed the decision. If you assign a visitor to a treatment with an algorithm like $treatment = $visitor_id mod 2, you are introducing outside information that will synchronize the treatments in unpredictable ways. What you are then testing is not a random group A against a random group B, but a selected group A against the resultant group B. The selection criteria may seem meaningless, but they will invariably warp your results.
Solving the randomness problem, as you can see, is very difficult. Fortunately there are plenty of mathematicians, computer scientists and careful coders out there who have built tools to deal with it for you. It’s well out of the scope of this article to describe specific implementations but I can say that the one you wrote sucks. Randomness, like encryption, is best left to a well-supported library.
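As a rough illustration of what “leave it to a library” means, here is a sketch that leans on Python’s standard random module instead of arithmetic on the visitor id. The function name and storage details are placeholders, not any particular framework’s API.

```python
# Assignment sketch: the treatment is chosen by a well-tested PRNG rather
# than derived from the visitor id, so it carries no outside information.
import random

def assign_treatment(treatments=("control", "variant_b")):
    """Pick a treatment uniformly at random, independent of any visitor attribute."""
    return random.choice(treatments)

# Store the result the first time you see a visitor (cookie, session, database)
# so they keep seeing the same treatment on later visits.
assignment = assign_treatment()
print(assignment)
```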
Mixing Experiment Factors
It is hard enough to test one variation. Mixing multiple factors and trying to divine actionable intelligence from the results requires an enormous amount of data and a mastery of math and statistics that is rarely seen outside research universities or the Googleplex.
That’s not to say you can’t test widely varying treatments against each other. The comparison you can make is between the treatments as a whole, not specific elements.
For example, let’s say you are selling widgets online. You have a landing page that has produced mediocre results. You and your team sit down and brainstorm to come up with a list of changes that you could try. Perhaps rewording the headline? Changing the button text? What about adding a picture near the form? What if we removed the last question on the form? Why not Zoidberg?
Having produced such a prodigious list, it is tempting to get in there and really mix things up. If you have a list of 100 things to change, hearing that you only have enough traffic for one experiment a month is depressing. “If we’re clever,” you may think, “and really careful, we could make a matrix of all the variations, mixing a headline from column A with the button color of column D…” Before you know it you’re sifting through piles of data with no clear direction. At worst, you’ll find a real winner but have no idea which element of that treatment is responsible for the improvement, or whether the improvement is just a fluke. To solve this problem, insist on writing a hypothesis for each experiment.
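A quick back-of-the-envelope sketch shows why the matrix approach starves for traffic. The per-cell visitor count below is a made-up ballpark, not a proper power calculation.

```python
# Why "mix everything" experiments need enormous traffic: every combination
# is its own cell, and every cell needs enough visitors on its own.
headlines, buttons, images, form_variants = 5, 4, 3, 2
cells = headlines * buttons * images * form_variants   # 120 distinct treatments

visitors_per_cell = 2000   # hypothetical minimum to detect a modest lift
print(f"{cells} cells x {visitors_per_cell} visitors = {cells * visitors_per_cell} visitors needed")
```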
Data Dredging
When you design an experiment, you have to make a hypothesis. Throwing everything against the wall and hoping to figure out after the fact what made some parts stick is a recipe for disaster.
I was in a meeting once where a well-meaning team member was pointing out a peak in a chart. He had discovered that visitors from a particular source who checked a certain option on the form and were also in the 45-65 age range had significantly higher conversion rates than other segments. He proposed a plan to auto-check that particular option for people coming from that source and to segment the incoming traffic to favor older visitors.
What that team member was doing is data dredging. Another form of this mistake shows up in people obsessed with finding patterns in the stock market or the lottery and then implementing a strategy that would have worked had they been using it in the past. It’s an enticing mistake: it seems perfectly logical that if a pattern appeared in the past, it will likely repeat in the future.
To picture the problem with data dredging more concretely, imagine you own an ice cream store. One day you ask your employees to write the color of each customer’s shirt on that customer’s receipt. At the end of the day you compile the numbers and find that people wearing red shirts buy nearly twice as much ice cream as people wearing other colors.
In hopes of capitalizing on this discovery, you research locations and find that people in St. Louis wear red shirts much more than other cities (at least in the summer during baseball season, which as an ice cream maven is the only season you’re concerned with.) Can you guess how well moving all of your operations to St. Louis will help your profits?
You can’t guess, because it is outrageously unlikely that the color of the shirts your customers are wearing has anything to do with your sales figures. To use a cliché, you’ve confused correlation with causation. Even if the correlation is a very high number it doesn’t mean anything unless it’s repeatable and falsifiable.
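If you want to convince yourself how easily dredging produces phantom winners, here is a small simulation sketch: every “segment” converts at exactly the same true rate, yet a few will still look special by luck alone. The rate, segment count and threshold are arbitrary numbers chosen for illustration.

```python
# Data-dredging simulation: identical traffic sliced into many arbitrary
# segments (shirt color x city x age bucket...) will produce "winners"
# purely by chance.
import random

random.seed(7)
BASE_RATE = 0.05            # every visitor converts at the same true rate
SEGMENTS = 50               # number of arbitrary slices you dredge through
VISITORS_PER_SEGMENT = 400

lucky = 0
for _ in range(SEGMENTS):
    conversions = sum(random.random() < BASE_RATE for _ in range(VISITORS_PER_SEGMENT))
    if conversions / VISITORS_PER_SEGMENT > BASE_RATE * 1.3:   # "30% better!"
        lucky += 1

print(f"{lucky} of {SEGMENTS} identical segments look like winners by luck alone.")
```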
Later in this series, I’ll be discussing the scientific method and how proper experiments rely on it.
Comparing the Results of Different and Unrelated Experiments
A similar mistake is comparing the results of independent experiments with each other. Often this takes the form of comparing the results of a recent experiment with one that ran earlier but not consecutively. Another form of this mistake is comparing the results of experiments that have run for different lengths of time or on different traffic sources. The worst, but surprisingly common, variant is when you compare the results of your experiment with the results of someone else’s – like the published results of another site’s A/B tests.
The only valid comparison is between the control and the treatment(s). Also, that comparison is only valid for metrics that are measuring the same thing. The bounce rate of a squeeze page with one prominent CTA button might be shockingly better than that of a treatment that asks for an email address in a form on the landing page. That difference in bounce rate tells you something, but if the difference washes out in lead quality, revenue per lead or other metrics that translate more directly into dollars, it’s not telling you anything useful.
Inconsistent or Unimportant Metrics
One thing the web does not lack is things to measure. Never has it been so easy to collect and compare so many attributes of your customers and their actions. Not every metric is equally important.
How to evaluate metrics:
- Is the metric measurable? Is it a discrete value that doesn’t need subjective interpretation?
- Is the metric dependent on something not being measured?
- Is it meaningful? Does it correlate directly with something you want to improve?
- Can you improve it by doing anything on your site? If you can’t do anything about it, why bother measuring it?
You should have core metrics that directly represent your business goals. For many commercial ventures this would be a revenue per X number. In other contexts it might be email newsletter sign ups, social sharing or some other action.
By focusing your attention on the same metrics for each experiment, you give your learning a consistent context. You will also avoid being led down the rabbit hole of paying attention to misleading or irrelevant measurements.
An example of a misleading measurement is an abnormally high or low result on one metric that isn’t reflected or explained in related metrics. Noise and outliers are expected, and not every anomalous result is significant. You can tell which are significant by testing again (or letting the current test continue) and seeing whether the spike persists or regresses back to the mean.
Another class of misleading measurement is the “Vanity Metric.” A vanity metric is a number that seems significant but has little to no effect on results coming out of the conversion funnel.
Unique visitors and page hits are the most common examples. Those are often the biggest numbers you have, but they are easy to pump and hard to keep. Your site might get a runaway link on Reddit or a write-up on Techcrunch that results in a massive influx of visitors. However, those visitors are not likely your target market and are notoriously averse to clicking ads or buying things.
Changing the mix of traffic to each treatment is dangerous. If you are running an experiment split between treatment A and B at a 60/40 split and decide to even it out to 50/50, your old data is worthless. You are including people who should have been As in your B group, thus invalidating the earlier data. You can change the mix, but you have to keep the proportions the same.
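Here is a sketch of why the split change bites, assuming a common hash-based bucketing scheme (an assumption for illustration, not what every tool does): moving the threshold silently reassigns a slice of visitors from A to B.

```python
# Hash-based bucketing sketch: a visitor lands in a stable 0-99 slot, and the
# split is just a threshold. Changing a 60/40 split to 50/50 moves everyone
# in slots 50-59 from group A to group B, mixing their history into both groups.
import hashlib

def bucket(visitor_id: str, percent_in_a: int) -> str:
    slot = int(hashlib.md5(visitor_id.encode()).hexdigest(), 16) % 100
    return "A" if slot < percent_in_a else "B"

visitor = "visitor-1138"   # hypothetical id
print(bucket(visitor, 60), bucket(visitor, 50))   # may disagree for slots 50-59
```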
Naive Analysis of Results
In a later article in this series I will describe concepts like the G-Test, the Z-Test, statistical significance, confidence and other implementation details. For the purposes of this article I’ll summarize it like this: Numbers don’t lie but you can misinterpret the hell out of them.
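To give a flavor of what is coming, here is a minimal two-proportion z-test sketch. The counts are hypothetical, and this is only a sketch of the arithmetic, not a substitute for the fuller treatment in that later article.

```python
# Two-proportion z-test sketch: did the treatment's conversion rate differ
# from the control's by more than chance would explain? Counts are made up.
from math import sqrt
from scipy.stats import norm

control_conversions, control_visitors = 180, 4000
treatment_conversions, treatment_visitors = 225, 4100

p_control = control_conversions / control_visitors
p_treatment = treatment_conversions / treatment_visitors
p_pooled = (control_conversions + treatment_conversions) / (control_visitors + treatment_visitors)

std_err = sqrt(p_pooled * (1 - p_pooled) * (1 / control_visitors + 1 / treatment_visitors))
z = (p_treatment - p_control) / std_err
p_value = 2 * (1 - norm.cdf(abs(z)))   # two-tailed

print(f"z = {z:.2f}, p = {p_value:.3f}")
```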
The best way to combat misinterpreting your data is to figure out which numbers are important and design easy-to-understand reports that display them. Automating these reports is essential.
When you have a set of consistent numbers you can then apply standard analytical tools to understand your experiments. Don’t make the mistake of pulling numbers in an ad hoc fashion. As a human you have millions of years of evolution encouraging you to see patterns in everything.
Define what success means to your business and which numbers represent it. If a number you are using, perhaps unique visitors per product page, varies more widely than the other numbers you are measuring, that is a sign the metric is not tightly correlated with your conversion funnel. The granularity of your metrics needs to match the unit at which you make decisions. That might be visitors, sessions, events or something else.
When designing your reports, make sure you are comparing the same metrics. If you compare page views with conversion events and find a strong correlation, you have discovered nothing useful. You can’t just force your current visitors to view more pages. Even if you change your site in a way that encourages more page views, you have no results to suggest whether the increased page views are causing conversions or are merely an effect of some other, unrelated visitor attribute.
Substituting Testing For Creativity and Common Sense
Once you begin to see actionable results from your testing you’ll probably notice your approach to business problems shifting. You’ll find yourself speculating less and proposing experiments more. You’ll be quicker to dismiss fuzzy thinking and un-testable conjectures.
However, you would be remiss to skip the other parts of customer development and marketing. Talking to your users will give you insights unavailable on any chart or graph. Writing clear and compelling copy is an art that pays real dividends. Attractive and persuasive design can put you far ahead of your competitors.
Testing and experimentation are powerful tools that are remaking online and offline business. You must remember that they are tools for refinement and decision-making. What you are testing is still the behavior of people. That behavior is erratic and irrational. No amount of testing will make up for being insensitive to your customers, providing bad service or shoddy products and other marks of a failing business.
Make It Easy On Your Developers
No matter how you design your experiments, the process of deploying and removing treatments must be simple and safe. If your process is not simple, you will constantly chase bugs caused by bad deployments. If deploying an experiment requires pushing a lot of code to production, you are going to get constant pushback from developers and sysadmins, and you will be in a real tight spot when a bug takes down the site.
If the process of setting up an experiment is easy and painless, you will create more experiments and find the data they produce more reliable.
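One way to keep it painless, sketched below with an invented config format rather than any particular tool’s API, is to drive experiments from a small config so a treatment can be switched off without a code deploy.

```python
# Config-driven experiments sketch: flipping "enabled" to False retires a
# treatment instantly, with no emergency deploy. Structure is illustrative.
EXPERIMENTS = {
    "signup_headline": {
        "enabled": True,
        "treatments": {"control": 50, "benefit_headline": 50},   # percent split
    },
}

def active_treatment(experiment: str, assignment: str) -> str:
    """Fall back to the control the moment an experiment is switched off."""
    config = EXPERIMENTS.get(experiment, {})
    if not config.get("enabled"):
        return "control"
    return assignment if assignment in config["treatments"] else "control"

print(active_treatment("signup_headline", "benefit_headline"))
```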
Next In The Landing Page Series
I expected to cover the implementation side of A/B testing in this article, but the topic is very deep and I think it deserves its own piece.
So, next time I’ll discuss the data and programming side of A/B testing, using statistical tools and libraries, using third-party services like Website Optimizer, developing an experiment release flow and other implementation details.
If you have any questions, please feel free to hit me up on twitter @muddylemon.