What are Type I and Type II errors?

February 20, 2025
Reading time: 5 mins

Type I and type II errors in hypothesis testing both occur when a tester or statistician draws inaccurate conclusions from experimentation or datasets. The main difference between the two is how the tester reaches their incorrect conclusion: 

  • Type I errors occur when you incorrectly reject a true null hypothesis.
  • Type II errors occur when you fail to reject a false null hypothesis.

 

With both type I and type II errors, you draw wrong inferences, either seeing effects from your experiments that are not there, or not seeing them when they exist. They are known colloquially as false positives (type I) and false negatives (type II).

Type I and type II errors happen for various reasons across the experiment lifecycle. They can arise from failing to plan a test with an appropriate significance level or sufficient statistical power, from peeking at its results early, or incidentally from poor sampling.

In this article, we'll cover:

  • What are type I errors?
  • What are type II errors?
  • What is the difference between type I and type II errors?
  • How to approach and minimize type I and type II errors

What are type I errors?

A type I error happens when you run an experiment and wrongly conclude that the change you tested impacted your target metric. With type I errors, you see an effect that’s not there and reject your null hypothesis based on that observation.

Your probability of making a type I error depends on your experiment’s significance level. Generally, a web experiment's significance level is set at 5% (0.05), which means that when your change truly has no effect, your chance of making a type I error is 5%.

Understanding type I errors

When you begin your experiment, you set its significance level threshold (α). Your experimentation solution then compares your test’s p-value (the probability of obtaining results at least as extreme as those actually observed, assuming the null hypothesis is true) with the significance level you set. If your test’s p-value is lower than your significance level threshold (α), you're looking at statistically significant results, and you can reject your null hypothesis with a known, controlled risk of a type I error.
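
To make that comparison concrete, here’s a minimal sketch in Python, using a two-proportion z-test from statsmodels (a library assumption; the article doesn’t prescribe a particular tool). The visitor and conversion counts are made up for illustration.

```python
# Minimal sketch: compare a test's p-value against the significance threshold (alpha).
# All counts below are illustrative assumptions.
from statsmodels.stats.proportion import proportions_ztest

alpha = 0.05                 # significance level threshold, set before the test starts
conversions = [520, 610]     # conversions for [control, challenger]
visitors = [5000, 5000]      # visitors shown each version

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: statistically significant, reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: not significant, do not reject the null hypothesis")
```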

A type I error example in web experimentation

Suppose you’ve planned a week-long A/B test on your homepage’s header messaging. You settled on a week because that would give both the control and the challenger an adequate sample size. Let’s also say you’ve set the statistical significance level for this test to the default 5%.

Assuming you started this test on Monday, you should wait until the following Monday to see your results. But you checked your experimentation solution on Wednesday (you peeked!) and found the test to be statistically significant, with the challenger beating the control. Based on that observation, you stopped the experiment, concluded that the null hypothesis was false, and declared version B the winner.

After rolling out the change for all users, you end up with more or less the same numbers as before. The effect didn’t hold after all.

You committed a type I error!

This example also shows how not peeking is one way to avoid making a type I error. This is because the p-value for an experiment can change on a daily basis, often falling under the threshold in the interim and only hitting its final value when the experiment runs its intended length.
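
To see how much peeking can inflate the false positive rate, here’s a minimal simulation of an A/A-style setup: both versions share the same true conversion rate, so any “significant” result is a type I error. The traffic figures, conversion rate, and seven-day length are assumptions for illustration.

```python
# Minimal simulation: how daily peeking inflates the type I error rate.
# Both arms have the same true conversion rate, so every "win" is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

true_rate = 0.10           # same conversion rate for control and challenger (assumed)
visitors_per_day = 500     # per arm, per day (assumed)
days = 7                   # intended test length
alpha = 0.05
n_experiments = 2000

peeked_fp = 0   # significant at *any* daily check
final_fp = 0    # significant only at the planned end date

for _ in range(n_experiments):
    conv_a = conv_b = n_a = n_b = 0
    significant_on_any_day = False
    for _day in range(days):
        conv_a += rng.binomial(visitors_per_day, true_rate)
        conv_b += rng.binomial(visitors_per_day, true_rate)
        n_a += visitors_per_day
        n_b += visitors_per_day
        # Two-proportion comparison via a 2x2 chi-squared test.
        table = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
        _, p, _, _ = stats.chi2_contingency(table)
        if p < alpha:
            significant_on_any_day = True
    if significant_on_any_day:
        peeked_fp += 1
    if p < alpha:           # p from the final day only
        final_fp += 1

print(f"False positive rate with daily peeking: {peeked_fp / n_experiments:.1%}")
print(f"False positive rate at planned end:     {final_fp / n_experiments:.1%}")
```

In runs like this, checking seven times instead of once typically pushes the false positive rate well above the 5% you planned for.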

Of course, not all type I errors are a result of stopping experiments early, but they are a common culprit. Some experiment types, such as sequential testing, even allow for controlled peeking.


What are type II errors?

A type II error is when you run an experiment and conclude that the change you tested didn’t impact your target metric when, in reality, it did. In other words, with type II errors, you miss the effect your experiment produces and fail to reject your null hypothesis.

Your probability of making a type II error depends on your experiment’s statistical power. Generally, statistical power is set at 80% (0.80), which means there’s an 80% chance your experiment will detect a true effect of the size you powered it for. It also means there’s a 20% chance you could miss that real impact and make a type II error.
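
As a rough illustration of that 20%, here’s a minimal simulation (all numbers assumed): a real lift exists, the sample size gives roughly 80% power, and the runs that still fail to reach significance are type II errors.

```python
# Minimal simulation: even with ~80% power, a real effect is missed ~20% of the time.
# Conversion rates and sample size below are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

rate_control = 0.10       # control's true conversion rate (assumed)
rate_variant = 0.11       # challenger's true rate: a real lift exists (assumed)
n_per_arm = 7_400         # roughly what 80% power requires for this lift
alpha = 0.05
n_experiments = 2000

misses = 0
for _ in range(n_experiments):
    conv_a = rng.binomial(n_per_arm, rate_control)
    conv_b = rng.binomial(n_per_arm, rate_variant)
    table = [[conv_a, n_per_arm - conv_a], [conv_b, n_per_arm - conv_b]]
    _, p, _, _ = stats.chi2_contingency(table)
    if p >= alpha:
        misses += 1       # the real effect went undetected: a type II error

print(f"Observed type II error rate: {misses / n_experiments:.1%} (roughly 1 - power)")
```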

Understanding type II errors

Before you set up your experiment, you run a "power analysis" to determine the sample size you'll need given your desired power level, significance level, and expected effect size. Then, you input this sample size into your experimentation solution.
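
Here’s what that power analysis might look like in Python with statsmodels (a library assumption, as before). It answers: how many visitors per version do we need to detect a lift from a 10% to an 11% conversion rate, with 80% power at a 5% significance level? The baseline and lift are illustrative.

```python
# Minimal power-analysis sketch: solve for the sample size per arm given the
# desired power, significance level, and expected effect size (all assumed).
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.10          # current conversion rate (assumed)
expected = 0.11          # smallest lift worth detecting (assumed)
alpha = 0.05             # significance level threshold
power = 0.80             # desired statistical power

effect_size = proportion_effectsize(expected, baseline)
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    ratio=1.0,           # equal traffic split between control and challenger
    alternative="two-sided",
)
print(f"Required sample size per arm: {round(n_per_arm):,}")
```

For these assumed numbers, it works out to several thousand visitors per version; a smaller expected lift would push that figure up quickly.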

Once your experimentation solution delivers your experiment to the target sample size and the test runs its intended length, you can trust the result. Running an adequately powered test for its full length is how you detect the true effects your change causes and avoid making type II errors.

A type II error example in web experimentation

Let's return to our header messaging A/B test example. Suppose this time you had to show each version to 10,000 users for reliable results. But because you didn’t see a winning version immediately, you concluded the test before meeting your sample size requirement, noting that your change did not lead to a statistically significant result.

A short time later, you interview a customer you acquired during the experiment and ask what prompted them to convert. They underline how the messaging spoke to them, so you decide to rerun the test. This time, you let it run its intended length and see that the challenger beats the control.

So, you had initially committed a type II error in this experiment.

What is the difference between type I and type II errors?

When you make a type I error, you reject a null hypothesis even though it's true, whereas when you commit a type II error, you fail to reject the null hypothesis when it is false.

In terms of web experimentation, with type I errors, you detect effects that don’t exist. In contrast, with type II errors, you fail to detect existing effects.

With type I errors, you’ll end up rolling out a version you believe will positively impact your target metric. At best, the change won’t move your numbers (since the lift you saw won’t hold), but it can end up hurting your conversions, too: the effect you’re rolling out was never really validated, so it could go either way.

On the other hand, with a type II error, you’ll end up sticking to the status quo, and any improvement that the experiment would have made is lost. Both can prove costly in their own ways.

Type I errors are easier to detect because the “uplift” you saw doesn't sustain; another giveaway is a negative impact on your target metric after the full rollout. In contrast, type II errors are more difficult to detect because you conclude the challenger was no better than the control and stick with the default.

[Image: Type I and type II errors can lead to incorrect conclusions from data.]

How to approach and minimize type I and type II errors

  • Reducing the likelihood of one type of error makes you more vulnerable to making the other. If you reduce your significance level threshold, you lower your chance of making a type I error: setting alpha to 1% instead of 5% cuts your chance of a false positive (seeing an effect where there is none) from 5% to 1%.

    However, reducing alpha also reduces a test’s power, making you more vulnerable to a type II error. A test that’s not sufficiently powered will not be sensitive enough to detect real differences. In other words, lowering alpha makes you less likely to “detect” a non-existent effect, but more likely to miss one that’s there (see the sketch after this list).

  • Get your test logistics right. Go for a 5% statistical significance level threshold and 80% power. There’s nothing magical about these numbers, but they’re the standard for web optimization. 
  • A good sample size is your best defense against both type I and type II errors. Keep in mind, however, that a considerable sample size means your test takes more time to run, which impacts your testing velocity. For low-traffic sites, this means really long-running experiments.
  • Invest in a good experimentation solution that comes with an accurate stats engine. Investigate the different statistical calculations it uses, for example, for computing the p-value.
  • Add A/A tests to your experimentation mix. A/A tests can detect issues with your experimentation solution and help iron out many issues that can lead to type I and type II errors.
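
The sketch below (same statsmodels assumption and illustrative numbers as above) makes the first bullet’s trade-off visible: with the effect size and sample size held fixed, tightening alpha from 5% to 1% lowers power and therefore raises the type II error risk.

```python
# Minimal sketch of the alpha/power trade-off: a stricter alpha means lower power
# for the same effect size and sample size (all values assumed for illustration).
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect_size = proportion_effectsize(0.11, 0.10)   # lift from a 10% to an 11% conversion rate
n_per_arm = 7_400                                 # fixed sample size per arm

for alpha in (0.05, 0.01):
    power = NormalIndPower().solve_power(
        effect_size=effect_size,
        nobs1=n_per_arm,
        alpha=alpha,
        alternative="two-sided",
    )
    print(f"alpha = {alpha:.2f} -> power = {power:.0%}, type II error risk = {1 - power:.0%}")
```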

The bottom line on type I and type II errors

While you can't eliminate type I and type II errors, you can significantly lower their chances of hurting your experimentation.

Doing so is essential, as routinely running into type I and type II errors can derail your optimization program. Both types of errors can make you question your entire optimization process and lead you into untested waters. 

If you set a good significance level, run adequately powered tests, work with large sample sizes, and let your experiments run their intended length, you’ll already be in a good position to avoid type I and type II errors. Knowing that you have to keep an eye out for them will also make a big difference. Remember to stay vigilant and follow best practices in all of your experimentation!

Check out our chat with Ronny Kohavi for more on how to avoid these inevitable hypothesis-testing errors.
