Why do we need 30 data samples?

March 6, 2021 bizpi

I saw this question on a forum, and I wanted to give my response. Since the forum is under a membership account, I cannot share the link, so I’ll make it a post.

A common “rule of thumb” taught in statistics is that you need 30 data points to have a good sample size, but what is the origin of the number 30?

I believe it comes from the idea that many data sets fit a normal distribution (which isn’t true, but when you don’t know yet, you might as well assume normality). When you have small samples, you should use the Student’s t-distribution in place of the normal distribution. The t-distribution applies a correction factor to your confidence interval until your sample size gets to an “acceptable” level. When you have small samples, you don’t have sufficient data to observe all the extreme values of the distribution yet, so the correction factor attempts to adjust for that.

The t-distribution table looks like this:

For a very large sample (one that approaches infinity), the values are listed in the last row of the table above. As you scan down each column through the increasing degrees of freedom (v, which is the sample size minus one for a single sample), you’ll see that the correction factor starts to approach the infinite-sample value. Around 20, 25, 30, and 40 samples, the change in the correction factor for each additional sample gets smaller and smaller as it converges on that limiting value.
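If you want to see this convergence for yourself, here is a minimal sketch that rebuilds one column of a t-table (two-sided, 95% confidence) using only the Python standard library. The function names t_pdf, t_cdf, and t_critical are my own for illustration, and the numerical integration is a simple approximation, not how a statistics package would do it.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=1000):
    """P(T <= x) for x >= 0: Simpson's rule on [0, x], plus 0.5 by symmetry.

    steps must be even for Simpson's rule.
    """
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(df, confidence=0.95):
    """Two-sided critical value, found by bisection on the CDF.

    Assumes the answer lies below 100, which holds for 95% confidence.
    """
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 100.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Large-sample limit: the normal distribution (the "infinity" row).
z_limit = NormalDist().inv_cdf(0.975)

for df in (5, 10, 15, 20, 25, 30, 40):
    print(f"df = {df:>3}: t = {t_critical(df):.3f}")
print(f"df = inf: t = {z_limit:.3f}")
```

The printed values should line up with a published table: roughly 2.57 at 5 degrees of freedom and 2.04 at 30, against 1.96 in the limit.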

Therefore, after you hit around 30 samples, each new data point brings diminishing returns. You learn much more about your data when you go from 15 to 20 samples than when you go from 35 to 40 samples.
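One way to put a number on these diminishing returns is to look at the half-width of a 95% confidence interval for the mean, which is t times s divided by the square root of n. The sketch below compares the half-width (in units of the sample standard deviation s) at a few sample sizes. The critical values are standard two-sided 95% table values, and the helper name relative_half_width is my own.

```python
import math

# Standard two-sided 95% t critical values, keyed by degrees of freedom (n - 1).
T_95 = {14: 2.145, 19: 2.093, 34: 2.032, 39: 2.023}


def relative_half_width(n):
    """Half-width of the 95% interval for the mean, in units of s."""
    return T_95[n - 1] / math.sqrt(n)


# How much the interval tightens for the same 5 extra samples.
gain_small = relative_half_width(15) - relative_half_width(20)
gain_large = relative_half_width(35) - relative_half_width(40)

for n in (15, 20, 35, 40):
    print(f"n = {n}: half-width = {relative_half_width(n):.3f} * s")
print(f"15 -> 20 samples tightens the interval by {gain_small:.3f} * s")
print(f"35 -> 40 samples tightens the interval by {gain_large:.3f} * s")
```

The same 5 extra samples tighten the interval by about 0.09 s when going from 15 to 20, but only about 0.02 s when going from 35 to 40.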

So I don’t think there is any magic to 30 samples (compared to 25 or 31 or 33), but it is a good round number to use. In my experience, 30 data points also create a more complete picture in a histogram, which allows me to see the underlying distribution of the data. If the data is not normal, you may need fewer or more than 30 samples.

If you cannot get at least 30 data points, try to get as many as you can. In aerospace, we would only get 5 prototype units to test. We did the best we could with the sample of 5, but we would apply the t-distribution correction factor to widen our standard deviation predictions because of the limited data.
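As a rough sketch of what that correction looks like with 5 units, the example below builds a 95% confidence interval for the mean twice: once with the normal value (1.960) and once with the t value for 4 degrees of freedom (2.776, from a standard t-table). The measurements are made up for illustration and are not real test data.

```python
import math
from statistics import mean, stdev

# Hypothetical measurements from 5 prototype units (illustrative only).
measurements = [10.2, 9.8, 10.5, 10.1, 9.9]

n = len(measurements)
sample_mean = mean(measurements)
sample_sd = stdev(measurements)          # sample standard deviation (n - 1)
std_error = sample_sd / math.sqrt(n)

Z_95 = 1.960       # normal value, appropriate for very large samples
T_95_DF4 = 2.776   # t value for n - 1 = 4 degrees of freedom

z_half = Z_95 * std_error
t_half = T_95_DF4 * std_error

print(f"mean = {sample_mean:.2f}, s = {sample_sd:.3f}")
print(f"normal interval: {sample_mean - z_half:.2f} to {sample_mean + z_half:.2f}")
print(f"t interval:      {sample_mean - t_half:.2f} to {sample_mean + t_half:.2f}")
```

With only 5 samples, the t-based interval comes out about 1.4 times as wide as the normal-based one, which is the penalty you pay for the limited data.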

I like to tell my classes that 2 data points are better than only 1, 5 is better than 3, 15 is better than 10, and 25 is better than 20.
