Basic Statistics Concepts: Explain Like I Am Five (Maybe Fifteen)

2018-01-02

I recently had the chance to give intuitive explanations to a friend who is trying very hard to study intro Statistics concepts. Statistics can be confusing, even at the intro level. It is important to get the basic ideas straight, and it almost always helps to stick with your questions until you are absolutely clear. Here I am posting some of my explanations, hoping to help those who are troubled by similar questions.

Background

Suppose you want to know the mean height of students in this year's graduating class. The whole graduating class is your population, with mean height $\mu$ and standard deviation of height $\sigma$. However, you cannot collect data for everybody in the graduating class. Instead you collect height data only for a sample of $n$ students. The sample has mean height $\bar{X}$ and standard deviation of height $\sigma(X)$. Each time you collect a sample, you get a different mean and standard deviation. Therefore both $\bar{X}$ and $\sigma(X)$ are random variables with their own mean and standard deviation, while $\mu$ and $\sigma$ are just constants. You want to know the relationship between your sample mean $\bar{X}$ and the population mean $\mu$, as well as that between your sample sd (standard deviation) $\sigma(X)$ and the population sd $\sigma$.

How do you estimate the population mean $\mu$ from the sample mean $\bar{X}$?

The short answer is: you collect a student sample of some size $n$, take their average height $\bar{X}$, call the number you get $m$, and assert that you think $\mu = m$. Why can you make such an assertion? Because the sample mean is an unbiased estimator of the population mean, or, put more simply, you expect your sample mean to be the same as the population mean, although you realize you may have drawn a bad sample. How so? Suppose student height is a random variable $X$ with mean $\mu$ and sd $\sigma$. Your sample consists of independent and identically distributed (iid) copies $X_i \sim X$ with $i = 1, \dots, n$. That is, for the purpose of this problem, they are independent, their means are all $\mu$, and their sds are all $\sigma$. Then

$$E(\bar{X}) = E\left(\frac{\sum_{i=1}^{n} X_i}{n}\right) = \frac{1}{n}\sum_{i=1}^{n} E(X_i)$$

Let's stop here and examine why. The second step just substitutes the definition of $\bar{X}$ into the first. The third step, however, is true because the expectation $E$ is a linear operator. That means $E(aX + bY) = aE(X) + bE(Y)$, where $a$ and $b$ are two constants and $X$ and $Y$ are two random variables. Now it should be clear why the third step is true. Finally,

$$\frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}(n\mu) = \mu$$

Voila! We see $E(\bar{X}) = \mu$. That is, again, the sample mean is expected to be the same as the population mean.
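Unbiasedness is easy to check empirically. Here is a minimal sketch using only the standard library: it builds a hypothetical population of 1000 student heights (the 170 cm mean and 8 cm sd are made-up values for illustration), draws many samples of size $n$, and checks that the sample means average out to the population mean.

```python
import random

random.seed(42)

# Hypothetical population: 1000 student heights in cm (assumed values).
population = [random.gauss(170, 8) for _ in range(1000)]
mu = sum(population) / len(population)  # the true population mean

# Draw many samples of size n and record each sample mean.
n, trials = 4, 20000
sample_means = []
for _ in range(trials):
    sample = random.sample(population, n)
    sample_means.append(sum(sample) / n)

# Any single sample mean may be far from mu, but on average the
# sample mean neither over- nor under-shoots: E(X_bar) = mu.
mean_of_means = sum(sample_means) / trials
print(mu, mean_of_means)
```

Even with a tiny sample size of 4, the average of the 20,000 sample means lands essentially on top of $\mu$, which is exactly what "unbiased" means.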

What is $\sigma(\bar{X}) = \sigma/\sqrt{n}$?

Now you ask yourself: what if there are 1000 students in this graduating class, but I am only taking 4 students in my sample? Is my estimate still unbiased? Is the above still true? Yes. Although you have a greater chance of drawing a bad sample, the estimator (the rule of estimating by taking the sample mean) is still good and the estimate is still unbiased. An unbiased estimator can nevertheless be bad if its variance is too large. Then how can you shrink the variance? The answer is by taking a larger sample. Intuitively this makes sense: the larger the sample, the more likely extreme data on one side are compensated by extremes on the other side when you take the mean. Mathematically speaking,

$$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{\sum_{i=1}^{n} X_i}{n}\right) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i)$$

The third step is true since for two independent random variables $X$ and $Y$, $\mathrm{Var}(aX + bY) = a^2\mathrm{Var}(X) + b^2\mathrm{Var}(Y)$. Notice that our $X_i$'s are all independent and identically distributed. Therefore we further have

$$\frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i) = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}$$

Take the square root of both sides:

$$\sigma(\bar{X}) = \frac{\sigma}{\sqrt{n}}$$
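The $\sigma/\sqrt{n}$ scaling can also be seen in a simulation. This sketch (again with an assumed population mean of 170 cm and sd of 8 cm) measures the empirical spread of the sample mean for several sample sizes and compares it with the theoretical value $\sigma/\sqrt{n}$:

```python
import random

random.seed(0)

SIGMA = 8.0  # assumed population sd for this sketch

def sd_of_sample_mean(n, trials=10000):
    """Empirical sd of the mean of n iid draws from N(170, SIGMA)."""
    means = []
    for _ in range(trials):
        draws = [random.gauss(170, SIGMA) for _ in range(n)]
        means.append(sum(draws) / n)
    m = sum(means) / trials
    var = sum((x - m) ** 2 for x in means) / trials
    return var ** 0.5

# Compare the simulated sd of X_bar with the theoretical sigma/sqrt(n).
results = {n: sd_of_sample_mean(n) for n in (4, 16, 64)}
for n, empirical in results.items():
    print(n, empirical, SIGMA / n ** 0.5)
```

Quadrupling the sample size halves the spread of $\bar{X}$, just as $\sigma/\sqrt{n}$ predicts: the empirical sds come out near 4, 2, and 1 for $n = 4, 16, 64$.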

Why does the sample variance divide by $n-1$ instead of $n$?

Let's continue our story. Now you know you need a large sample to get a low-variance unbiased estimate of the mean height. You are becoming more ambitious: you want to estimate the population sd $\sigma$. Is your sample variance an unbiased estimator of the population variance?
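Before the full derivation, a quick simulation (with an assumed population variance of $64$, i.e. sd $8$) hints at the answer: averaging the squared deviations from $\bar{X}$ and dividing by $n$ systematically underestimates $\sigma^2$, while dividing by $n-1$ does not.

```python
import random

random.seed(1)

SIGMA2 = 64.0  # assumed true population variance
n, trials = 4, 50000
biased_sum, unbiased_sum = 0.0, 0.0
for _ in range(trials):
    sample = [random.gauss(170, SIGMA2 ** 0.5) for _ in range(n)]
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)  # sum of squared deviations
    biased_sum += ss / n          # candidate estimator: divide by n
    unbiased_sum += ss / (n - 1)  # candidate estimator: divide by n - 1
print(biased_sum / trials, unbiased_sum / trials)
```

With $n = 4$, the divide-by-$n$ estimator averages out to roughly $\frac{3}{4}$ of the true variance, while the divide-by-$(n-1)$ estimator centers on the true value. Why that is the case deserves its own derivation.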

(TBC)

© The Responsible Adult 2017 - 2025