A Simple Problem On Inferencing From Censored Data

2019-03-18

The Problem

This problem can have multiple formulations, a train that arrives precisely per $T$ minutes and you measure all of your wait times; a ruler that is of length $L$ and you keep track of all objects that can be measured; a bug that jumps out of a glass of height $H$ out into the hot water so it only remembers the unsuccessful jumps, etc. The goal is the same: how would you estimate that unknown ( $T$ , $L$ , or $H$ ) based on your data, say you have $N$ measurements?

Solution

This question requires some rigorousness to do it right, but intuition matches with that rigorousness very well. Obviously the maximum of your observations, or the

N^{th}

order statistics,

X_{(N)}

is an estimator of the unknown, but you know it is always going to under estimate the truth. We need to correct this slightly larger so that we can have an unbiased estimator of the truth. Say your observations are from an uniform distribution (you leave anytime for the train, the length of the objects that your are measuring are uniformly distributed, or the bug jumps uniformly distributed heights). Then intuitively,

E (X_{(N)}) = \frac{N}{N + 1} T

Then,

T^{'} = \frac{N + 1}{N} \cdot X_{(N)}, E (T^{'}) = T

So here we have an unbiased estimator of

T

. The problem is the variance of this estimator is probably going to be large and at this point you gut tells you variance will decrease as sample size increase. So let's try a more rigorous way. Say your observations are i.i.d. from a distribution

f (x)

with CDF

F (x)

. Then the distribution of the

j^{th}

statistics is

f_{X_{j}} (x) = (\binom{N}{j}) \cdot j \cdot f (x) \cdot F (x)^{j - 1} (1 - F (x))^{n - j}

Let's say we have an uniform distribution as assumed previously on

[0, T]

f (x) = 1 / T

F (x) = x / T

. Then the distribution of the

j^{th}

statistics is

f_{X_{(j)}} (x) = \frac{1}{B (j, N - j + 1)} (\frac{x}{T})^{j - 1} (1 - \frac{x}{T})^{(N - j + 1) - 1} \cdot \frac{1}{T}

Say

Y = X / T

, then

Y \sim Beta (j, N - j + 1)

. From results of the Beta distribution, and linearity of expectation we have

E (X_{(N)}) = \frac{N}{N + 1} T and Var (X_{(N)}) = T^{2} \cdot \frac{N}{(N + 1)^{2} (N + 2)}

Establishing the previous intuitions. But here are some further twists: what is the percentage of time that this estimator will be under estimating the truth? And how does this percentage change with

N

? This is basically,

lim_{N \to \infty} F_{Y} (\frac{N}{N + 1}) = lim_{N \to \infty} (\frac{1}{(1 + \frac{1}{N})^{N}}) = \frac{1}{e}

That's about

36.8 %

of time. To simulate,


R> N = 1e5
R> pbeta(1-(1/(N+1)), N, 1)
# [1] 0.3678813
R> set.seed(123)
R> sum(replicate(5000, max(runif(N))*(N+1)/N < 1))
# [1] 1826
R> 1826/5000
# [1] 0.3652