A Simple Problem On Inferencing From Censored Data

2019-03-18

The Problem

This problem has several formulations: a train that arrives precisely every T minutes, and you record all of your wait times; a ruler of length L, and you record the lengths of all the objects it can measure; a bug whose successful jump out of a glass of height H lands it in hot water, so it only remembers the unsuccessful jumps; and so on. The goal is the same: given N measurements, how would you estimate the unknown (T, L, or H)?

Solution

Doing this properly takes some rigor, but the intuition agrees with the rigorous answer very well. The obvious estimator of the unknown is the maximum of your observations, the Nth order statistic $X_{(N)}$, but it can never exceed the truth, so it underestimates on average. We need to scale it up slightly to get an unbiased estimator.

Say your observations come from a uniform distribution (you leave for the train at a random time, the lengths of the objects you measure are uniformly distributed, or the bug's jump heights are uniformly distributed). Intuitively, N uniform points split $[0, T]$ into N+1 gaps of equal expected length, so

$$E(X_{(N)}) = \frac{N}{N+1} T.$$

Then

$$\hat{T} = \frac{N+1}{N} X_{(N)}, \qquad E(\hat{T}) = T.$$

So here we have an unbiased estimator of T. The concern is that its variance may be large, and at this point your gut tells you the variance will decrease as the sample size increases.

So let's try a more rigorous way. Say your observations are i.i.d. from a distribution with density $f(x)$ and CDF $F(x)$. Then the density of the jth order statistic is

$$f_{X_{(j)}}(x) = \binom{N}{j}\, j\, f(x)\, F(x)^{j-1} \bigl(1 - F(x)\bigr)^{N-j}.$$

With the uniform distribution on $[0, T]$ assumed above, $f(x) = 1/T$ and $F(x) = x/T$, so the density of the jth order statistic becomes

$$f_{X_{(j)}}(x) = \frac{1}{B(j,\, N-j+1)} \left(\frac{x}{T}\right)^{j-1} \left(1 - \frac{x}{T}\right)^{(N-j+1)-1} \frac{1}{T}.$$

Let $Y = X_{(j)}/T$; then $Y \sim \mathrm{Beta}(j,\, N-j+1)$. Taking $j = N$, the known mean and variance of the Beta distribution together with linearity of expectation give

$$E(X_{(N)}) = \frac{N}{N+1} T \quad \text{and} \quad \mathrm{Var}(X_{(N)}) = \frac{T^2 N}{(N+1)^2 (N+2)},$$

which confirms the earlier intuition: the scaled maximum is unbiased, and its variance shrinks as N grows.

But here are some further twists: what fraction of the time will this estimator underestimate the truth? And how does that fraction change with N? The estimator underestimates exactly when $Y < N/(N+1)$, and for $Y \sim \mathrm{Beta}(N, 1)$ the CDF is $F_Y(y) = y^N$, so the answer is

$$\lim_{N \to \infty} F_Y\!\left(\frac{N}{N+1}\right) = \lim_{N \to \infty} \frac{1}{\left(1 + \frac{1}{N}\right)^{N}} = \frac{1}{e}.$$

That's about 36.8% of the time. To simulate (with T = 1):

    R> N = 1e5
    R> pbeta(1-(1/(N+1)), N, 1)   # exact probability for this N
    # [1] 0.3678813
    R> set.seed(123)
    R> sum(replicate(5000, max(runif(N))*(N+1)/N < 1))   # count of underestimates
    # [1] 1826
    R> 1826/5000
    # [1] 0.3652
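The mean and variance results for the scaled maximum can be checked by simulation in the same way. The following is only a minimal sketch: the values `T_true = 10`, `N = 20`, the number of replicates, and all variable names are assumptions chosen for illustration, not taken from the post. The theoretical variance used below, $T^2 / (N(N+2))$, is just $\mathrm{Var}(X_{(N)})$ multiplied by $\left(\frac{N+1}{N}\right)^2$.

```r
# Sketch: compare simulated and closed-form properties of the scaled maximum.
# Assumed illustrative values (not from the post): T_true = 10, N = 20.
set.seed(42)
T_true <- 10    # the unknown being estimated (named T_true because T means TRUE in R)
N <- 20         # measurements per experiment
reps <- 20000   # number of simulated experiments

# Each replicate: draw N values from Uniform(0, T_true), take the maximum,
# and scale it by (N + 1) / N.
t_hat <- replicate(reps, max(runif(N, min = 0, max = T_true)) * (N + 1) / N)

# Closed-form values implied by X_(N) / T ~ Beta(N, 1).
var_theory <- T_true^2 / (N * (N + 2))   # ((N+1)/N)^2 * Var(X_(N))
under_theory <- (N / (N + 1))^N          # P(t_hat < T_true)

sim <- c(mean = mean(t_hat), var = var(t_hat), under = mean(t_hat < T_true))
theory <- c(mean = T_true, var = var_theory, under = under_theory)
print(rbind(simulated = sim, theory = theory))
```

If the derivation holds, the two rows should agree up to simulation noise, and rerunning with a larger `N` should show the variance shrinking while the underestimation fraction drifts toward 1/e.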