5

I have a quite small data set (on the order of 8-20 values) from an essentially unknown system, and I would like to predict a value that will be higher than the next number generated by the same system 90% of the time. Both underestimation and overestimation are problematic.

What is the mathematically "correct" way to do this?

If I could also generate a level-of-confidence estimate, it would wow my manager. Also, let me say I'm not a math major, so thanks for any help, however remedial it may be :)

  • 0
    The Wikipedia article on this topic is worthless, which is surprising because nonparametric prediction intervals have seen wide use in environmental monitoring during the last 20 years. See http://info.ngwa.org/gwol/pdf/912554528.PDF , which includes a sketch of the theory. 2019-04-19
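
A rough sketch of the kind of result behind nonparametric prediction limits, under the assumption (not guaranteed by the question) that the observations are independent draws from a single continuous distribution. Write $X_{(1)} \le X_{(2)} \le \dots \le X_{(n)}$ for the sorted sample; the symbols here are introduced only for illustration. By symmetry, the next observation is equally likely to land in any of the $n+1$ gaps between and around the sorted values, so

$P\left(X_{n+1} \le X_{(k)}\right) = \frac{k}{n+1}$

For example, with $n = 9$ the sample maximum ($k = 9$) is exceeded by the next value with probability $1/10$, so it serves as a 90% upper prediction limit; with $n = 19$ the second-largest value ($k = 18$) plays the same role, since $18/20 = 0.9$.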

4 Answers

5

This is where the bootstrap technique comes in handy. You do not need to know anything about the underlying distribution.

Your question is a good example of where the bootstrap applies. It also lets you construct confidence intervals. The bootstrap is easy to implement on a computer and runs quickly. A typical number of bootstrap samples is around 100 to 200 when estimating a standard error; confidence intervals and quantiles generally call for more, on the order of 1000 or above.

Read through the Wikipedia page on the bootstrap and let me know if you need more information; I am happy to help.

The book by Bradley Efron covers this technique from an application point of view in great detail. The bootstrap algorithm for estimating standard errors is explained on page 47, Algorithm 6.1. You can use this algorithm to construct confidence intervals and find quantiles.
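
To make the idea concrete, here is a minimal sketch of a bootstrap standard-error estimate in Python. It is not a transcription of the book's algorithm, only the general recipe: resample with replacement, recompute the statistic, and take the standard deviation of the results. The function name, the default of 200 resamples, the fixed seed, and the data values are all assumptions made for illustration.

```python
import random
import statistics


def bootstrap_standard_error(sample, statistic, n_resamples=200, seed=0):
    """Estimate the standard error of `statistic` by resampling `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    replicates = []
    for _ in range(n_resamples):
        # Draw n values from the original sample, with replacement.
        resample = rng.choices(sample, k=n)
        replicates.append(statistic(resample))
    # The spread of the replicates approximates the standard error.
    return statistics.stdev(replicates)


# Made-up values, for illustration only.
data = [12.1, 9.8, 11.4, 10.2, 13.7, 10.9, 12.6, 11.0, 9.5, 14.2]
print(bootstrap_standard_error(data, statistics.median))
```

Any function that maps a list of numbers to a single number can be passed as `statistic`, including a percentile estimator.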

  • 0
    Will do! Thanks. 2011-02-15
1

The percentile corresponding to a fraction $p$ (for example, $p = 0.9$) can be estimated from a sample of size $N$ by interpolating between sample values.

Consider the "desired rank" given by $p(N+1)$. You can express this number as an integer $k$ plus a fractional part $d$:

$p(N+1) = k + d$

Then you estimate the percentile as

$Y_{k} + d(Y_{k+1}-Y_k)$

where $Y_i$ is the $i$th smallest sample value, so that $Y_1 \le Y_2 \le \dots \le Y_N$. (The cases $k = 0$ and $k = N$ are exceptions: there you take $Y_1$ and $Y_N$, respectively, as your estimate.) You can see more details in the NIST Statistics Handbook.
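
The interpolation rule above can be sketched in a few lines of Python. The function name and the sample values are made up for illustration. The sample is sorted in ascending order, so the smallest value sits at index 0 of the list.

```python
import math


def interpolated_percentile(sample, p):
    """Estimate the quantile for fraction p (0 < p < 1) from rank p * (N + 1)."""
    y = sorted(sample)      # y[0] is the smallest value, y[-1] the largest
    n = len(y)
    rank = p * (n + 1)      # the "desired rank"
    k = math.floor(rank)    # integer part
    d = rank - k            # fractional part
    if k <= 0:
        return y[0]         # rank falls below the smallest value
    if k >= n:
        return y[-1]        # rank falls at or beyond the largest value
    # Interpolate between the k-th and (k+1)-th smallest values.
    return y[k - 1] + d * (y[k] - y[k - 1])


# Made-up values, for illustration only.
data = [12.1, 9.8, 11.4, 10.2, 13.7, 10.9, 12.6, 11.0, 9.5, 14.2]
print(interpolated_percentile(data, 0.90))
```

With only 8 to 20 points, a request for the 90th percentile lands on or near the largest one or two values, so the estimate depends heavily on those few observations.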

As user17762 pointed out, you can estimate the uncertainty in your estimate of the percentile by bootstrapping. Essentially, you generate a large number of new samples (each of the same size as your original sample) by drawing values from the original sample, with replacement. Each of these new samples gives you a different estimate for the percentile. By looking at the spread in these estimates, you can say something about the uncertainty in your estimate of the percentile.
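
A minimal sketch of that procedure, assuming the rank-based interpolation described earlier in this answer as the percentile estimator. The function names, the 2000 resamples, the fixed seed, the choice of a central 95% range, and the data values are all assumptions for illustration.

```python
import math
import random


def percentile_estimate(sample, p):
    """Rank-based interpolated percentile using rank p * (N + 1)."""
    y = sorted(sample)
    n = len(y)
    rank = p * (n + 1)
    k = math.floor(rank)
    d = rank - k
    if k <= 0:
        return y[0]
    if k >= n:
        return y[-1]
    return y[k - 1] + d * (y[k] - y[k - 1])


def bootstrap_percentile_estimates(sample, p=0.90, n_resamples=2000, seed=0):
    """Return the sorted percentile estimates from bootstrap resamples."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(sample, k=n)  # n draws with replacement
        estimates.append(percentile_estimate(resample, p))
    return sorted(estimates)


# Made-up values, for illustration only.
data = [12.1, 9.8, 11.4, 10.2, 13.7, 10.9, 12.6, 11.0, 9.5, 14.2]
estimates = bootstrap_percentile_estimates(data)
low = estimates[int(0.025 * len(estimates))]
high = estimates[int(0.975 * len(estimates)) - 1]
print(f"point estimate: {percentile_estimate(data, 0.90):.2f}")
print(f"central 95% of bootstrap estimates: {low:.2f} to {high:.2f}")
```

Note that no resample can contain a value larger than the observed maximum, so the bootstrap estimates never exceed it. For a high percentile and a small sample, the reported spread should be read with that limitation in mind.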

0

Chances are that the best answer is in some book titled "Non-parametric statistics." I don't know the texts in the field so I can't suggest one.

On the other hand, it's probably not that hard to get close to what you want using a heuristic method, guided perhaps by some experimentation.

If the sample were very large, you could take the 90th percentile. I would adapt this idea to small sample sizes with the added rule that the prediction would be equal to one of the values in the set.

For a set S with N observations sorted in ascending order and indexed by i from 1 to N, the rule would be something like this:

For N <= 9, prediction = x[N]
For 9 < N <= 20, prediction = x[N-1]
For 20 < N < 40, prediction = x[N-2]

Note that I'm only suggesting the form of the answer, not the actual rule, and in fact, I'm really only suggesting that you base your prediction on the upper tail of the sample.
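
One way to write a rule of this form in Python, purely as a sketch. Two readings are assumptions on my part: the first case is taken to mean "use the largest value", and N = 20 is placed in the middle band. The function name and thresholds are illustrative, not a recommended rule.

```python
def upper_tail_prediction(sample):
    """Pick one of the largest sample values, depending on the sample size."""
    x = sorted(sample)  # ascending order; x[-1] is the largest value
    n = len(x)
    if n == 0:
        raise ValueError("sample must not be empty")
    if n <= 9:
        drop = 0        # use the largest value
    elif n <= 20:
        drop = 1        # use the second-largest value
    elif n < 40:
        drop = 2        # use the third-largest value
    else:
        raise ValueError("the rule is only sketched for N < 40")
    return x[n - 1 - drop]


# Made-up values, for illustration only.
data = [12.1, 9.8, 11.4, 10.2, 13.7, 10.9, 12.6, 11.0, 9.5, 14.2]
print(upper_tail_prediction(data))
```

The thresholds are the part to tune by experimentation; the structure (walk a little further down from the maximum as N grows) is the point of the sketch.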

0

You should make some assumptions about your distribution if you are looking into confidence intervals. With 20 data points that is possible. You should first plot your data, and then use a chi-square goodness-of-fit test to see whether it fits a hypothesized distribution (normal, exponential, Poisson, etc.). Only after you have specified the distribution can you bootstrap. Non-parametric methods do not deal with confidence intervals per se, since they are based on the median and ranked values. However, you can use rank-based measures such as quartiles or the 90th percentile to describe the data.

-Ralph Winters

  • 0
    You do not need the underlying distribution. All you need is the empirical distribution to generate bootstrap samples, which is just a uniform distribution over the observed data. 2011-02-14