
I asked a question on the statistics Stack Exchange, "Error of generalized classifier performance" (https://stats.stackexchange.com/questions/41400/error-of-generalized-classifier-performance):


I am working on a problem where it is expensive to label data and I have sampled a small subset of the available data and labeled it.

My classifier is a binary classifier that I use with the hope of removing samples that are "false" but keep samples that are "true".

My question is: how well does the classifier performance generalize to the full population?

Some numbers:

I sample 500 data records and label them (uniform sample).

    True | False
    -----+------
     180 |  320

If I assume a binomial distribution I can calculate (e.g. using R) a confidence interval as such:

    > binom.confint(180, 500, conf.level = 0.95, methods = "exact")
      method   x   n mean    lower     upper
    1  exact 180 500 0.36 0.317863 0.4038034
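(For readers without R: the same exact Clopper–Pearson interval can be reproduced in Python with SciPy. This is just a sketch cross-checking the numbers above, not part of the original question.)

```python
from scipy.stats import binomtest

# 180 "true" records out of 500 uniformly sampled, hand-labeled records
result = binomtest(k=180, n=500)

# Clopper-Pearson ("exact") interval, matching R's binom.confint(..., methods="exact")
ci = result.proportion_ci(confidence_level=0.95, method="exact")
print(180 / 500)         # point estimate: 0.36
print(ci.low, ci.high)   # bounds: ~0.3179 and ~0.4038
```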

If I then use a classifier and the performance is the following:

    === Confusion Matrix ===
       a   b   <-- classified as
     150  30 |   a = related
      33 287 |   b = unrelated

I know how I can calculate precision and F measure from this result, but what is the error?

How well does this generalize to the whole population? Can we make any error estimation with the help of the above confidence interval of the binomial distribution?

I have been doing some calculations where I assume that the true-positive and false-positive rates from the classification hold in general, but I know this is more or less a back-of-the-envelope estimate.


But I got no answers.

Let me reformulate the question a bit; I hope someone can point me in the right direction.

I have a classifier (a black box) that assigns a label to each record. What can I say about the error here?

In my data mining book they talk about estimating accuracy with the binomial distribution. But what I am interested in is estimating the true positive and false positive rates.

    Data \ Classifier |  T   |  F
    ------------------+------+-----
            T         | p_tp |
            F         | p_fp |

I have sampled 500 records that I have annotated with class labels. I can estimate the underlying distribution using the binomial to get a confidence interval for p(d=t). Here d stands for data and t for true.

But what about:

    p( c(d)=t | d=t )   // true positive rate
    p( c(d)=t | d=f )   // false positive rate
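Using the confusion matrix from above (rows = actual class, columns = classifier output), the point estimates of these two conditional probabilities fall straight out of the counts. A minimal sketch, plain arithmetic only:

```python
# Counts from the confusion matrix (rows = actual, columns = classified)
tp, fn = 150, 30    # actually "related":   150 classified a, 30 classified b
fp, tn = 33, 287    # actually "unrelated":  33 classified a, 287 classified b

tpr = tp / (tp + fn)   # estimate of p( c(d)=t | d=t ) = 150/180
fpr = fp / (fp + tn)   # estimate of p( c(d)=t | d=f ) = 33/320

print(f"TPR = {tpr:.4f}")  # 0.8333
print(f"FPR = {fpr:.4f}")  # 0.1031
```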

My education in statistics is regrettably limited (a five-week course at university), but I still want to see a solution even if I can't fully understand it.

1 Answer


Yes, normally we would use the Normal approximation to the binomial to produce a confidence interval for the true positive or false positive rates.

Wikipedia has the details here:

https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval

And the formula for the confidence interval is given by $\hat{p}\pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$ where $\hat{p}$ is the sample proportion of true positives (or false positives; the formula is symmetric), and $n$ is the sample size.
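As a sketch, here is that Wald interval applied to the counts from the question. Note the assumption: $n$ is the number of records that are actually true (180) for the true-positive rate, and actually false (320) for the false-positive rate, and the labeled sample is taken as representative of the population.

```python
import math
from scipy.stats import norm

def wald_ci(successes: int, n: int, conf: float = 0.95) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    z = norm.ppf(1 - (1 - conf) / 2)                    # ~1.96 for 95%
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# TPR: 150 of the 180 actually-true records were classified true
print(wald_ci(150, 180))
# FPR: 33 of the 320 actually-false records were classified true
print(wald_ci(33, 320))
```

The Wald interval is the simplest choice and matches the formula above; for small $n$ or proportions near 0 or 1, the Wilson or exact (Clopper–Pearson) intervals on the same Wikipedia page behave better.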

In your question you are attempting to use Bayes' theorem to determine the probabilities of false positives. This won't directly give you a confidence interval. Bayesian inference can give you a "credible interval", but that is a subtly different concept that is likely to be confusing if you don't have a strong stats background.

Shout if you need more clarification.

  • That should help. – 2012-10-30