I asked a question on the statistics Stack Exchange, "Error of generalized classifier performance" (https://stats.stackexchange.com/questions/41400/error-of-generalized-classifier-performance):
I am working on a problem where it is expensive to label data and I have sampled a small subset of the available data and labeled it.
My classifier is a binary classifier that I use with the hope of removing samples that are "false" while keeping samples that are "true".
My question is: how well does the classifier performance generalize to the full population?
Some numbers:
I sample 500 data records and label them (uniform sample).
True | False
 180 |   320
If I assume a binomial distribution I can calculate (e.g. using R) a confidence interval as such:
> binom.confint(180, 500, conf.level = 0.95, methods = "exact")
  method   x   n mean    lower     upper
1  exact 180 500 0.36 0.317863 0.4038034
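For readers without R at hand, much the same interval can be sketched in pure Python. This uses the Wilson score approximation rather than R's exact (Clopper-Pearson) method, so the endpoints differ slightly from the binom.confint output above; the counts 180 and 500 come from my sample.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion.

    An approximation to the exact (Clopper-Pearson) interval that
    binom.confint computes; close for moderate n, but not identical.
    """
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lower, upper = wilson_interval(180, 500)
print(f"p(true) in [{lower:.4f}, {upper:.4f}]")  # close to R's [0.3179, 0.4038]
```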
If I then use a classifier and the performance is the following:
=== Confusion Matrix ===
   a   b   <-- classified as
 150  30 | a = related
  33 287 | b = unrelated
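For concreteness, here is how precision, recall, and the F measure fall out of that matrix; a minimal Python sketch, with "related" treated as the positive class:

```python
# Cell counts copied from the confusion matrix, "related" as the positive class.
tp, fn = 150, 30   # related records: correctly kept / wrongly discarded
fp, tn = 33, 287   # unrelated records: wrongly kept / correctly discarded

precision = tp / (tp + fp)            # 150/183
recall    = tp / (tp + fn)            # 150/180, the true positive rate
f_measure = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} F={f_measure:.3f}")
```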
I know how I can calculate precision and F measure from this result, but what is the error?
How well does this generalize to the whole population? Can we make any error estimation with the help of the above confidence interval of the binomial distribution?
I have done some calculations assuming that the true-positive and false-positive rates observed in the sample hold in general, but I know this is more or less a back-of-the-envelope estimate.
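To make that back-of-the-envelope step concrete, here is the kind of calculation I mean, sketched in Python. The population size N below is a made-up illustration; the rates are the sample point estimates, taken at face value with no error bars:

```python
# Rates estimated from the labeled sample (point estimates only).
tpr = 150 / 180     # p(classifier says true | record is true)
fpr = 33 / 320      # p(classifier says true | record is false)
p_true = 180 / 500  # estimated prevalence of "true" records

N = 100_000  # hypothetical population size, for illustration only

kept_true  = N * p_true * tpr          # true records that survive filtering
kept_false = N * (1 - p_true) * fpr    # false records that slip through
print(f"kept: {kept_true:.0f} true, {kept_false:.0f} false "
      f"(precision ~ {kept_true / (kept_true + kept_false):.3f})")
```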
But I got no answers.
I can reformulate the question a bit and I hope someone can lead me in the right direction.
I have a classifier (a black box) that assigns a label to each record. What can I say about the error here?
My data mining book discusses estimating accuracy with the binomial distribution, but what I am interested in is estimating the true positive and false positive rates.
            Classifier
              T      F
  Data  T   p_tp
        F   p_fp
I have sampled 500 records and annotated them with class labels. I can estimate the underlying distribution using the binomial distribution to get a confidence interval for p(d=t), where d stands for a data record and t for true.
But what about:
p( c(d)=t | d=t )   // true positive rate
p( c(d)=t | d=f )   // false positive rate
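One crude way to attack exactly these quantities: condition on the true labels and treat each row of the confusion matrix as its own binomial experiment (150 of 180 for the true records, 33 of 320 for the false ones). Below is a hedged Python sketch using the Wilson score approximation; note it ignores the uncertainty in the 180/320 split itself, which a proper answer would presumably account for:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Rows of the confusion matrix as separate binomial experiments.
tpr_lo, tpr_hi = wilson_interval(150, 180)  # p( c(d)=t | d=t )
fpr_lo, fpr_hi = wilson_interval(33, 320)   # p( c(d)=t | d=f )
print(f"TPR in [{tpr_lo:.3f}, {tpr_hi:.3f}]")
print(f"FPR in [{fpr_lo:.3f}, {fpr_hi:.3f}]")
```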
My education in statistics is regrettably limited (a five-week course at university). But I would still like to see a solution, even if I perhaps cannot fully understand it.