It's been a while since my last statistics class...
I have 404 files that went through some automated generation process. I would like to manually verify some of them to make sure that their data is indeed correct. I want to use probability to help me out so that I don't need to check every single file.
How would I calculate what sample size I should use to reach a certain confidence level?
For example, if I would like to say with 95% confidence that the files are correct, how many of them do I have to check?
I found an online calculator, but I'm not entirely sure what I should enter for the confidence interval (that is, the margin of error). Say I enter 20% and leave the confidence level at 95%: I get a sample size of 23. Now suppose I test 23 random files and all of them are fine. Does that mean that "I can be 95% confident that 80% to 100% of the files are correct"?
Does this mean that, for my original question, I would need to use a 99% confidence level with a 4% confidence interval, and then verify that all 291 files (the sample size it gave me) are correct? Only then could I say with 95% confidence that the files are correct? (99% ± 4% = 95% to 100%)
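For reference, both sample sizes quoted above (23 and 291) are consistent with the standard sample-size formula for a proportion, n0 = z^2 * p * (1 - p) / e^2, followed by a finite population correction for N = 404. The sketch below is only an assumption about what the calculator does internally: it assumes the worst-case proportion p = 0.5, and the function name `sample_size` is made up for illustration.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence, margin, p=0.5):
    """Sample size for estimating a proportion, with finite population correction.

    Assumption: this mirrors what the online calculator does; that is not confirmed.
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Size needed for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Shrink it for a finite population, then round up to whole files.
    return math.ceil(n0 / (1 + (n0 - 1) / population))


print(sample_size(404, 0.95, 0.20))  # 23
print(sample_size(404, 0.99, 0.04))  # 291
```

Note that the margin of error and the confidence level are separate inputs to this formula; nothing in it combines them by adding or subtracting one from the other.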
The calculator also mentions something about percentages, which I'm not quite clear on. Since all (i.e. 100%) of the files I test will be valid (if I found an invalid one, I would stop the whole process and examine my generation process for errors), does that mean I can use a smaller sample to get the same confidence level? If so, how would I calculate it?
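To make the percentage question concrete: in the standard sample-size formula for a proportion, the expected percentage p enters through the term p * (1 - p), which is largest at p = 0.5 and shrinks as p approaches 0 or 1. The sketch below assumes the calculator's percentage note refers to this p and that it uses this formula with a finite population correction; the constants and the helper name `size_for_p` are illustrative, not taken from the calculator.

```python
import math
from statistics import NormalDist

POPULATION = 404   # number of files in the question
CONFIDENCE = 0.99  # confidence level from the second calculator run
MARGIN = 0.04      # the calculator's "confidence interval" (margin of error)


def size_for_p(p):
    """Sample size when the expected proportion of valid files is p.

    Assumption: p is what the calculator's percentage note refers to.
    """
    z = NormalDist().inv_cdf(1 - (1 - CONFIDENCE) / 2)
    # p * (1 - p) peaks at p = 0.5, so 0.5 is the most conservative choice.
    n0 = z ** 2 * p * (1 - p) / MARGIN ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / POPULATION))


for p in (0.5, 0.7, 0.9, 0.99):
    print(f"p = {p:.2f} -> sample size {size_for_p(p)}")
```

At p = 1 this formula returns a sample size of 0, and the normal approximation behind it is unreliable when p is close to 0 or 1, so on its own it cannot settle the case where every tested file passes.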
