
Suppose we have many identical boxes, so many that we cannot open all of them.

Each box contains finitely many balls. The number of balls in a box can be very large, but it is finite. The balls are identical except for their color.

The colors include red, white, blue, black, yellow, etc. The total number of colors is finite, but we do not know it in advance. Each color can be tagged as dark, light or intermediate. Although we do not know in advance how many colors there are, a fixed oracle can classify any new color into one of {dark, light, intermediate}.

To summarize: a box has several balls, each ball has a color, and each color is classified as one of {dark, light, intermediate}.

The QUESTION is: how should we SAMPLE from the population of boxes and ESTIMATE the proportions of dark, light and intermediate colors? (The original question also asked for the total number of colors; that part is withdrawn, see the UPDATE below.)

(UPDATE: it is unlikely that we can estimate the number of colors in this problem. If we don't open all the boxes, many colors will be missed, and we have no extra information that would tell us how many.)

It would be ideal if someone could point me to textbooks or papers on this problem.

Example & explanation: 3 boxes arranged as {red,blue,white,black}, {red,blue,yellow}, {red}.

Suppose the oracle tells us that the colors are classified as {dark: black}, {light: white, yellow}, {intermediate: red, blue}.

The total number of colors is 5, and the dark:light:intermediate ratio is 0.2:0.4:0.4. There is only one white ball and one yellow ball, compared with three red balls and two blue balls, but every color counts once regardless of how many balls carry it. The ball counts therefore do not affect the light and intermediate proportions.
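As a minimal sketch of how the target quantity is defined, the example above can be computed directly when every box is opened. The names `boxes`, `oracle` and `class_proportions` are illustrative choices, not part of the question, and the oracle is modelled as a plain dictionary.

```python
from collections import Counter

# The three boxes from the example, each listed as the colors of its balls.
boxes = [
    ["red", "blue", "white", "black"],
    ["red", "blue", "yellow"],
    ["red"],
]

# Hypothetical stand-in for the oracle: maps a color to its class.
oracle = {
    "black": "dark",
    "white": "light",
    "yellow": "light",
    "red": "intermediate",
    "blue": "intermediate",
}

def class_proportions(boxes, oracle):
    """Return (number of distinct colors, proportion of colors per class).

    Each distinct color counts once, no matter how many balls have it.
    """
    distinct = {color for box in boxes for color in box}
    counts = Counter(oracle[color] for color in distinct)
    total = len(distinct)
    classes = ("dark", "light", "intermediate")
    return total, {cls: counts[cls] / total for cls in classes}

total, proportions = class_proportions(boxes, oracle)
print(total, proportions)
```

Applying the same function to a random subset of boxes gives the naive plug-in estimate, which ignores whatever colors the sample missed.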

PS: I think this problem is more closely related to combinatorics than to statistics, which is why I posted it here instead of on Cross Validated.

  • 0
    Statistics consists largely of the use of probabilistic methods (including combinatorics) to estimate quantities. You are trying to estimate quantities using combinatorial (but really probabilistic) reasoning. It seems to me this is a statistics problem. (I'm not telling you where to post it, however.) 2017-01-17
  • 0
    I think you should also post to Cross Validated, since you are asking how to design a sampling scheme and estimators for various quantities (number of colors, proportions of color classes, etc.). Also, I do not see how the boxes matter in the problem. Are you choosing entire boxes? How does the sampling work? I'm not entirely sure, but since you have left a lot of information unspecified, a Bayesian approach may be helpful. 2017-01-17
  • 0
    I'm not sure there is a good answer without some a priori idea about the likelihood of colors. Is it possible there is just one yellow ball in all of the boxes? Or do we have good reason to believe that if there are both red and yellow balls, the number of red balls will be not too different from the number of yellow balls? 2017-01-17
  • 0
    Compare this to the problem of estimating how many different species of living organisms currently exist on Earth. That is a difficult problem, and various published studies give estimates that do not even come close to agreeing on the order of magnitude of the number. 2017-01-17
  • 0
    @DavidK It's possible that there is just one yellow ball. The numbers of red balls and yellow balls are independent. 2017-01-18
  • 0
    @angryavian Suppose we have 1 billion boxes, and our budget lets us open at most 1k boxes. Since all the boxes are identical, it seems that we can only use uniform sampling. 2017-01-18
  • 0
    If you open 1000 out of 1 billion boxes, the chance that you'll be aware of a single yellow ball is 1 in a million. If there are a thousand "singleton" colors that you can only discover by finding the single ball of that type, your chance to know that _any_ of them exist is less than 1/1000. So maybe you come up with an answer like, "There are at least 113 colors (because I've seen each of them), and I'm confident there are no more than three million colors (because if there were _that_ many, chances were 95% that I would have seen at least one more color)." 2017-01-18
  • 0
    @DavidK You are right that some colors are likely to be missed if we do not open all the boxes. However, our goal is to estimate the distribution of the color types. Without more information, we can only assume that colors of the three types are missed in the same proportion (maximum entropy principle). 2017-01-20
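The arithmetic in the comment about opening 1000 of 1 billion boxes can be checked numerically. This is a sketch under two assumptions that the thread does not state explicitly: boxes are sampled uniformly without replacement, and each singleton color sits in a different box. The function name `prob_find_any_singleton` is illustrative.

```python
import math

def prob_find_any_singleton(n_boxes, n_opened, n_singletons):
    """Probability that a uniform sample of n_opened boxes (without
    replacement) contains at least one of n_singletons special boxes.

    P(none found) = prod_{i=0}^{k-1} (N - n - i) / (N - i), accumulated in
    log space so the small result does not suffer from rounding.
    """
    log_p_none = sum(
        math.log1p(-n_opened / (n_boxes - i)) for i in range(n_singletons)
    )
    return -math.expm1(log_p_none)

# One specific singleton color: 1000 / 1e9, i.e. 1 in a million.
p_one = prob_find_any_singleton(10**9, 1000, 1)

# A thousand singleton colors in distinct boxes: just under 1/1000.
p_any = prob_find_any_singleton(10**9, 1000, 1000)
print(p_one, p_any)
```

If several singleton colors share a box, the chance of discovering any of them is smaller still, so 1/1000 remains an upper bound.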

1 Answer

0

The Good-Turing estimator can be used in this scenario.
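The answer does not say how the estimator would be applied, so the following is only a sketch of the basic Good-Turing idea: the total probability of colors not yet seen is estimated as the number of colors observed exactly once divided by the number of observations. Treating each ball in an opened box as one observation is an assumption made here for illustration, and `good_turing_missing_mass` is a hypothetical helper name.

```python
from collections import Counter

def good_turing_missing_mass(observations):
    """Good-Turing estimate of the probability mass of unseen categories.

    observations: iterable of observed colors, one entry per sighting.
    Returns N1 / N, where N1 is the number of colors seen exactly once
    and N is the total number of sightings.
    """
    counts = Counter(observations)
    n_total = sum(counts.values())
    if n_total == 0:
        raise ValueError("need at least one observation")
    n_singletons = sum(1 for c in counts.values() if c == 1)
    return n_singletons / n_total

# Sightings from the question's three example boxes, flattened.
sightings = [
    "red", "blue", "white", "black",
    "red", "blue", "yellow",
    "red",
]
missing_mass = good_turing_missing_mass(sightings)
print(missing_mass)  # 3 singletons (white, black, yellow) out of 8 sightings
```

This estimates how much probability belongs to colors the sample has missed. It does not by itself give the number of distinct colors or the dark/light/intermediate split among the missed ones.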