4

Let's say we have a parameter $r$ and a binary event $A$ that happens repeatedly. The event is binary, so the outcome is either $0$ or $1$. We have collected a lot of data of the form $\{\{r_1,A_1\},\{r_2,A_2\},\cdots,\{r_n,A_n\}\}$ where $r_i\in\mathbb{R}$ and $A_i\in\{0,1\}$.

For example: $\{\{-3,0\},\{-2,1\},\{2,1\},\{2,1\},\{1,0\}\}$

Can we somehow estimate the probability of $A$ being $1$ for a given $r$? From the example data, it seems that $A=1$ is quite likely when $r=2$. But the data sample is very large and I'm at a loss as to how to estimate this probability. When there are a lot of positive outcomes for certain values of $r$, then that increases the probability of a positive outcome for other values close to $r$.

How can all this be accumulated in order to predict (and how confidently) the probability of a positive outcome once we set an arbitrary $r$?

  • 0
    About 130000 $\{r,A\}$-pairs. 2011-10-06

1 Answer

4

Your data follow a bivariate distribution with one variable being either 0 or 1. So you should filter your data into $r_{i|0}$ and $r_{i|1}$, and either construct histograms for each type, or perform smooth kernel density estimation, assuming that $r$ follows a continuous distribution, or perform parameter fitting for a suitable family of parametric distributions.
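A minimal sketch of the splitting and kernel-smoothing step, in plain Python. The function names `split_by_outcome` and `gaussian_kde` and the bandwidth `h` are illustrative choices, not something taken from the question; the bandwidth in particular is an assumption and would need tuning on the real data.

```python
import math

def split_by_outcome(pairs):
    """Split (r, A) pairs into the r-values seen with A == 0 and with A == 1."""
    r0 = [r for r, a in pairs if a == 0]
    r1 = [r for r, a in pairs if a == 1]
    return r0, r1

def gaussian_kde(sample, h):
    """Return a density estimate built from Gaussian kernels of bandwidth h."""
    n = len(sample)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample)

    return density

# The toy data from the question.
data = [(-3, 0), (-2, 1), (2, 1), (2, 1), (1, 0)]
r0, r1 = split_by_outcome(data)
f0 = gaussian_kde(r0, h=1.0)  # bandwidth 1.0 is an arbitrary assumption
f1 = gaussian_kde(r1, h=1.0)
```

Here `f0` and `f1` approximate the densities of $r$ within each outcome group. Evaluating a kernel estimate this way costs one pass over the sample per query point, so for about 130000 pairs a histogram or a library implementation may be more practical.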

Assume that this has been done, and frame your question in probabilistic language. Knowing what you need might help you find a shortcut to the answer.
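One possible way to phrase it, as a sketch with notation introduced here rather than taken from the question: write $f_0$ and $f_1$ for the densities of $r$ among the observations with $A=0$ and $A=1$ respectively, and $p=P(A=1)$ for the overall probability of a positive outcome. Bayes' rule then gives

$$P(A=1\mid r)=\frac{p\,f_1(r)}{p\,f_1(r)+(1-p)\,f_0(r)}.$$

Replacing $p$ by the observed fraction of positive outcomes and $f_0$, $f_1$ by their estimates (histogram, kernel, or parametric) turns this into an estimate of the probability asked for. It is only meaningful where the denominator is not close to zero, that is, for values of $r$ near which data were actually observed.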

I know this is a vague answer, but you'll need to make your question more precise for me to try to do better.


Added: Given that the OP intends $r$ to be a continuous variable, it is better to condition on an event of non-zero measure. For a continuous variable, the probability that it equals exactly $4.23$ is zero. It is better to condition on $r$ lying within some, possibly small, interval.

Here is a Mathematica simulation based on a fictitious dataset:

[Figure: Mathematica simulation on a fictitious dataset]

To do it on your own, count the number of graduating students in your dataset whose grade lies in the specified range, and divide by the total number of students with grades in that range.
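A minimal sketch of that count in plain Python, assuming the grade plays the role of $r$ and graduating corresponds to $A=1$. The function name `window_estimate` and the parameter `half_width` are hypothetical. To address the "how confidently" part of the question, the sketch also returns a Wilson score interval, which is one common choice among several and assumes the observations in the window are independent.

```python
import math

def window_estimate(pairs, r, half_width, z=1.96):
    """Estimate P(A = 1 | |r_i - r| <= half_width) with a Wilson score interval.

    Returns (estimate, lower, upper, count), or None if the window is empty.
    z = 1.96 corresponds to roughly 95% coverage.
    """
    outcomes = [a for ri, a in pairs if abs(ri - r) <= half_width]
    n = len(outcomes)
    if n == 0:
        return None
    p_hat = sum(outcomes) / n  # positives in the window / all pairs in the window
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return p_hat, centre - margin, centre + margin, n

# The toy data from the question; the window [1, 3] around r = 2 is an arbitrary choice.
data = [(-3, 0), (-2, 1), (2, 1), (2, 1), (1, 0)]
result = window_estimate(data, r=2, half_width=1.0)
```

A wider window uses more data and gives a narrower interval, but it averages over values of $r$ that may behave differently, so the width is a trade-off to settle on the real data.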

  • 0
    The important point is to choose your conditioning so that it has non-zero probability. If your grades are all exactly 5, you should choose a small interval around 5. It does not matter if it extends a little beyond the maximal grade: the inequalities will still be satisfied, and your filtering will still produce a non-empty sample. 2011-10-06