
$\DeclareMathOperator \Cov {Cov}$ $\DeclareMathOperator \Var {Var}$ $\DeclareMathOperator \E {E}$

Consider the following experiment:

For $N \geq 1$, consider $N$ black balls. Paint each black ball green with probability $\lambda_G/\lambda_N$, where each $\lambda_x$ is a Poisson rate parameter and $\lambda_N = \lambda_G + \lambda_B$. Then repaint each green ball red with probability $\lambda_R/\lambda_G$, where $\lambda_G = \lambda_D + \lambda_R$.

Let $S$ be the proportion of black balls painted green and $E$ the proportion of green balls painted red. A previous post showed that $E$ and $S$ are independent with covariance zero.

Now suppose we repeat the above experiment $T$ times randomly painting balls with a fixed $N$ (e.g., $N = 10$), keeping track of each $(E_t,S_t)$.

After obtaining all $(E_t,S_t)$ pairs, we compute the Pearson product-moment (linear) correlation coefficient

$ \rho_m = \frac{\Cov(E,S)}{\sqrt{\Var(E)\Var(S)}} = \frac{\E[(E - \mu_E)(S - \mu_S)]}{\sqrt{\Var(E)\Var(S)}}$
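To make the setup concrete, here is a minimal simulation sketch of the experiment and of $\rho_m$. The function names, the default probabilities of 0.5 (standing in for $\lambda_G/\lambda_N$ and $\lambda_R/\lambda_G$) and the choice to redraw any experiment in which no ball turns green (so that $E$ is defined) are my own assumptions for illustration, not part of the original setup.

```python
import math
import random

def paint_experiment(n_balls, p_green, p_red, rng):
    """One experiment: return (E, S), or None if no ball turned green."""
    n_green = sum(rng.random() < p_green for _ in range(n_balls))
    if n_green == 0:
        return None  # E = red / green is undefined (assumption: redraw)
    n_red = sum(rng.random() < p_red for _ in range(n_green))
    return n_red / n_green, n_green / n_balls

def pearson(xs, ys):
    """Sample Pearson correlation; nan if either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def simulate_rho(T, n_balls=10, p_green=0.5, p_red=0.5, seed=0):
    """Repeat the experiment T times and return rho_m over the (E_t, S_t) pairs."""
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < T:
        result = paint_experiment(n_balls, p_green, p_red, rng)
        if result is not None:
            pairs.append(result)
    E, S = zip(*pairs)
    return pearson(E, S)
```

Calling `simulate_rho(T)` with different seeds for, say, `T = 5` and `T = 5000` gives a feel for how widely $\rho_m$ scatters at small $T$.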

As $T \rightarrow \infty$, we expect $\Cov(E,S) \rightarrow 0$ and hence $\rho_m \rightarrow 0$. For small $T$, however, we might by chance obtain $\rho_m$ values close to $-1$ or $+1$. If we then use re-sampling techniques (e.g., permutation tests) to construct confidence intervals and extract $p$ values, we might find that some of these windowed correlations are significant.
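For reference, the kind of permutation test I have in mind can be sketched as follows. This is only an illustrative version, not the exact code behind the plots: the function names, the default of 2000 permutations and the add-one correction in the returned $p$ value are assumptions.

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation; nan if either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation p value for the correlation between xs and ys."""
    observed = pearson(xs, ys)
    if math.isnan(observed):
        return float("nan")  # correlation undefined, so no test is possible
    rng = random.Random(seed)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any pairing between xs and ys
        # small tolerance guards against floating-point round-off in ties
        if abs(pearson(xs, shuffled)) >= abs(observed) - 1e-12:
            hits += 1
    # add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_perm + 1)
```

Applied to the $T$ pairs of one run, `permutation_p_value(E_values, S_values)` returns the share of shuffles whose $|\rho_m|$ is at least as large as the observed one.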

Simulated experiments as outlined above yield $\rho_m$ distributions as shown below:

[Figure: Windowed correlation simulations: correlation versus bootstrapped significance]

[Figure: Windowed correlation simulations: correlation versus $T$, the number of experimental repeats]

In the Correlation versus NumWindows plot, $numWindows = T$, the number of experimental repeats, with $N = 10$. Correlations with $p$ values less than 0.15 are plotted in green and with $p$ values less than 0.05 in red. As expected, as $T$ increases, we observe fewer 'extreme' $\rho_m$ values and fewer 'significant' correlations.

My question is then as follows: suppose I know of specific cases in which $E$ and $S$ are correlated (either positively or negatively) due to factors other than chance. Not knowing beforehand which experiments will show significant correlations not due to chance, I want to identify those experiments showing such correlation, separating them from experiments in which $E$ and $S$ are in fact independent.

(1) Given the prevalence of false positives (significant correlations due to chance), how can I identify and separate out those experiments showing correlations not due to chance?

(2) Additionally, does the $\rho_m$ distribution for windowed correlation have an analytical expression? And could such an expression be used to determine if the observed set of experiments exceeds the expected distribution of correlations arising from chance alone?

(3) What if, instead of $T$ experiments, I simply have $N = 200$ balls and slide a window of $n = 10$ along the ordered balls, calculating $E$ and $S$ in each window? And what if, instead of using non-overlapping windows, I allow the windows to overlap by sliding the window 1 ball at a time (numWindows $= N - n + 1$)? How would this change the distributions, and would any other factors concerning the $\rho_m$ values need to be taken into account (besides the fact that more windows reduce the spread in $\rho_m$)?
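To pin down what I mean in (3), here is a small sketch of the sliding-window calculation. The single-letter labels, the function names and the convention of returning `None` for $E$ when a window contains no green ball are assumptions made for illustration.

```python
import random

def paint_balls(n_balls, p_green, p_red, rng):
    """Label each ball 'B' (left black), 'G' (green only) or 'R' (green, then red)."""
    labels = []
    for _ in range(n_balls):
        if rng.random() < p_green:
            labels.append("R" if rng.random() < p_red else "G")
        else:
            labels.append("B")
    return labels

def sliding_windows(labels, n, step=1):
    """Yield (E, S) per window of length n; E is None if the window has no green ball."""
    for start in range(0, len(labels) - n + 1, step):
        window = labels[start:start + n]
        n_red = window.count("R")
        n_green = window.count("G") + n_red  # red balls were green first
        S = n_green / n
        E = n_red / n_green if n_green else None
        yield E, S
```

With `step=1`, 200 balls and `n=10` this yields 191 windows, and with `step=10` it yields the 20 non-overlapping ones. Adjacent overlapping windows share $n - 1$ balls, so successive $(E, S)$ pairs are not independent draws.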

  • @MichaelChernick And also, even if distinguishing true correlations from false positives is not possible, would I be able to say anything about the global picture? For example, after conducting $m$ experiments, would I be able to say, in a statistically robust way, that this population of experiments has a greater proportion of significant correlations than could be attributed to chance alone? Knowing this would also be informative. (2012-07-26)
