
Say I have an infinite collection of $0/1$ random strings of length $n$ (e.g., $010001110101$), where each digit is an independent Bernoulli RV with parameter $p_i$, $i=1,\dots,n$. Define the "trait value" of a string to be its number of $1$s. Define $X$ to be the RV giving the absolute difference between the trait values of two sampled strings. Define $Y$ to be the RV giving the "distance" between the two sampled strings, where this distance is simply the number of mismatched digits (e.g., the distance between $01101$ and $00111$ is $2$), also known as the Hamming distance.

Are $X$ and $Y$ correlated? If so, given $n$ and $p_i$, $i=1,\dots,n$, what is the covariance?
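To make the definitions concrete, here is a small Python sketch (the length and parameter values are just hypothetical examples) that samples two strings and computes $X$ and $Y$:

```python
import random

def sample_string(p):
    """Draw one 0/1 string: digit i is Bernoulli(p[i]), independently."""
    return [1 if random.random() < pi else 0 for pi in p]

def trait(s):
    """Trait value: the number of 1s in the string."""
    return sum(s)

def hamming(s, t):
    """Number of mismatched digits (Hamming distance)."""
    return sum(a != b for a, b in zip(s, t))

p = [0.5, 0.2, 0.8, 0.5, 0.3]   # hypothetical parameters p_1..p_5
s, t = sample_string(p), sample_string(p)
X = abs(trait(s) - trait(t))    # absolute trait-value difference
Y = hamming(s, t)               # number of mismatched digits
```

Note that $X\leqslant Y$ always holds, since each mismatched digit changes the trait difference by at most one; the question is how strongly the two are correlated.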

  • When you speak of the *difference between the trait values*, are you interested in the *signed* difference or the *absolute* difference? Also, how familiar are you with the binomial and Bernoulli distributions? (2012-05-26)
  • The absolute difference (just corrected). Indeed, the 0/1 digits are Bernoulli RVs, but the trait value ($X$) and the distance ($Y$) are not binomial... (2012-05-27)
  • The 0/1 digits of the strings are Bernoulli RVs with parameters $p_1,p_2,\dots,p_n$. This defines the global distribution of the strings. (2012-05-27)
  • Certainly not, this only confirms that the digits are 0/1 valued (which we already knew). What is missing is a description of their *dependency*. (2012-05-28)
  • As Didier says, you need to specify whether the bits are independent or not. BTW, your $Y$ is commonly called the Hamming distance. (2012-05-28)
  • Yes, sorry for not being specific. The bits (Bernoulli RVs) are independent. (2012-05-28)

1 Answer


For two independently sampled strings, let $Z_i$ denote the difference between their $i$-th digits. Thus, $X=\left|\sum\limits_iZ_i\right|$ and $Y=\sum\limits_i|Z_i|$, where the random variables $(Z_i)$ are independent and $Z_i=0$, $+1$ or $-1$ with probability $1-a_i$, $\frac12a_i$ and $\frac12a_i$ respectively, with $a_i=2p_i(1-p_i)$.

The covariance of $X$ and $Y$ is not easy to compute, but the covariance of $X^2$ and $Y$ is. To begin with, $\mathrm E(Z_i)=0$ and $\mathrm E(Z_i^2)=a_i$, hence $\mathrm E(X^2)=\sum\limits_ia_i$. Likewise, $Z_i^2=|Z_i|$ with full probability, hence $Y=\sum\limits_iZ_i^2$ and $\mathrm E(Y)=\sum\limits_ia_i$. Expanding the product (all sums below are over pairwise distinct indices), $$ X^2Y=\sum\limits_iZ_i^4+\sum\limits_{i\ne j}Z_i^2Z_j^2+2\sum\limits_{i\ne j}Z_i^3Z_j+\sum\limits_{i,j,k}Z_i^2Z_jZ_k. $$ Since $Z_i^4=|Z_i|$ with full probability and the odd moments $\mathrm E(Z_i)$ and $\mathrm E(Z_i^3)$ vanish, $$ \mathrm E(X^2Y)=\sum\limits_ia_i+\sum\limits_{i\ne j}a_ia_j, $$ and finally, $$ \mathrm{Cov}(X^2,Y)=\mathrm E(X^2Y)-\mathrm E(X^2)\mathrm E(Y)=\sum\limits_ia_i(1-a_i)=\sum\limits_i2p_i(1-p_i)(1-2p_i(1-p_i)). $$ Edit: For each $i$, write $S_i=\sum\limits_{j\ne i}Z_j$ and introduce $$ A_i=\mathrm E|S_i|,\qquad B_i=\mathrm E|1+S_i|. $$ Decomposing the expectations along the values of $Z_i$ (and using the symmetry of the distribution of $S_i$), one gets $\mathrm E(X\cdot|Z_i|)=a_iB_i$ and $\mathrm E(X)=a_iB_i+(1-a_i)A_i$, hence $$ \mathrm{Cov}(X,|Z_i|)=a_iB_i-a_i\left(a_iB_i+(1-a_i)A_i\right)=a_i(1-a_i)(B_i-A_i). $$ Since $S_i$ is integer-valued and symmetric, and $|1+s|-|s|$ equals $+1$ for $s\geqslant0$ and $-1$ for $s\leqslant-1$, $$ B_i-A_i=\mathrm P(S_i\geqslant0)-\mathrm P(S_i\leqslant-1)=\mathrm P(S_i\geqslant0)-\mathrm P(S_i\geqslant1)=\mathrm P(S_i=0)\gt0, $$ where the positivity holds because every $a_j=2p_j(1-p_j)\leqslant\frac12$, so $\mathrm P(S_i=0)\geqslant\prod\limits_{j\ne i}(1-a_j)\gt0$. Summing over $i$, $$ \mathrm{Cov}(X,Y)=\sum_i\mathrm{Cov}(X,|Z_i|)=\sum\limits_ia_i(1-a_i)\,\mathrm P(S_i=0)\gt0, $$ provided at least one $p_i$ lies in $(0,1)$ (otherwise every $a_i=0$ and $X=Y=0$ almost surely).
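As a sanity check on the closed form $\mathrm{Cov}(X^2,Y)=\sum_ia_i(1-a_i)$, one can simulate the $Z_i$ directly; this is a Monte Carlo sketch, with arbitrary hypothetical parameters and sample size:

```python
import random

def covariance(xs, ys):
    """Sample covariance (1/N normalization)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

p = [0.5, 0.2, 0.8, 0.4]                      # hypothetical p_i
a = [2 * pi * (1 - pi) for pi in p]           # a_i = 2 p_i (1 - p_i)
exact = sum(ai * (1 - ai) for ai in a)        # claimed Cov(X^2, Y)

random.seed(0)
x2s, ys = [], []
for _ in range(200_000):
    # Z_i = 0, +1, -1 with probabilities 1 - a_i, a_i/2, a_i/2
    z = [random.choices((0, 1, -1), weights=(1 - ai, ai / 2, ai / 2))[0]
         for ai in a]
    x2s.append(sum(z) ** 2)                   # X^2 = (sum_i Z_i)^2
    ys.append(sum(abs(zi) for zi in z))       # Y = sum_i |Z_i|

approx = covariance(x2s, ys)                  # should be close to `exact`
```

With these parameters the exact value is $0.9348$, and the simulated covariance should agree up to Monte Carlo error.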

  • Thanks much. So we get that $\mathrm{Cov}(X^2,Y)>0$, which is rather intuitive. It follows that $\mathrm{Cov}(X,Y)>0$ as well, right? Also, I am still quite interested in how strong the correlation would be, i.e., in finding $\mathrm{Corr}(X,Y)$. (2012-05-28)
  • Thanks again... so we now have a lower bound on the covariance, and by extension on the correlation. (2012-06-09)
  • Is there any way I could get [a] a good upper bound on $\mathrm{Cov}(X,Y)$, and [b] lower/upper bounds on $r(X,Y)$, the correlation coefficient? (2012-07-23)
  • I went back to this problem (I am actually using the proof for a paper in genetics), and it seems the convexity inequality in an earlier revision, $B_i\geqslant1+A_i$, should be the other way around: the triangle inequality gives $B_i\leqslant1+A_i$. Please let me know what you think. (2015-07-06)
  • Please see my recent concern, @Did... (2015-07-07)