
Consider the set of sequences of zeroes and ones of length $N$ with $k$ ones (or, $Np$ ones where $p = k/N$). We draw randomly and uniformly a sequence from this set.

I want to show that with probability tending to $1$ as $N\to\infty$, there are approximately $k/2$ (or $Np/2$) ones in the first half of this sequence.

Thank you!

  • Hmm.. I see. If we can assume that $Np$ is always an integer, do you see another simple solution? Moreover, I can't see how it hurts your solution, which I think is very general? (2012-11-21)

2 Answers


Define $\mathbb S(n,k)$ as the set of sequences of zeroes and ones of length $n$ with $k$ ones. A crucial fact is that one can draw an element of $\mathbb S(n,k)$ uniformly at random by conditioning the uniform distribution on the set $\{0,1\}^n$.

To be more precise, consider some i.i.d. random variables $(X_i)_{i\geqslant 1}$ uniformly distributed on $\{0,1\}$ and let $S_n=X_1+\cdots+X_n$. Then, conditionally on $[S_n=k]$, the random element $(X_i)_{1\leqslant i\leqslant n}$ is uniformly distributed on $\mathbb S(n,k)$.
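To make the equivalence concrete, here is a minimal Python sketch (the function names are mine, not from the answer): it draws from $\mathbb S(n,k)$ both by literally rejecting draws from $\{0,1\}^n$ until $S_n=k$, and by the equivalent, more efficient shuffle of a fixed multiset, then checks that all $\binom{6}{3}=20$ sequences appear with roughly equal frequency.

```python
import random
from collections import Counter

def draw_by_conditioning(n, k):
    """Rejection sampling: draw uniform bits until the sum equals k.

    This realizes the conditioning on [S_n = k] literally; it is
    inefficient but mirrors the argument above exactly.
    """
    while True:
        x = [random.randint(0, 1) for _ in range(n)]
        if sum(x) == k:
            return tuple(x)

def draw_by_shuffle(n, k):
    """Equivalent, efficient draw: shuffle k ones and n - k zeroes."""
    x = [1] * k + [0] * (n - k)
    random.shuffle(x)
    return tuple(x)

if __name__ == "__main__":
    n, k, trials = 6, 3, 100_000
    counts = Counter(draw_by_conditioning(n, k) for _ in range(trials))
    # All C(6,3) = 20 sequences should appear with frequency ~ 1/20 = 0.05.
    print(len(counts), min(counts.values()) / trials, max(counts.values()) / trials)
```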

By symmetry, $x=\mathbb E(X_i\mid S_n=k)$ and $y=\mathbb E(X_iX_j\mid S_n=k)$ do not depend on the indices $i\ne j$ between $1$ and $n$. Summing these and using the fact that $X_i^2=X_i$ for every $i$, one gets $x=k/n$ and
$$k^2=n\mathbb E(X_i^2\mid S_n=k)+n(n-1)\mathbb E(X_iX_j\mid S_n=k)=k+n(n-1)y,$$
hence
$$y=\frac{k(k-1)}{n(n-1)}.$$
For every $i\leqslant n$, one gets $\mathbb E(S_i\mid S_n=k)=ix$ and
$$\mathrm{var}(S_i\mid S_n=k)=ix+i(i-1)y-i^2x^2=z,\qquad\text{where}\quad z=\frac{ik(n-i)(n-k)}{n^2(n-1)}.$$
Let $\varepsilon$ denote a positive parameter, and $\mathbb S(n,k,i,\varepsilon)$ the portion of the set $\mathbb S(n,k)$ made of sequences of zeroes and ones of length $n$ with $k$ ones such that the initial fragment of length $i$ has fewer than $(1-\varepsilon)ik/n$ or more than $(1+\varepsilon)ik/n$ ones.
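As a quick sanity check of the closed forms $\mathbb E(S_i\mid S_n=k)=ik/n$ and $z=\frac{ik(n-i)(n-k)}{n^2(n-1)}$, here is a Monte Carlo sketch assuming the shuffle draw from the previous snippet (function name is mine):

```python
import random

def conditional_moments(n, k, i, trials=200_000):
    """Monte Carlo estimates of E(S_i | S_n = k) and var(S_i | S_n = k)."""
    base = [1] * k + [0] * (n - k)
    s = s2 = 0
    for _ in range(trials):
        random.shuffle(base)
        si = sum(base[:i])  # number of ones in the initial fragment
        s += si
        s2 += si * si
    mean = s / trials
    var = s2 / trials - mean * mean
    return mean, var

if __name__ == "__main__":
    n, k, i = 40, 16, 20
    mean, var = conditional_moments(n, k, i)
    print("empirical:", mean, var)
    print("formulas :", i * k / n, i * k * (n - i) * (n - k) / (n**2 * (n - 1)))
```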

Then the size of $\mathbb S(n,k,i,\varepsilon)$ divided by the size of $\mathbb S(n,k)$ is exactly the conditional probability of the event $[|S_i-\mathbb E(S_i\mid S_n=k)|\geqslant\varepsilon ix]$ given $[S_n=k]$, which, by the Bienaymé-Chebyshev inequality, satisfies
$$\mathbb P(|S_i-\mathbb E(S_i\mid S_n=k)|\geqslant\varepsilon ix\mid S_n=k)\leqslant\frac{z}{\varepsilon^2i^2x^2}=\frac{A(n,k,i)}{\varepsilon^2},$$
with
$$A(n,k,i)=\frac{(n-i)(n-k)}{ik(n-1)}.$$
In other words, for every $(n,k,i,\varepsilon)$, the proportion of sequences in $\mathbb S(n,k)$ whose initial portion of length $i$ has a number of ones which differs from its mean value $ik/n$ by more than $\varepsilon ik/n$ is at most $A(n,k,i)/\varepsilon^2$.
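A small numerical illustration of this bound (helper name is mine; Chebyshev-type bounds are typically far from tight, so expect the empirical proportion to be well below $A(n,k,i)/\varepsilon^2$):

```python
import random

def deviation_proportion(n, k, i, eps, trials=200_000):
    """Empirical proportion of sequences in S(n,k) whose first i entries
    contain a number of ones deviating from i*k/n by more than eps*i*k/n."""
    base = [1] * k + [0] * (n - k)
    mean = i * k / n
    bad = 0
    for _ in range(trials):
        random.shuffle(base)
        if abs(sum(base[:i]) - mean) > eps * mean:
            bad += 1
    return bad / trials

if __name__ == "__main__":
    n, k, i, eps = 200, 80, 100, 0.25
    bound = (n - i) * (n - k) / (i * k * (n - 1)) / eps**2
    print("empirical:", deviation_proportion(n, k, i, eps), "bound:", bound)
```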

Now the asymptotics come into play. Assume that $n\to+\infty$, that $k/n\to p$ for some $p$ in $(0,1)$, and that $i/n\to1/2$. Then $(n-i)(n-k)\sim\frac12n^2(1-p)$ while $ik(n-1)\sim\frac12n^3p$, hence $nA(n,k,i)\to(1-p)/p$ and in particular $A(n,k,i)\to0$. This proves the desired result, for every positive $\varepsilon$.
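A few lines of arithmetic confirm the rate numerically (with $p=0.4$, so $(1-p)/p=1.5$):

```python
def A(n, k, i):
    """A(n,k,i) = (n-i)(n-k) / (i*k*(n-1)) from the Chebyshev bound above."""
    return (n - i) * (n - k) / (i * k * (n - 1))

p = 0.4
for n in (100, 1_000, 10_000, 100_000):
    k, i = int(p * n), n // 2
    print(n, n * A(n, k, i))  # should approach (1 - p) / p = 1.5
```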

  • Very nice and clean answer! Thank you! (2012-11-21)

Note that really you only care about the first half of the sequence -- if the count of ones in the first half differs from its mean by at most $\epsilon p N$, then the second half has the same discrepancy, since the total number of ones is fixed. The expected number of $1$'s in the first $N/2$ terms is equal to $Np/2$, which we'll denote by $\mu$. You want to bound the probability that $|\sum_{i=1}^{N/2} x_i - \mu| \geq \epsilon Np = 2 \epsilon \mu$.

In this form the situation looks a lot like the ones on which we can use the Chernoff bound (see in particular the "simplify it to a weaker bound" step mentioned on page 2). But there's a slight catch -- the $x_i$ in this case are not independent, due to the requirement that the whole sequence contain $pN$ $1$s.

However, the correlation in a sense goes in the "right direction": if the sum up to $k$ is unusually large, that makes it more likely that the next term will be $0$ (because an unusually large number of $0$s remains), which will tend to drive things back towards the mean. More formally, this is referred to as being "negatively associated". Intuitively, negative association should be a good thing for us -- it makes it harder to get a sum far away from the mean.

This turns out to be true -- the same Chernoff bounds that hold for independent variables also hold for any set of negatively associated variables (for more on this, see the beginning of the survey by Dubhashi and Ranjan). So we can just apply the Chernoff bounds directly, which immediately give a bound exponentially small in $N$ for fixed $p$ and $\epsilon$.
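As an illustration, here is a sketch comparing the empirical tail probability with one common weaker multiplicative form of the Chernoff bound, $\mathbb P(|X-\mu|\geq\delta\mu)\leq 2e^{-\delta^2\mu/3}$ for $0<\delta\leq1$ (here $\delta=2\epsilon$), which the negative-association result lets us apply to the fixed-count model. The code and its names are my own sketch, not from the answer:

```python
import math
import random

def tail_probability(N, p, eps, trials=20_000):
    """Empirical P(|sum of first N/2 entries - mu| >= 2*eps*mu) under the
    fixed-count model: exactly p*N ones, uniformly shuffled."""
    k, half = int(p * N), N // 2
    base = [1] * k + [0] * (N - k)
    mu = k * half / N
    hits = 0
    for _ in range(trials):
        random.shuffle(base)
        if abs(sum(base[:half]) - mu) >= 2 * eps * mu:
            hits += 1
    return hits / trials, mu

if __name__ == "__main__":
    p, eps = 0.3, 0.1
    for N in (100, 400, 1600):
        emp, mu = tail_probability(N, p, eps)
        chernoff = 2 * math.exp(-(2 * eps) ** 2 * mu / 3)  # weak multiplicative form
        print(N, emp, chernoff)
```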


Actually, it's not even the full strength of negative association which is needed here. When you have a sum of indicator variables, all you need is negative correlation: that for any $S$ and any $i \notin S$, we have $P(x_i=1 \mid x_j=1 \ \forall j \in S) \leq P(x_i=1)$. This is easily checked to hold for your model: conditioned on all $|S|$ coordinates in $S$ being $1$, the remaining $pN - |S|$ ones are spread over the $N - |S|$ other positions, so the left side equals $\frac{pN-|S|}{N-|S|} \leq \frac{pN}{N} = p$.
For the proof of Chernoff to go through, the important fact is that for positive $t$ we need $E(\prod e^{tx_i}) \leq \prod E( e^{t x_i})$. To see that this holds for negatively correlated variables, note that, since each $x_i \in \{0,1\}$, we have $e^{tx_i}=1+(e^t-1)x_i$, so: \begin{eqnarray*} E\left(\prod_{i=1}^N e^{tx_i}\right) &=& E\left(\prod_{i=1}^N (1+(e^t-1) x_i)\right) \\ &=& \sum_{S \subseteq \{1, \dots, N\}} (e^t-1)^{|S|} P(x_i=1 \textrm{ for all } i \in S) \quad \textrm{(expanding the product)} \\ &\leq& \sum_{S \subseteq \{1, \dots, N\}} (e^t-1)^{|S|} p^{|S|} \quad \textrm{(iterating negative correlation; note } e^t-1>0\textrm{)}\\ &=& \left(1+p(e^t-1)\right)^N \quad \textrm{(recombining using the binomial theorem)}\\ &=& \prod_{i=1}^N E(e^{tx_i}) \end{eqnarray*}
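Since the first-half count $S_m$ in the fixed-count model is hypergeometric, the left side $E(e^{tS_m})$ over the first $m$ coordinates can be computed exactly from its pmf, giving an exact numerical check of this inequality (a sketch of mine; the helper name is hypothetical):

```python
import math

def mgf_check(N, k, m, t):
    """Exact check of E(prod e^{t x_i}) <= prod E(e^{t x_i}) over the
    first m coordinates of the fixed-count model (exactly k ones in N).
    S_m is hypergeometric, so the left side is a finite sum over its pmf."""
    total = math.comb(N, k)
    lhs = sum(
        math.comb(m, j) * math.comb(N - m, k - j) / total * math.exp(t * j)
        for j in range(max(0, k - (N - m)), min(m, k) + 1)
    )
    p = k / N  # marginal P(x_i = 1)
    rhs = (1 + p * (math.exp(t) - 1)) ** m
    return lhs, rhs

if __name__ == "__main__":
    for t in (0.1, 0.5, 1.0):
        lhs, rhs = mgf_check(N=100, k=40, m=50, t=t)
        print(t, lhs <= rhs, lhs, rhs)
```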

  • Very nice.. Thank you! (2012-11-22)