
My background is not really in mathematics, so please bear with me!

Also: I'm fairly sure this has been studied in the literature, but I have no idea how to find it, or even what it's called. It shouldn't be too far from Gaussian binomial coefficients, I guess...

Problem statement:

I have a few computer nodes in a network that share information between them.

Every few seconds, a new round starts: each node randomly selects another node in the network and asks it for what it knows (the information is pulled from that node).

I am interested in how much time it takes for a single piece of information to travel through the whole network. This could be an estimate of the average number of rounds, the 95th percentile, etc.

A few assumptions:

  1. For simplicity, I assume a synchronous model: everyone starts the round at exactly the same time, and can only pull information acquired from previous rounds.
  2. Nodes do not forget! Once they pull the information, they will keep it.
  3. The set of nodes doesn't change, and every node can contact every other node with equal probability.
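Under these assumptions the process is easy to simulate. A minimal Monte Carlo sketch in Python (the function name and interface are my own, not part of the problem statement):

```python
import random

def rounds_to_full_knowledge(N, k0, rng=random):
    """Simulate one run of the synchronous pull-gossip process.

    Each round, every clueless node picks one of the other N-1 nodes
    uniformly at random and learns the information iff the chosen node
    already knew it at the start of the round (assumption 1 above).
    Returns the number of rounds until all N nodes know.
    """
    informed = set(range(k0))  # nodes 0..k0-1 start out knowledgeable
    rounds = 0
    while len(informed) < N:
        newly = set()
        for node in range(N):
            if node in informed:
                continue
            target = rng.randrange(N - 1)
            if target >= node:  # shift to skip the node itself
                target += 1
            if target in informed:
                newly.add(node)
        informed |= newly  # updates only take effect after the round ends
        rounds += 1
    return rounds

# Rough estimate of the expected number of rounds for N=10, k0=1:
samples = [rounds_to_full_knowledge(10, 1) for _ in range(2000)]
print(sum(samples) / len(samples))
```

Averaging many runs gives an empirical estimate of the quantities of interest (mean, 95th percentile, ...), which is handy for sanity-checking any closed-form attempt.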

Notation:

  • $N$ is the total number of nodes in the network
  • $N-1$ is the number of nodes that a given node can talk to in a given round.
  • $k_0$ is the number of nodes that initially know about the piece of information.
  • $k_i$ is the number of nodes that know about the piece of information after $i$ rounds.
  • $N-k_i$ is the number of nodes that remain clueless about the piece of information after $i$ rounds.

Rationale:

Things start off quite easily: we can see the problem from the point of view of the unknowledgeable nodes, each randomly choosing a node to pull information from.

  • $P($choosing a knowledgeable node$)=\frac{k_i}{N-1}$
  • $P($choosing a clueless node$)=\frac{N-k_i-1}{N-1}$

Then I started checking the probabilities for increasing numbers of rounds, and it resembles a binomial distribution:

Probability of everyone knowing after 1 round:

$\left(\frac{k_0}{N-1}\right)^{N-k_0}$

Probability of everyone knowing after 2 rounds:

$\sum_{k_1=k_0}^{N}\left[\left(\frac{k_0}{N-1}\right)^{k_1-k_0}\left(\frac{N-k_0-1}{N-1}\right)^{N-k_1}\right]$

Generalizing -- for $r$ rounds:

$\sum_{k_1=k_0}^{N} \left[\left(\frac{k_0}{N-1}\right)^{k_1-k_0}\left(\frac{N-k_0-1}{N-1}\right)^{N-k_1}\sum_{k_2=k_1}^{N} \left[\left(\frac{k_1}{N-1}\right)^{k_2-k_1}\left(\frac{N-k_1-1}{N-1}\right)^{N-k_2}\cdots\sum_{k_{i}=k_{i-1}}^{N}\left[\left(\frac{k_{i-1}}{N-1}\right)^{k_i-k_{i-1}}\left(\frac{N-k_{i-1}-1}{N-1}\right)^{N-k_{i}}\cdots\sum_{k_{r-1}=k_{r-2}}^{N} \left[ \left(\frac{k_{r-2}}{N-1}\right)^{k_{r-1}-k_{r-2}}\left(\frac{N-k_{r-2}-1}{N-1}\right)^{N-k_{r-1}} \left(\frac{k_{r-1}}{N-1}\right)^{N-k_{r-1}} \right]\cdots \right]\cdots \right] \right]$

Simplifying, you get:

$\sum_{k_1=k_0}^{N}\sum_{k_2=k_1}^{N}\sum_{k_3=k_2}^{N}\cdots\sum_{k_{r-1}=k_{r-2}}^{N} \left(\prod_{i=1}^{r-1}\left[\left(\frac{k_{i-1}}{N-1}\right)^{k_{i}-k_{i-1}}\left(\frac{N-k_{i-1}-1}{N-1}\right)^{N-k_{i}}\right]\right) \left(\frac{k_{r-1}}{N-1}\right)^{N-k_{r-1}}$

I would then like to be able to answer questions like: what is the probability that everyone knows in fewer than $r$ rounds?

Questions

  • Any alternate way to tackle the problem that makes things easier?
  • Any useful identities/etc that I should know about? Currently it looks intractable to me!
  • Is there any theory behind this? It seems generic enough that it must have been studied many times before, appearing in all sorts of problems. I would really like to know the name!
  • I've been told this is a hypergeometric distribution rather than a binomial, but I would argue it's still not quite either! Happy to be proven wrong, though.
  • Just a clarification: I am not sure I understand how information is handled. What throws me off is "can only pull information acquired from previous rounds". My understanding is that at the beginning only some nodes "know". When a node that does not "know" pulls the information, it learns it and "knows" it from then on (namely, it does not forget, and the information does not disappear if another node pulls it). Am I correct? (2017-02-20)
  • May I suggest "network" and "network flow" tags? (2017-02-20)
  • That is correct! I'm also assuming nodes don't forget :-) I'll also add those tags! Thank you! (2017-02-20)
  • Re-reading your comment, what I meant is that if a node $n_1$ learns the information on round $r_i$, and $n_2$ pulls from $n_1$ on round $r_i$, it won't learn the information -- someone learning from $n_1$ needs to pull in a round later than $r_i$... I don't think I made it clearer here :P sorry! (2017-02-20)

1 Answer

If I'm understanding the problem setup correctly, it goes like this: at each time, $k$ computers know the information. The other $N-k$ computers each query a computer at random, and if they happen to query a computer that knows the information, they come to know it too. Two computers can query the same computer, and a computer can be queried more than once over the course of the process. That means that the new value of $k$, call it $k'$, is distributed as $k+\text{Bin}(N-k,k/(N-1))$. (The $N-1$ is because a computer will never try to communicate with itself.) After that the process is repeated, and the only thing that matters anymore is $k'$.
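This one-round transition can be sketched in Python (the function name is mine, introduced for illustration):

```python
import random

def pull_round(N, k, rng=random):
    """One synchronous round: each of the N - k clueless computers queries
    one of its N - 1 peers uniformly at random, so it learns with
    probability k / (N - 1).

    Returns k', distributed as k + Binomial(N - k, k / (N - 1)).
    """
    p = k / (N - 1)
    newly_informed = sum(1 for _ in range(N - k) if rng.random() < p)
    return k + newly_informed
```

Iterating `pull_round` until it returns `N` simulates one full run of the process.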

(Let me know in a comment if I misunderstood the problem, and I'll edit the answer.)

The behavior of the process depends only on the present state, so this is a Markov chain. To avoid uninteresting technicalities let's assume that the state space is $\{ 1,2,\dots,N \}$ (i.e. we exclude $0$, even though technically the underlying process makes sense when started at $0$). It has a single absorbing state, namely $N$, and the time to reach this state, call it $\tau$, has some distribution. Let's say we want to find the actual distribution. If we denote the transition probability matrix by $P$ and its elements by $p_{ij}$, we have from the total probability formula:

$$P(\tau=t \mid k_0=i)=\sum_{j=1}^N P(\tau=t-1 \mid k_0=j)\, p_{ij} \qquad (i \neq N).$$

We then adjoin the boundary condition $P(\tau=0 \mid k_0=i)=\begin{cases} 1 & i=N \\ 0 & \text{otherwise} \end{cases}$. In the process we get a recurrence relation that can be used to find $P(\tau=t \mid k_0=i)$ for any $t,i$. Specifically, if we introduce the vector $v=\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}$ to represent the boundary condition, then, because $N$ is absorbing, $(P^t v)_i=P(k_t=N \mid k_0=i)=P(\tau \le t \mid k_0=i)$, and the probability mass function follows by differencing: $P(\tau=t \mid k_0=i)=(P^t v)_i-(P^{t-1} v)_i$.
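As a numerical sketch of this computation in Python/NumPy (function names are mine; the binomial transition probabilities come from $k' = k + \text{Bin}(N-k, k/(N-1))$ as above): since $N$ is absorbing, $(P^t v)_i$ equals the probability of having been absorbed within $t$ steps, $P(\tau \le t \mid k_0=i)$, and the pmf of $\tau$ follows by differencing consecutive powers.

```python
import numpy as np
from math import comb

def transition_matrix(N):
    """P[i-1, j-1] = P(k' = j | k = i), where k' = i + Bin(N - i, i/(N - 1))."""
    P = np.zeros((N, N))
    for i in range(1, N + 1):
        p = i / (N - 1)
        for m in range(N - i + 1):  # m = number of newly informed nodes
            P[i - 1, i + m - 1] = comb(N - i, m) * p**m * (1 - p)**(N - i - m)
    return P

def tau_cdf(N, t):
    """Vector of P(tau <= t | k_0 = i) for i = 1..N, computed as P^t v."""
    v = np.zeros(N)
    v[-1] = 1.0  # boundary condition: already absorbed iff k_0 = N
    return np.linalg.matrix_power(transition_matrix(N), t) @ v

def tau_pmf(N, t):
    """Vector of P(tau = t | k_0 = i), by differencing the CDF."""
    return tau_cdf(N, t) - (tau_cdf(N, t - 1) if t >= 1 else np.zeros(N))
```

For example, with $N=3$ and $k_0=1$ both clueless nodes must pick the informed node in round one, so `tau_pmf(3, 1)[0]` is $(1/2)^2 = 0.25$.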

The expected value of this distribution can be found more easily than this. Define $P'$ to be $P-I$, except that the $N$th row is replaced with $\begin{bmatrix} 0 & 0 & \dots & 1 \end{bmatrix}$, and define $b$ to be the vector $\begin{bmatrix} -1 \\ -1 \\ \vdots \\ -1 \\ 0 \end{bmatrix}$. Then the function $u(x)=E[\tau \mid k_0=x]$ satisfies the linear system

$$P'u=b.$$

That last thing can be easily done with the following Matlab/Octave code:

N=10; % for example
b=-ones(N,1);
b(end)=0;            % boundary condition: E[tau | k_0 = N] = 0
P=zeros(N,N);
for i=1:(N-1)
  % row i: k' = i + Bin(N-i, i/(N-1)); binopdf is in MATLAB's Statistics
  % Toolbox (in Octave, load the statistics package first)
  P(i,i:end)=binopdf(0:(N-i),N-i,i/(N-1));
  P(i,i)=P(i,i)-1;   % subtract the identity, giving P' = P - I
end
P(end,end)=1;        % replace the Nth row, per the definition of P'
u=P\b;               % u(i) = E[tau | k_0 = i]
  • Thank you so much for all the patience and time dedicated! Sorry, it's taking me quite a long time to go through all the information :) This was the first time I heard about absorbing Markov chains, and it takes time to sink in. My initial formulation had three parameters ($N$, $k_0$, $r$) -- this gives a solution considering arbitrary $r$ and $k_0$, which is awesome. (2017-02-22)
  • A tiny mistake: the new value distribution should be $k+\text{Bin}(N-k,\,k/(N-1))$, as a node cannot communicate with itself. (2017-02-22)
  • @JoãoNeto Ah, yes, that makes sense; I'll edit that in. (2017-02-22)