2

I have an array with $N$ elements. I want to sample elements pseudo-randomly in the array in a controllable fashion. For example, I would like to sample the elements such that

element $x$ is sampled $K$ times more often than element $N$, where

$N$ = Total number of elements
$T$ = Some constant, e.g. 3
$K = 1 + T(1 - x/N)$

In this example, the first element is sampled $\approx 4$ times more than the last one, and the penultimate element is sampled $1 + T(1 - (N-1)/N) \approx 1$ time(s) more than the last one.


5 Answers

3

General information on pseudo-random number sampling is in this Wikipedia article. For your specific case, we have

$p(X=x)=\frac{1 + T(1 - x/N)}{M}$

with normalization constant

$M=\sum_{x'=1}^{N}\left(1 + T(1 - x'/N)\right)\;.$

We can calculate

\begin{eqnarray} \sigma_x &:=& \sum_{x'=1}^{x}\left(1 + T(1 - x'/N)\right) \\ &=& (1+T)x-\frac TN \frac{x(x+1)}{2}\;, \end{eqnarray}

with $M=\sigma_N=(1+T)N-T(N+1)/2=N+T(N-1)/2$. Thus the cumulative distribution function is

\begin{eqnarray} p(X\le x) &=& \sum_{x'=1}^xp(X=x') \\ &=& \frac{\sigma_x}{M} \\ &=& \frac{(1+T)x-\frac TN \frac{x(x+1)}{2}}{N+T(N-1)/2} \end{eqnarray}

To generate a pseudo-random number according to this distribution, you can proceed as in Searke's general answer: Generate a pseudo-random number $r$ uniformly distributed over $[0,1]$ and choose the least value $x$ for which $r\le p(X\le x)$. To operationalize this, you have three options; which one to choose depends on your priorities (complexity, time, space).

The easiest is to just iterate over $x$ from $1$ to $N$ until $r\le p(X\le x)$. This is of course rather inefficient for large $N$.
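For illustration, here is a minimal Java sketch of this scan, using the distribution function derived above (the method and variable names are mine, not from any library):

    // Sketch of the linear scan, assuming the CDF derived above (names are illustrative).
    static double cdf(int x, int N, double T) {
        double M = N + T * (N - 1) / 2.0;                        // normalization constant
        return ((1 + T) * x - (T / N) * x * (x + 1) / 2.0) / M;  // p(X <= x)
    }

    static int sampleLinearScan(int N, double T, java.util.Random rng) {
        double r = rng.nextDouble();
        int x = 1;
        while (x < N && r > cdf(x, N, T)) x++;  // least x with r <= p(X <= x)
        return x;
    }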

A first optimization would be to perform a binary search for $r$ in an array of the $p(X\le x)$. That reduces the time complexity from linear in $N$ to logarithmic in $N$, and it has the advantage that you can do it for any distribution function without having to be able to solve for $x$.
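A possible sketch, precomputing the array once and then binary-searching it for each draw (again, the names are illustrative):

    // Binary search over a precomputed CDF array (illustrative sketch).
    static double[] buildCdf(int N, double T) {
        double M = N + T * (N - 1) / 2.0;
        double[] cdf = new double[N];
        double running = 0;
        for (int x = 1; x <= N; x++) {
            running += 1 + T * (1 - (double) x / N);  // unnormalized weight of element x
            cdf[x - 1] = running / M;
        }
        return cdf;
    }

    static int sampleBinarySearch(double[] cdf, java.util.Random rng) {
        double r = rng.nextDouble();
        int lo = 0, hi = cdf.length - 1;
        while (lo < hi) {                     // find the least index with r <= cdf[index]
            int mid = (lo + hi) >>> 1;
            if (r <= cdf[mid]) hi = mid; else lo = mid + 1;
        }
        return lo + 1;                        // elements are numbered 1..N
    }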

In the present case, however, the equation $r=p(X\le x)$ is quadratic in $x$ and thus can be solved for $x$. So if $N$ is sufficiently large, the most efficient way to proceed is to solve $r=p(X\le x)$ for $x$ and then choose the least integer greater than or equal to the solution.
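Concretely, writing $a = T/2N$ and $b = (1+T) - T/2N$, the equation becomes $rM = bx - ax^2$, and since the distribution function is increasing on $[1,N]$ the relevant root is the smaller one, $x = \bigl(b - \sqrt{b^2 - 4arM}\bigr)/2a$. A minimal Java sketch (names are illustrative, and $T>0$ is assumed):

    // Closed-form inversion of the quadratic CDF (illustrative sketch, requires T > 0).
    static int sampleClosedForm(int N, double T, java.util.Random rng) {
        double r = rng.nextDouble();
        double M = N + T * (N - 1) / 2.0;
        double a = T / (2.0 * N);
        double b = (1 + T) - a;
        double x = (b - Math.sqrt(b * b - 4 * a * r * M)) / (2 * a);  // smaller root
        int k = (int) Math.ceil(x);
        return Math.min(Math.max(k, 1), N);  // clamp rounding artifacts at the endpoints
    }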

1

If you want to simulate a discrete random variable (r.v.) $X$ taking the values $x_1, x_2, \ldots, x_n$ with probabilities $p_1, p_2, \ldots, p_n$, $\sum_{i=1}^np_i=1$, then you must define an array (vector) $F$ of elements $F_1=p_1$, $F_2=p_1+p_2$, $\ldots$, $F_k=F_{k-1}+p_k$, $\ldots$, $F_n=1.0$. Generating a pseudorandom number $u\in[0,1)$ as an observation on a r.v. $U\sim \mathrm{Unif}[0,1)$, the probability that $u$ belongs to an interval $I_k=[F_{k-1}, F_k)$ is equal to the length of that interval, i.e. $F_k-F_{k-1}=p_k$. Thus the event $(U\in[F_{k-1}, F_k))$ has the same probability, $p_k$, as the event $(X=x_k)$, and this allows us to return the value $x_k$ as an observation on $X$ whenever $u\in I_k$.

If you implement this algorithm in C, Python or Java, then the entries of the arrays $x, p, F$, are indexed starting with 0, i.e. $x=[x_0, \ldots, x_{n-1}]$.
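For example, a minimal 0-indexed sketch of this table method in Java (the names are mine; the bound on $k$ only guards against floating-point round-off in $F_{n-1}$):

    // Table (inverse-CDF) method: build F once, then locate u in its intervals.
    static double sampleFromTable(double[] x, double[] p, java.util.Random rng) {
        int n = p.length;
        double[] F = new double[n];
        double running = 0;
        for (int k = 0; k < n; k++) {
            running += p[k];
            F[k] = running;                    // F[k] = p[0] + ... + p[k]
        }
        double u = rng.nextDouble();           // u in [0, 1)
        int k = 0;
        while (k < n - 1 && u >= F[k]) k++;    // u in [F[k-1], F[k]) selects x[k]
        return x[k];
    }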

1

Sampling from a discrete pdf whose values $x_1,x_2,\ldots,x_n$ have probabilities $p_1,p_2,\ldots,p_n$ can be done with this simple Matlab function, which is the discrete version of the Inverse Transform method for generating random numbers from a given distribution.

    function val = DiscrInvTrans1(x,p)
        cdf = p(1);   % cdf is a scalar
        u = rand;
        k = 1;
        while u > cdf
            k = k+1;
            cdf = cdf + p(k);
        end
        val = x(k);

This builds the scalar cdf as it needs it. However, this is for just one value. If it is to be used for a large sample of size $n_s$, then the following function is better, where the cdf is calculated once by Matlab's cumsum function:

    function S = DiscrInvTransNs(x,p,ns)
        cdf = cumsum(p);  % cdf is a vector
        S = zeros(1,ns);  % preallocate the sample vector
        for i = 1:ns
            u = rand;
            k = 1;
            while u > cdf(k)
                k = k+1;
            end
            S(i) = x(k);
        end

The number of times the while statement is executed is a random variable $N_w$ whose expectation is $E(N_w) = \sum_{k=1}^n kp_k$. This is the same for both functions. If the pdf is uniform, then $p_k = 1/n$ for all $k$, and so $E(N_w) = \sum_{k=1}^n k/n = \frac{1}{n}\cdot\frac{n(n+1)}{2} = (n+1)/2$. For a general pdf we have no nice closed-form expression for $E(N_w)$, but this assertion seems to be true:

The value of $E(N_w) = \sum_{k=1}^n kp_k$ with $\sum_{k=1}^np_k = 1,\,$ is minimized if $p_1 \ge p_2 \ge \cdots \ge p_n$, and maximized if $p_1 \le p_2 \le \cdots \le p_n$.
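As a quick numerical illustration (not a proof): with $n=3$, the nonincreasing pdf $(1/2,\,1/4,\,1/4)$ gives $E(N_w) = 1\cdot\frac12 + 2\cdot\frac14 + 3\cdot\frac14 = 7/4$, while the reversed, nondecreasing ordering $(1/4,\,1/4,\,1/2)$ gives $1\cdot\frac14 + 2\cdot\frac14 + 3\cdot\frac12 = 9/4$.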

I'm not a mathematician and have not been able to prove this assertion except for $n=2$. I've tested it in Matlab on random pdfs and it has always been true.

Can anyone give a proof, or disprove it with a counter-example?

I have asked this question elsewhere on this site (without the algorithmic preamble) so I hope this is not considered a double post.

Update -- The assertion is true. The proof is here: Minimize Expected Value

0

Here's a very simple, though not great, general solution to the problem.

Suppose you are given a list of $N$ elements called elem, and a list of their respective frequencies called freq.

First, normalize the list of frequencies so that it sums to 1 by dividing each entry by the total. Then partition the range of real numbers from 0 to 1 so that each frequency is assigned a portion of the range proportional to its size.

Have a uniform random number generator generate a number from 0 to 1. This number can then be interpreted as one of the N elements in elem based on freq's partitioning of the range of numbers between 0 and 1.

For example, let's say we had three elements with relative frequencies 2, 1, and 1.

Normalizing the list, the elements have the respective probabilities 1/2, 1/4, 1/4. Therefore, if the random number generator gives a number less than 1/2 we return the first value; if it is less than 3/4 we return the second; otherwise we return the third.
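A sketch of this in Java (the method name is mine, and scaling $u$ by the total replaces the explicit normalization step):

    // Partition [0, total) in proportion to the raw frequencies (illustrative sketch).
    static int sampleByFrequency(double[] freq, java.util.Random rng) {
        double total = 0;
        for (double f : freq) total += f;       // normalizing constant
        double u = rng.nextDouble() * total;    // uniform draw over [0, total)
        double running = 0;
        for (int i = 0; i < freq.length; i++) {
            running += freq[i];
            if (u < running) return i;          // u fell in element i's share
        }
        return freq.length - 1;                 // guard against round-off
    }

    // E.g. sampleByFrequency(new double[]{2, 1, 1}, rng) returns index 0 about half the time.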

This is of course a terribly obvious general answer. If you have a specific discrete distribution, then there is often a better way of doing this.

  • 0
    Wouldn't this result in a lookup that was not constant? To find the correct element, we would have to binary search the probability array, right? It's a good first solution though. 2011-07-01
0

So, in summary, you want to generate a random discrete variable $Z$ taking values in $[1 .. N]$ whose probability function is linear: $P(Z=z) = a z + b$.

One possible recipe:

Let $Z = \lceil N \times Y \; \rceil$, where the ceiling operation rounds up to an integer, and where $Y$ is a continuous variable taking values in $[0,1]$ with a linear probability density. Such a $Y$ can be generated by throwing a uniform $X$ and taking $Y = X$ with probability $p$ and $Y = \sqrt{X}$ with probability $1-p$. This gives a mixture of a uniform density and a triangular density $2y$, and then:

$P(Z=z)=\int_{(z-1)/N}^{z/N}\bigl(p + 2(1-p)y\bigr)\,dy = \frac{p}{N} + \frac{(2z-1)(1-p)}{N^2}$

You then just adjust the parameter $p$ to fit your need. Note that this probability increases with $z$; to favor small indices, as the question asks, return $N+1-Z$ in place of $Z$ (equivalently, take $Y = 1-\sqrt{X}$ in the triangular branch).

E.g., in Java:

    public static int discreteLinear(int N, double p) {
        double x = random.nextDouble();                         // uniform in [0, 1)
        double y = random.nextDouble() < p ? x : Math.sqrt(x);  // uniform w.p. p, triangular w.p. 1-p
        return Math.max(1, (int) Math.ceil(y * N));             // guard the x == 0 edge case
    }