# Random Sampling

Apr 17, 2013

Given a set of items, choosing random one with equal probability can be done by `random.choice(items)`

.
But what if we are given items one by one without knowing the length of the whole set?

There is a simple algorithm: For the $k$-th item, we give up previous selection and choose this one with probability $\frac 1 k$. Proof as follows: If we select the $k$-th item, it means that we choose it with probability $\frac 1 k$ and give up all successors. Product of probabilities of all these choices is:

\[p_k = \frac 1 k \times \frac {k} {k+1} \times \cdots \times \frac {n-2}{n-1} \times \frac {n-1} n = \frac 1 n\]Codes in Python:

```
import random
def choice(items):
selection = None
for i, item in enumerate(items):
if random.randint(0, i) == 0:
selection = item
return selection
```

We can generalize this algorithm. For a set of items, each one is associated with a weight $w_i$. Our goal is to select a random one based on its weight ratio. The algorithm is similar to the previous one: For the $k$-th item, replace the current selection with it with probability $\frac {w_k} {\sum_{i=1}^k w_i}$. Proof is analogical:

\[p_k = \frac {w_k} {\sum_{i=1}^k w_i} \times \frac {\sum_{i=1}^k w_i} {\sum_{i=1}^{k+1} w_i} \times \cdots \times \frac {\sum_{i=1}^{n-2} w_i} {\sum_{i=1}^{n-1} w_i} \times \frac {\sum_{i=1}^{n-1} w_i} {\sum_{i=1}^{n} w_i} = \frac {w_k} {\sum_{i=1}^n w_i}\]Codes in Python:

```
def weightedChoice(items, weights):
selection = None
total_weight = 0.0
for item, weight in zip(items, weights):
total_weight += weight
if random.random() * total_weight < weight:
selection = item
return selection
```

There is another amazing method to do this weighted random sampling:
For each item, get a random $r_i \in [0, 1]$,
and *reweight* this item as $w’_i = r_i^{1 / {w_i}}$.
Then we can select the one with the top new weight. Proof of this is a bit annoying.

If we choose the $i$-th item at last, this means $\forall j\neq i, w’_j <w’_i$. As $r_i \in [0, 1]$, the probability is:

\[p_i = \int_0^1 p(\forall j\neq i, w'_j <w'_i)\ \mathrm{d}\ r_i\]$r_j$ is independent with each other. So:

\[p_i = \int_0^1 \prod_{j \neq i} p(w'_j <w'_i)\ \mathrm{d} \ r_i\]As $r_j \in [0, 1]$, the inner probability can be simplified as:

\[p(w'_j <w'_i) = p(r_j^{1 / w_j} < r_i^{1 / w_i}) = p(r_j < r_i^{w_j / w_i}) = r_i^{w_j / w_i}\]So:

\[p_i = \int_0^1 \prod_{j \neq i} r_i^{w_j / w_i}\ \mathrm{d}\ r_i = \int_0^1 r_i^{(w-w_i) / w_i}\ \mathrm{d}\ r_i = \frac {w_i} w\]Same as our expectation.

This blog is basically a summary of these three blogs. Refer to them for more content.