Random Sampling

Given a set of items, choosing random one with equal probability can be done by random.choice(items). But what if we are given items one by one without knowing the length of the whole set?

There is a simple algorithm: For the $k$-th item, we give up previous selection and choose this one with probability $\frac 1 k$. Proof as follows: If we select the $k$-th item, it means that we choose it with probability $\frac 1 k$ and give up all successors. Product of probabilities of all these choices is:

$$ p_k = \frac 1 k \times \frac {k} {k+1} \times \cdots \times \frac {n-2}{n-1} \times \frac {n-1} n = \frac 1 n $$

Codes in Python:

import random

def choice(items):
    selection = None
    for i, item in enumerate(items):
        if random.randint(0, i) == 0:
            selection = item
    return selection

We can generalize this algorithm. For a set of items, each one is associated with a weight $w_i$. Our goal is to select a random one based on its weight ratio. The algorithm is similar to the previous one: For the $k$-th item, replace the current selection with it with probability $\frac {w_k} {\sum_{i=1}^k w_i}$. Proof is analogical:

$$p_k = \frac {w_k} {\sum_{i=1}^k w_i} \times \frac {\sum_{i=1}^k w_i} {\sum_{i=1}^{k+1} w_i} \times \cdots \times \frac {\sum_{i=1}^{n-2} w_i} {\sum_{i=1}^{n-1} w_i} \times \frac {\sum_{i=1}^{n-1} w_i} {\sum_{i=1}^{n} w_i} = \frac {w_k} {\sum_{i=1}^n w_i}$$

Codes in Python:

def weightedChoice(items, weights):
    selection = None
    total_weight = 0.0
    for item, weight in zip(items, weights):
        total_weight += weight
        if random.random() * total_weight < weight:
            selection = item
    return selection

There is another amazing method to do this weighted random sampling: For each item, get a random $r_i \in [0, 1]$, and reweight this item as $w’_i = r_i^{1 / {w_i}}$. Then we can select the one with the top new weight. Proof of this is a bit annoying.

If we choose the $i$-th item at last, this means $\forall j\neq i, w’_j <w’_i$. As $r_i \in [0, 1]$, the probability is:

$$ p_i = \int_0^1 p(\forall j\neq i, w’_j <w’_i)\ \mathrm{d}\ r_i $$

$r_j$ is independent with each other. So:

$$ p_i = \int_0^1 \prod_{j \neq i} p(w’_j <w’_i)\ \mathrm{d} \ r_i $$

As $r_j \in [0, 1]$, the inner probability can be simplified as:

$$ p(w’_j <w’_i) = p(r_j^{1 / w_j} < r_i^{1 / w_i}) = p(r_j < r_i^{w_j / w_i}) = r_i^{w_j / w_i} $$

So:

$$ p_i = \int_0^1 \prod_{j \neq i} r_i^{w_j / w_i}\ \mathrm{d}\ r_i = \int_0^1 r_i^{(w-w_i) / w_i}\ \mathrm{d}\ r_i = \frac {w_i} w $$

Same as our expectation.

This blog is basically a summary of these three blogs. Refer to them for more content.