Sieving Primes

Sat, Jan 3, 2009 code, math, laziness, primes, sieve

Here’s an interesting lazy variant of the sieve of Eratosthenes by Nick Chapman, from a discussion on LTU:

    primes = 2 : diff [3..] (composites primes)

    composites (p:ps) = cs
        where cs = (p*p) : merge (map (p*) (merge ps cs)) (composites ps)

    merge (x:xs) (y:ys) | xy           = diff (x:xs) ys

In words, the algorithm computes the list of primes as all numbers which are not composites, and the computes the list of composites by extremely constructing and merging all prime factorizations.

As the discussion shows, this is also a great example of why laziness hasn’t caught on: no one could figure out how fast it was.

To see that the algorithm computes each prime $p$ after finite time, note that all recursive calls are “upwards” towards higher prime factors. If $q^2 \gt p$, $composites (q:...)$ won’t be expanded beyond the first term $q^2$, and therefore for $r \gt q$ $composites (r:...)$ will never be called.

Only $O(n)$ multiplications are used (one for each composite), so the cost is the construction and merging of the partial lists of composite numbers. The total runtime complexity of the algorithm is equal to the total number of terms expanded out of the following lists:

composites (p:ps) for each prime $p$.
Each merge ps cs.
Each map (p*) (merge ps cs).

Say we evaluate all primes less than some integer $n$. The cost of (2) and (3) are at most a constant factor greater than the corresponding (1), so we need to compute the number of terms computed from each composites (p:...). At most one number $\gt n$ will appear in each, so we can restrict considering to numbers $\lt n$. Since a given composite $k \lt n$ appears in composites (p:...) iff all primes dividing $k$ are $\ge p$, the total cost due to $k$ is roughly $\pi(\operatorname{lpf}(k))$ where $\operatorname{lpf}(k)$ is the least prime factor of $k$.

Thus the complexity of the algorithm is given by the partial sums of $\pi(\operatorname{lpf}(\cdot))$ over the composite numbers. That is,

$$C(n) = \sum_{n \gt k \notin \mathbb{P}} \pi(\operatorname{lpf}(k))$$

I can’t find much information about the $\operatorname{lpf}$ function online, and I didn’t manage to estimate it analytically. Numerically, it looks like $C(n)$ is between $O(n \log^2 n)$ and $O(n \log^3 n)$.

So, not $O(n)$, but $\tilde{O}(n)$ is pretty good, and it’s cool that you can do it with lists alone.

Update: my complexity estimate is wrong. Here’s the correct analysis.