lfoscari/mercator-sieve

Mercator sieve

The sieve consists of an array S of URL signatures stored in central memory.
The array is initially empty and is filled incrementally; its size is fixed at
some value n. On mass storage, instead, we keep a file Z containing the
signatures of all previously known URLs in sorted order, and an auxiliary file
A. Both files are initially empty.

Each time a URL u is added to the sieve, we add its signature h(u) to S and u
itself to A. The key point is what happens when S contains n signatures: in
this case, we perform a flush as follows:

1. We sort S indirectly. That is, we stably sort a vector V of length n that
contains the numbers in [0..n), using S[i] as the key. At this point V[i]
contains the index in S of the signature of rank i (i.e., the i-th signature
in sorted order), and so the signatures S[V[i]] appear in order as i grows.

2. Using this property, we deduplicate S: that is, we mark as useless all
duplicate signatures, keeping only the first occurrence of each. Note that we
are exploiting the fact that the sorting algorithm is stable (otherwise we
might mark as representative an occurrence which is not the first one).

3. Now we merge Z with the signatures in S that were not marked as duplicates,
writing the result to a new file Z'. We can do this in linear time, scanning Z
sequentially, because the S[V[i]]'s are sorted. During the merge, we mark the
signatures in S which are not duplicates and do not appear in Z.

4. Finally, we scan A and S in parallel, and for each signature marked in S we
output the corresponding URL in A. Note that the access to A is purely
sequential, since S is in the same order as A (and this is the reason why we
performed an indirect sort).

5. S and A are emptied, and Z is replaced with Z'.
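
The five steps above can be sketched in a few lines of Python. This is only an
illustrative, in-memory sketch under stated assumptions, not the repository's
implementation: the class name SieveSketch, the signature function (the first
8 bytes of SHA-256) and the output list are hypothetical, and plain Python
lists stand in for the files Z and A.

```python
import hashlib
from typing import Callable, List


def signature(url: str) -> int:
    """Hypothetical h(u): a 64-bit signature taken from SHA-256."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class SieveSketch:
    """In-memory illustration of the sieve; lists stand in for files Z and A."""

    def __init__(self, n: int, h: Callable[[str], int] = signature) -> None:
        self.n = n                    # fixed capacity of S
        self.h = h
        self.S: List[int] = []        # signatures kept in memory
        self.A: List[str] = []        # stand-in for the auxiliary file A
        self.Z: List[int] = []        # stand-in for the sorted file Z
        self.output: List[str] = []   # URLs emitted by flushes

    def add(self, url: str) -> None:
        self.S.append(self.h(url))
        self.A.append(url)
        if len(self.S) == self.n:
            self.flush()

    def flush(self) -> None:
        S, A, Z = self.S, self.A, self.Z
        # Step 1: indirect sort. sorted() is stable, so equal signatures
        # keep their insertion order inside V.
        V = sorted(range(len(S)), key=lambda i: S[i])
        # Step 2: keep only the first occurrence of each signature.
        keep = [False] * len(S)
        for rank, i in enumerate(V):
            keep[i] = rank == 0 or S[V[rank - 1]] != S[i]
        # Step 3: merge Z with the kept signatures into Z', unmarking
        # every signature that already appears in Z.
        new_Z: List[int] = []
        z = 0
        for i in V:
            if not keep[i]:
                continue
            while z < len(Z) and Z[z] < S[i]:
                new_Z.append(Z[z])
                z += 1
            if z < len(Z) and Z[z] == S[i]:
                keep[i] = False       # already known; Z[z] is copied later
            else:
                new_Z.append(S[i])
        new_Z.extend(Z[z:])
        # Step 4: scan A and S in parallel, in insertion order.
        for i in range(len(S)):
            if keep[i]:
                self.output.append(A[i])
        # Step 5: empty S and A, replace Z with Z'.
        self.S, self.A, self.Z = [], [], new_Z
```

In this sketch a partially filled S is not flushed automatically; calling
flush() by hand processes whatever is currently buffered.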

Note that at the end of a flush Z again contains the signatures of all URLs we
have ever seen. Moreover, the output consists of all and only the URLs whose
signature was not in Z before the flush, and thus (modulo collisions) all and
only the unknown URLs. Finally, the URLs in output are emitted exactly in the
order in which they first appear in A.
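
The guarantee above can also be written as a naive specification, against which
a flush can be checked on small inputs. The function name expected_output and
its parameters are hypothetical; the sketch uses an in-memory set and ignores
the sequential-access constraints that motivate the sieve.

```python
from typing import Callable, Iterable, List, Set


def expected_output(known: Set[int], batch: Iterable[str],
                    h: Callable[[str], int]) -> List[str]:
    """URLs a flush should emit, given the signatures already in Z."""
    seen = set(known)          # signatures in Z before the flush
    out: List[str] = []
    for url in batch:          # batch is the content of A, in order
        sig = h(url)
        if sig not in seen:    # first time this signature appears
            seen.add(sig)
            out.append(url)
    return out


# A deliberately weak signature (the URL length) makes collisions visible:
# "c" and "dd" are dropped although they were never seen before.
print(expected_output(set(), ["a", "bb", "c", "dd", "eee"], len))
# prints ['a', 'bb', 'eee']
```

With a weak signature such as len the "modulo collisions" caveat becomes
concrete: distinct URLs with equal signatures are treated as already known.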

Credits: Sebastiano Vigna @ Unimi