AirlineDog / elias-fano-compression

A Python script implementation of Elias-Fano compression

elias-fano-compression

Let's say we have a sorted list of positive integers that we want to compress:

[5, 8, 11, 20, 33]

This list has n = 5 integers, the largest being m = 33. We are going to use two bytearrays to compress those integers. The first one, L, contains the last l = ⌊lg(m/n)⌋ = ⌊lg(33/5)⌋ = 2 bits of every integer in the list and has a size of n⌊lg(m/n)⌋ = 5 · 2 = 10 bits.

L = 01 00 11 00 01 (the last two bits of 5, 8, 11, 20 and 33, in order)
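As a rough illustration of this step, the sketch below computes l and the low bits for the example list. The function name `lower_bits` is hypothetical and is not taken from the repository's script. The bits are kept in a text string instead of a packed bytearray, purely so they are easy to read, and the list is assumed to be sorted with m >= n.

```python
import math

def lower_bits(numbers):
    """Return (l, L), where L is a string holding the last l bits of every number."""
    n, m = len(numbers), numbers[-1]              # sorted input: the last item is the maximum
    l = max(0, math.floor(math.log2(m / n)))      # l = floor(lg(m/n)), assuming m >= n
    mask = (1 << l) - 1                           # keeps only the last l bits
    if l == 0:
        return l, ""                              # nothing to store in L
    L = "".join(format(x & mask, f"0{l}b") for x in numbers)
    return l, L

l, L = lower_bits([5, 8, 11, 20, 33])
print(l, L)  # 2 0100110001
```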

The second bytearray, U, will be used to represent the remaining bits of every integer. The number of those bits is u = ⌈lg m⌉ - l = ⌈lg 33⌉ - 2 = 6 - 2 = 4 bits.

Remaining bits: 5 → 0001, 8 → 0010, 11 → 0010, 20 → 0101, 33 → 1000

But we are not going to store them exactly this way. Instead of the first bits of each integer, we store their difference from the first bits of the previous integer (the first integer is compared with 0). Because the list is sorted, the difference b - a between two values a and b with a <= b is never negative and never larger than b, and therefore we save space.

Differences: 1 - 0 = 1, 2 - 1 = 1, 2 - 2 = 0, 5 - 2 = 3, 8 - 5 = 3

Those differences are stored in unary: a difference d is written as d zeros followed by a single 1 that acts as a separator. For example, the number 5 is represented as 000001.

U = 01 01 1 0001 0001
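The gap-and-unary step described above can be sketched as follows. This is a minimal illustration, not the repository's code: `upper_bits_unary` is a hypothetical name, and U is again built as a text string of 0s and 1s.

```python
def upper_bits_unary(numbers, l):
    """Encode the high parts of a sorted list as unary-coded differences."""
    pieces = []
    previous = 0                          # the first high part is compared with 0
    for x in numbers:
        high = x >> l                     # drop the last l bits
        gap = high - previous             # difference with the previous high part
        pieces.append("0" * gap + "1")    # gap zeros, then a 1 as separator
        previous = high
    return "".join(pieces)

print(upper_bits_unary([5, 8, 11, 20, 33], 2))  # 0101100010001
```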

There is also a different way to construct U. U always contains exactly n ones, as we need a 1 as a separator for every integer. The zeros add up to the sum of all the differences, which is the value of the first u bits of the maximum m, so the number of zeros is never greater than ⌊m/2^l⌋. In total U contains n + ⌊m/2^l⌋ bits, which is 5 + 8 = 13 bits in the example. We construct U by initializing every bit to 0. Then, for the integer at position i of the initial list (counting from zero), if the value of its first u bits is k, we set bit i + k to 1.
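A small sketch of this direct construction, under the same assumptions as before (a sorted list, positions counted from zero, bits shown as a string). The name `build_upper` is made up for the example.

```python
def build_upper(numbers, l):
    """Build U by setting bit i + k for the i-th number whose high part is k."""
    n, m = len(numbers), numbers[-1]
    U = [0] * (n + (m >> l))              # n ones plus floor(m / 2^l) zeros
    for i, x in enumerate(numbers):       # i counts from zero
        k = x >> l                        # value of the high bits
        U[i + k] = 1
    return "".join(map(str, U))

print(build_upper([5, 8, 11, 20, 33], 2))  # 0101100010001
```

For the example list the ones land at positions 1, 3, 4, 8 and 12, which is the same bit pattern that the unary encoding of the differences produces.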

The final representation of the initial list is the two bytearrays L and U. Now let's see how much space we saved with the compression:

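The initial list needs n⌈lg m⌉ bits if every integer is stored with the same number of bits as the maximum m
L needs n⌊lg(m/n)⌋ bits
U needs n + ⌊m/2^l⌋ bits

Dropping the floors and ceilings, we save about: n⌈lg m⌉ - n⌊lg(m/n)⌋ - n - ⌊m/2^l⌋ ≈ n*lg m - n*lg m + n*lg n - n - m/2^l = n*lg n - n - m/2^l bits. For the example list the exact saving is 30 - 10 - 13 = 7 bits.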
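To check the arithmetic on the example, the sketch below counts the bits of each part, taking the size of U to be n + ⌊m/2^l⌋ as stated in the construction of U. `sizes_in_bits` is a hypothetical helper written for this README, and it assumes a sorted list with m >= n.

```python
import math

def sizes_in_bits(numbers):
    """Return (plain, lower, upper) sizes in bits for a sorted list."""
    n, m = len(numbers), numbers[-1]
    l = max(0, math.floor(math.log2(m / n)))   # l = floor(lg(m/n))
    plain = n * math.ceil(math.log2(m))        # every integer stored with ceil(lg m) bits
    lower = n * l                              # size of L
    upper = n + (m >> l)                       # size of U: n ones plus floor(m / 2^l) zeros
    return plain, lower, upper

plain, lower, upper = sizes_in_bits([5, 8, 11, 20, 33])
print(plain, lower, upper, plain - lower - upper)  # 30 10 13 7
```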

This type of compression has the characteristic that we don't need to decompress all the data to find the value of one element. Specifically, when we want to extract an integer, we don't have to reproduce the whole list from U and L. For example, to find the integer at index i = 3 (counting from zero), we know from the construction of L that its last bits are located in positions [i*l, (i+1)*l) = [6, 8) of L, which hold 00. For the first bits we look at U and find the 1 with index 3 (again counting from zero), which is at position 8. If we subtract from this position the number of 1s before it, which is i = 3, we get the number of 0s up to this position. That number of zeros is the sum of the differences up to this integer, and so the first bits we are looking for are 8 - 3 = 5 => 0101. Joining the two parts gives 0101 00 = 010100 = 20, the integer at index 3.
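The lookup just described can be sketched as below. This is only an illustration under stated assumptions: `access` is a hypothetical name, L and U are passed as bit strings, and a plain linear scan over U stands in for the "find the i-th 1" step, which practical implementations usually speed up with an auxiliary index.

```python
def access(L, U, l, i):
    """Return the integer at index i (counting from zero) without decoding the whole list."""
    ones_seen = 0
    for position, bit in enumerate(U):
        if bit == "1":
            if ones_seen == i:
                high = position - i                       # zeros before this 1
                low = int(L[i * l:(i + 1) * l], 2) if l > 0 else 0
                return (high << l) | low                  # join the high and low parts
            ones_seen += 1
    raise IndexError("index out of range")

print(access("0100110001", "0101100010001", 2, 3))  # 20
```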
