brettwooldridge / SparseBitSet

An efficient sparse bit set implementation for Java

Efficient Stream-Write...

ravikumarg opened this issue · comments

Thanks for SparseBitSet… It's so handy and we are using it in a lot of places in our code.

There is just one issue I was hoping to get some help with:

ObjectOutputStream os = ...
for (int i = 0; i < 100_000; i++) {
    SparseBitSet sbs = new SparseBitSet(1_000_000);
    for (int j = 0; j < 1_000_000; j++) {
        sbs.set(...);
    }
    os.writeObject(sbs);
}
os.close();

The final size of the file after serializing the 100k sparse bit sets was very large. I then looked at the code for a single write.

Many longs are written for every SparseBitSet. Most of those longs are also very large values, so I cannot apply techniques like variable-length encoding to reduce the serialized size.

We could pass an LZO/GZIP output stream to compress it, but I want to keep that as a last option.
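For the compression fallback, something like the following would work; a minimal sketch assuming standard Java serialization and GZIP from java.util.zip, with the file name and counts only as examples:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.zip.GZIPOutputStream;

public class GzipWriteSketch
{
    //  Wraps the object stream in GZIP so every serialized SparseBitSet is
    //  deflate-compressed on the way to disk. Path and counts are examples.
    public static void writeAll(String path) throws IOException
    {
        try (ObjectOutputStream os = new ObjectOutputStream(
                new GZIPOutputStream(new FileOutputStream(path)))) {
            for (int i = 0; i < 100_000; i++) {
                SparseBitSet sbs = new SparseBitSet(1_000_000);
                // ... populate sbs ...
                os.writeObject(sbs);
            }
        }
    }
}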

Please let us know if the writeToStream code can be improved in any way so that the resulting serialization footprint is smaller.

@ravikumarg Thanks, but unfortunately I can't take any of the credit for creating SparseBitSet; I just happened to find a research implementation.

While there may be a more efficient serialization, it is definitely not possible to change the writeObject() implementation unless readObject() is also changed to provide backward compatibility with already-serialized SparseBitSet objects.
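For what it's worth, the usual shape of such a change is a version marker that readObject() can branch on, so streams written by older versions still deserialize. The sketch below is purely illustrative and is not SparseBitSet's actual code: FORMAT_V2, writeCompactFormat, readCompactFormat and readLegacyFormat are invented names, and it assumes the marker value can never occur as the first int of the old layout.

private void writeObject(ObjectOutputStream s) throws IOException
{
    s.writeInt(FORMAT_V2);   //  new streams begin with a version marker (hypothetical constant)
    writeCompactFormat(s);   //  hypothetical smaller encoding
}

private void readObject(ObjectInputStream s) throws IOException, ClassNotFoundException
{
    int first = s.readInt();
    if (first == FORMAT_V2) {
        readCompactFormat(s);        //  stream written by the new code
    } else {
        readLegacyFormat(first, s);  //  pre-change stream: 'first' is part of the old layout
    }
}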

You may instead want to look at alternatives to the standard Java serialization format, for example Kryo. They claim high compression rates, along with JVM deserialization compatibility:

                                   create     ser   deser   total   size  +dfl
java-built-in                          63    5838   30208   36046    889   514
kryo                                   63     655     838    1493    212   132
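For anyone trying it, Kryo usage is roughly as follows. This is a sketch, not a tested integration: it assumes Kryo's default field serializer can round-trip SparseBitSet's internal state (the class relies on custom writeObject/readObject, so that needs verifying; if it fails, a custom Serializer or Kryo's JavaSerializer would be needed), and the file name is only an example.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

public class KryoSketch
{
    public static void main(String[] args) throws Exception
    {
        Kryo kryo = new Kryo();
        kryo.register(SparseBitSet.class);  //  explicit registration keeps the per-object header small

        SparseBitSet sbs = new SparseBitSet();
        sbs.set(12345);

        //  Write with Kryo instead of ObjectOutputStream.
        try (Output output = new Output(new FileOutputStream("bitset.kryo"))) {
            kryo.writeObject(output, sbs);
        }

        //  Read it back and check the round trip.
        try (Input input = new Input(new FileInputStream("bitset.kryo"))) {
            SparseBitSet copy = kryo.readObject(input, SparseBitSet.class);
            System.out.println(copy.cardinality());
        }
    }
}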

Thanks for the pointer. I can definitely try out Kryo and see how it helps us.

Let us know how it works; it may help other users looking for a similar solution.

@brettwooldridge Initially I wrote a new method that avoids Java object serialization and just writes ints/longs into the passed OutputStream [like Hadoop RPC].

The code snippet is mostly copied from the existing writeObject, with a few "variable-int" writes added, as below (a sketch of such a variable-length encoder follows the method):

public void writeToStream(OutputStream s) throws IOException, InternalError
{
    ByteBuffer wordBuff = ByteBuffer.allocate(8);
    statisticsUpdate();            //  Update structure and stats if needed.
    writeInt(-2, s);               //  Header, ignore
    writeVInt(compactionCount, s); //  Needed to preserve value
    writeVInt(cache.length, s);    //  Needed to know where the last bit is

    /*  This is the number of index/value pairs to be written. */
    int count = cache.count;       //  Minimum number of words to be written
    writeVInt(count, s);

    final long[][][] a1 = bits;
    final int aLength1 = a1.length;
    long[][] a2;
    long[] a3;
    long word;
    int prev = 0;
    for (int w1 = 0; w1 != aLength1; ++w1)
        if ((a2 = a1[w1]) != null)
            for (int w2 = 0; w2 != LENGTH2; ++w2)
                if ((a3 = a2[w2]) != null)
                {
                    final int base = (w1 << SHIFT1) + (w2 << SHIFT2);
                    for (int w3 = 0; w3 != LENGTH3; ++w3)
                    {
                        if ((word = a3[w3]) != 0)
                        {
                            int k = base + w3;
                            //  To reduce size, write the index as a delta from the
                            //  previous index, encoded as a VInt.
                            writeVInt(k - prev, s);
                            prev = k;
                            //  The word itself is written as 8 raw bytes: writing it
                            //  as a VLong does not help because it is a huge long,
                            //  and writing a delta does not help because the words
                            //  are mostly random, not any predictable sequence.
                            //writeVLong(word, s);
                            byte[] wordBytes = wordBuff.putLong(word).array();
                            s.write(wordBytes);
                            wordBuff.clear();
                            --count;
                        }
                    }
                }
    if (count != 0) {
        throw new InternalError("count of entries not consistent");
    }
    /*  As a consistency check, write the hash code of the set. */
    writeInt(cache.hash, s);
}
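The writeVInt/writeInt helpers are not shown above; for reference, a variable-length encoder in the usual protobuf/Hadoop style looks roughly like the sketch below. This is one possible implementation meant to sit in the same class as writeToStream, not necessarily the exact helpers used here.

static void writeVLong(long v, OutputStream s) throws IOException
{
    //  7 bits per byte, high bit set on every byte except the last:
    //  small values take 1 byte, a full 64-bit value up to 10 bytes.
    while ((v & ~0x7FL) != 0) {
        s.write((int) ((v & 0x7F) | 0x80));
        v >>>= 7;
    }
    s.write((int) v);
}

static void writeVInt(int v, OutputStream s) throws IOException
{
    writeVLong(v & 0xFFFFFFFFL, s);  //  treat the int as unsigned
}

static void writeInt(int v, OutputStream s) throws IOException
{
    //  Fixed-width big-endian int, e.g. for the -2 header and the hash.
    s.write((v >>> 24) & 0xFF);
    s.write((v >>> 16) & 0xFF);
    s.write((v >>> 8) & 0xFF);
    s.write(v & 0xFF);
}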

The key point here is that the number of longs written to the stream is very large [the 3-D long arrays from the "bits" variable], and there is no scope for any compression tricks there. I did try Kryo, but it did not help in my case.
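To make that concrete, here is a rough illustration (varLongSize is a hypothetical helper, shown only to do the arithmetic): a 7-bits-per-byte varint spends one continuation bit per byte, so a 64-bit word with its high bits set needs 10 bytes, which is worse than the 8 bytes of a fixed-width long.

static int varLongSize(long v)
{
    //  Hypothetical helper, for illustration only: how many bytes would a
    //  7-bits-per-byte varint encoding of v need?
    int n = 1;
    while ((v & ~0x7FL) != 0) {
        v >>>= 7;
        n++;
    }
    return n;
}

//  varLongSize(0x8000000000000001L) == 10  -> worse than 8 fixed bytes
//  varLongSize(0x00000000000000FFL) == 2   -> varints only pay off for small values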

BTW, I also tried writing the random longs as an FST (finite-state transducer), purely as an experiment. Even then, no luck: the serialized size was much larger.