brettwooldridge / SparseBitSet

An efficient sparse bit set implementation for Java

Java memory size much larger than raw bytes

ajorgensen opened this issue · comments

This may be more of a question than an issue, but I'm trying to understand why the serialized byte representation of this class is so much smaller than its in-memory Java representation.

If I plot the memory the data structure uses as it grows from 0 to 10M entries, it looks something like this:

image

So it climbs very quickly and then reaches a maximum size, whereas the serialized size grows linearly with the data set.

I guess my question is whether this is a product of the implementation of this data structure, and whether there is some tuning we could do to make the in-memory size track the serialized byte representation more closely.

Can I suggest you represent the number of bytes per entry on the y axis?

The entries in this case are simply UUIDs that get hashed with:

  public static int hash(String id) {
    int x = id.hashCode();
    return Math.abs(x) % (Integer.MAX_VALUE - 1);
  }

I then just called set on the SparseBitSet at the index each UUID hashes to. So in this case the values that can be added to the set fall between 0 and Integer.MAX_VALUE - 1.
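One edge case in the hash above is worth noting: Math.abs(Integer.MIN_VALUE) is still negative in Java, so an id whose hashCode() happens to be Integer.MIN_VALUE would produce a negative index. The sketch below reproduces that on a raw int and shows one possible alternative using Math.floorMod. The class and method names are hypothetical and are not part of the original test code.

```java
class HashEdgeCase {
    // Same arithmetic as the hash in the post, applied to a raw hash code.
    static int hashWithAbs(int hashCode) {
        return Math.abs(hashCode) % (Integer.MAX_VALUE - 1);
    }

    // Hypothetical alternative: floorMod always yields a value in [0, modulus).
    static int hashWithFloorMod(int hashCode) {
        return Math.floorMod(hashCode, Integer.MAX_VALUE - 1);
    }

    public static void main(String[] args) {
        // Math.abs overflows for Integer.MIN_VALUE, so the result stays negative.
        System.out.println(hashWithAbs(Integer.MIN_VALUE));
        // floorMod keeps the result inside the non-negative index range.
        System.out.println(hashWithFloorMod(Integer.MIN_VALUE));
    }
}
```

Note that floorMod maps negative hash codes to different indices than abs does, so the two variants are not interchangeable for data that has already been indexed.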

I experimented with paring down the possible index space by taking the hash modulo a value smaller than Integer.MAX_VALUE - 1. This predictably increased the number of collisions and did reduce memory use; however, a similar gap remained between the serialized byte representation and the in-memory representation.

@lemire FWIW I looked at your profile and saw https://github.com/RoaringBitmap/RoaringBitmap. I ran it through the same test as the SparseBitSet and it seems to produce a more consistent result.

image

@ajorgensen

Your plots are nice. You should open source the code you use to generate them. I'd replace megabytes on the y-axis with the number of bytes used per value; it would make it easier to reason about memory usage.
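As a rough illustration of that normalization, the sketch below divides a total footprint by the number of entries. The class name, method name, and the sample figures are hypothetical placeholders, not measurements from this thread.

```java
class BytesPerEntry {
    // Normalizes a total footprint to bytes per stored entry.
    static double bytesPerEntry(long totalBytes, long entries) {
        if (entries <= 0) {
            throw new IllegalArgumentException("entries must be positive");
        }
        return (double) totalBytes / entries;
    }

    public static void main(String[] args) {
        // Placeholder figures only: the same total size at two entry counts.
        long totalBytes = 256L * 1024 * 1024;
        System.out.println(bytesPerEntry(totalBytes, 1_000_000L));
        System.out.println(bytesPerEntry(totalBytes, 10_000_000L));
    }
}
```

Plotted this way, a structure whose total size has plateaued shows a falling per-entry cost, which is easier to compare against a format whose size grows linearly.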

Anyhow, I have my own benchmarks elsewhere (you can find them on my profile) which indicate that SparseBitSet is consistently fast... but it does use more memory than some alternatives.

It is interesting to ask whether the memory usage could be reduced. I suspect it could.

@lemire Sounds great. I can put the code up in a gist; the plot is just a Google spreadsheet.

For our use case I think memory is more important than raw performance, so we're going to run RoaringBitmap against real-world data and compare the results, but from the small test I did it looks awesome.

I'll go check out your benchmarks. Thanks for the help!

One simple thing I did try was converting all the internal long maps to int, which reduced the memory footprint by about half (which is understandable). I didn't quite get serialization/deserialization working correctly, though. It would be interesting to think about ways the internal data representation could be modified to tune how fast the memory profile grows.

@ajorgensen For your use-case (using a hash as the bit index), I think SparseBitSet is definitely sub-optimal, and a compressed bitset is likely to serve you better.

Basically, a uniformly random distribution of bits is the worst-case scenario for SparseBitSet in terms of memory footprint (CPU cost should be unaffected). On the other hand, clustered groups of bits or long runs of bits are going to work well.
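To make the worst case concrete, here is a rough sketch (not SparseBitSet's actual code) that counts how many fixed-size blocks a set of indices touches. The 2048-bit block size, the class name BlockOccupancy, and the helper names are assumptions for illustration only.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

class BlockOccupancy {
    // Assumed block size for illustration: 2048 bits (32 words of 64 bits).
    static final int BITS_PER_BLOCK = 2048;

    // Counts the distinct blocks that contain at least one of the given indices.
    static int blocksTouched(int[] indices) {
        Set<Integer> blocks = new HashSet<>();
        for (int index : indices) {
            blocks.add(index / BITS_PER_BLOCK);
        }
        return blocks.size();
    }

    // Clustered case: consecutive indices starting at 0.
    static int[] clustered(int count) {
        int[] indices = new int[count];
        for (int i = 0; i < count; i++) {
            indices[i] = i;
        }
        return indices;
    }

    // Scattered case: pseudo-random indices over the range used in the thread.
    static int[] scattered(int count, long seed) {
        Random random = new Random(seed);
        int[] indices = new int[count];
        for (int i = 0; i < count; i++) {
            indices[i] = random.nextInt(Integer.MAX_VALUE - 1);
        }
        return indices;
    }

    public static void main(String[] args) {
        int count = 100_000;
        System.out.println("clustered blocks: " + blocksTouched(clustered(count)));
        System.out.println("scattered blocks: " + blocksTouched(scattered(count, 42L)));
    }
}
```

If each touched block is allocated in full, memory tracks the number of blocks touched rather than the number of bits set. Under that assumption, scattered indices pay for almost one block per entry until most blocks exist, which would be consistent with a footprint that climbs quickly and then flattens.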

Given that a good hash should be randomly distributed over the range of its type (32/64-bit), I don't think SparseBitSet is what you are looking for, unless your UUIDs have some subcomponent that is roughly sequential or chronological and that you could use as the bit index. In that case SparseBitSet would likely consume quite a bit less memory.

@brettwooldridge thanks so much for the information, I think that's the piece I was missing. Our UUIDs are just standard UUIDs, so they should be pretty well distributed. Do you think it would be possible to put some of this information in the readme? I could generate some graphs similar to the ones above to illustrate it, if you think that would be helpful. It might help people trying to decide which solution is best for their use case.