RoaringBitmap / roaring

Roaring bitmaps in Go (golang), used by InfluxDB, Bleve, DataDog

Home Page: http://roaringbitmap.org/

[suggestion]: implement non-container-optimizing operations

Oppen opened this issue

Many in-place operations convert between container types for efficiency. This makes sense when an operation involves only two containers or bitmaps, because you want an optimal container at the end. Operations over many containers, however, would probably benefit from keeping a single accumulator container suited to the task for most of the operation, which avoids putting pressure on the GC. For example, a bitset may be appropriate for operating on many containers at a time, but because it has to interact with other container types, we risk converting it back and forth. For this to be viable, we would need to avoid those conversions altogether in these cases.

So, the idea would be to, for example, have an or implementation for bitsets where any container simply accumulates its present values on it, without ever changing it to, for example, an RLE16 container.

I am not sure whether those conversions tend to stabilize as we iterate over more containers, though.
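To make the proposal concrete, here is a minimal sketch of such an accumulator, assuming a fixed 1024-word bitset per 16-bit chunk. The names `accumulator`, `interval`, `orArray`, `orBitmap`, and `orRuns` are hypothetical and are not part of the roaring package; the point is only that every input kind is folded into the same bitset and the bitset is never converted.

```go
package main

import (
	"fmt"
	"math/bits"
)

// interval is a hypothetical stand-in for one run: start..last, inclusive.
type interval struct{ start, last uint16 }

// accumulator is a hypothetical fixed-size bitset covering one 16-bit chunk
// (65536 bits = 1024 words). It never changes representation.
type accumulator struct {
	words [1024]uint64
}

// orArray sets the bit for each value of an array-style container.
func (a *accumulator) orArray(values []uint16) {
	for _, v := range values {
		a.words[v>>6] |= uint64(1) << (v & 63)
	}
}

// orBitmap ORs another bitset into the accumulator word by word.
func (a *accumulator) orBitmap(other *[1024]uint64) {
	for i := range a.words {
		a.words[i] |= other[i]
	}
}

// orRuns sets every value of each inclusive interval. This loops bit by bit
// for clarity; a real implementation would fill whole words at once.
func (a *accumulator) orRuns(runs []interval) {
	for _, r := range runs {
		for v := int(r.start); v <= int(r.last); v++ {
			a.words[v>>6] |= uint64(1) << uint(v&63)
		}
	}
}

// cardinality counts the set bits once, after all inputs are merged.
func (a *accumulator) cardinality() int {
	n := 0
	for i := range a.words {
		n += bits.OnesCount64(a.words[i])
	}
	return n
}

func main() {
	var acc accumulator
	acc.orArray([]uint16{1, 5, 70})
	acc.orRuns([]interval{{start: 4, last: 8}})

	var other [1024]uint64
	other[1] = 1 << 6 // value 70, already present
	acc.orBitmap(&other)

	fmt.Println(acc.cardinality()) // 7: {1, 4, 5, 6, 7, 8, 70}
}
```

Whether this beats converting as you go depends on how dense the final result is, which is the open question raised above.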

So, the idea would be to, for example, have an or implementation for bitsets where any container simply accumulates its present values on it, without ever changing it to, for example, an RLE16 container.

Why don't the lazy OR functions suit this need?

roaring/bitmapcontainer.go

Lines 451 to 486 in ff33c3b

func (bc *bitmapContainer) lazyIOR(a container) container {
	switch x := a.(type) {
	case *arrayContainer:
		return bc.lazyIORArray(x)
	case *bitmapContainer:
		return bc.lazyIORBitmap(x)
	case *runContainer16:
		if x.isFull() {
			return x.clone()
		}
		// Manually inlined setBitmapRange function
		bitmap := bc.bitmap
		for _, iv := range x.iv {
			start := int(iv.start)
			end := int(iv.last()) + 1
			if start >= end {
				continue
			}
			firstword := start / 64
			endword := (end - 1) / 64
			if firstword == endword {
				bitmap[firstword] |= (^uint64(0) << uint(start%64)) & (^uint64(0) >> (uint(-end) % 64))
				continue
			}
			bitmap[firstword] |= ^uint64(0) << uint(start%64)
			for i := firstword + 1; i < endword; i++ {
				bitmap[i] = ^uint64(0)
			}
			bitmap[endword] |= ^uint64(0) >> (uint(-end) % 64)
		}
		bc.cardinality = invalidCardinality
		return bc
	}
	panic("unsupported container type")
}
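The mask arithmetic in the run branch is easy to misread, so here is a small standalone check of the single-word case. `wordMask` is a hypothetical helper written for this illustration, not a function in the library; it assumes `start < end` and that both ends fall in the same 64-bit word.

```go
package main

import "fmt"

// wordMask returns a mask with bits [start, end) set within one 64-bit word.
// Assumes start < end and start/64 == (end-1)/64.
func wordMask(start, end int) uint64 {
	// Clear everything below the first bit of the range.
	low := ^uint64(0) << uint(start%64)
	// uint(-end) % 64 equals (64 - end%64) % 64, so this clears everything
	// at or above the end of the range, and clears nothing when end is a
	// multiple of 64.
	high := ^uint64(0) >> (uint(-end) % 64)
	return low & high
}

func main() {
	fmt.Printf("%#x\n", wordMask(4, 8))    // 0xf0: bits 4..7
	fmt.Printf("%#x\n", wordMask(64, 128)) // 0xffffffffffffffff: a whole word
}
```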

Note that we always leave the bitmaps in a good state after a user-visible operation, so lazy operations are for internal use only.

We do not want to expose container types as part of our public API: users should not have to worry about the internal implementation.

I thought they only saved the cardinality computing step; I didn't even think of those 🤦
I think I can close this one then. I can always reopen it if I find them unsuitable.

I thought they only saved the cardinality computing step

Generally, the idea is that they are 'lazy' and try to do as little work as possible while optimizing for the scenario where you are going to re-aggregate the container soon.
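As a rough illustration of that pattern, the sketch below merges several bitsets lazily and restores a valid state once at the end. The type `lazyBitset`, the sentinel `unknownCardinality`, and the methods `lazyOr` and `repair` are hypothetical names chosen for this example and only loosely mirror the library's internals.

```go
package main

import (
	"fmt"
	"math/bits"
)

// unknownCardinality is a hypothetical sentinel meaning "not computed yet".
const unknownCardinality = -1

// lazyBitset is a hypothetical bitset that caches its cardinality.
type lazyBitset struct {
	words       [1024]uint64
	cardinality int
}

// fromValues builds a bitset in a valid state (cardinality known).
func fromValues(values ...uint16) *lazyBitset {
	b := &lazyBitset{cardinality: unknownCardinality}
	for _, v := range values {
		b.words[v>>6] |= uint64(1) << (v & 63)
	}
	b.repair()
	return b
}

// lazyOr merges other into b and skips the population count, leaving b in a
// state that is only safe for further internal aggregation.
func (b *lazyBitset) lazyOr(other *lazyBitset) {
	for i := range b.words {
		b.words[i] |= other.words[i]
	}
	b.cardinality = unknownCardinality
}

// repair recomputes the cardinality once, if it was left unknown.
func (b *lazyBitset) repair() {
	if b.cardinality != unknownCardinality {
		return
	}
	n := 0
	for i := range b.words {
		n += bits.OnesCount64(b.words[i])
	}
	b.cardinality = n
}

func main() {
	acc := fromValues(1, 2, 3)
	acc.lazyOr(fromValues(3, 4))
	acc.lazyOr(fromValues(100))
	acc.repair() // one count for the whole aggregation

	fmt.Println(acc.cardinality) // 5: {1, 2, 3, 4, 100}
}
```

The repair step is what keeps the result in a good state before it becomes visible to the user, which is why the lazy variants stay internal.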