Reduce allocations in `BytesEncode`
nolanderc opened this issue · comments
The `BytesEncode` trait is currently defined as follows (lines 20 to 26 in commit 3406383):
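For readers without the repository at hand, the trait looks roughly like this (a sketch from memory; the exact pinned lines may differ), together with a codec impl that shows the forced allocation:

```rust
use std::borrow::Cow;

// Rough shape of the current trait (sketch; the pinned revision may differ).
pub trait BytesEncode<'a> {
    type EItem: ?Sized + 'a;
    fn bytes_encode(item: &'a Self::EItem) -> Option<Cow<'a, [u8]>>;
}

// A big-endian u32 codec: the byte swap forces an owned, heap-allocated buffer.
pub struct BEU32;

impl<'a> BytesEncode<'a> for BEU32 {
    type EItem = u32;
    fn bytes_encode(item: &'a u32) -> Option<Cow<'a, [u8]>> {
        Some(Cow::Owned(item.to_be_bytes().to_vec())) // one heap allocation per key
    }
}

fn main() {
    let bytes = BEU32::bytes_encode(&1).unwrap();
    assert_eq!(bytes.as_ref(), &[0u8, 0, 0, 1][..]);
}
```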
There are some issues with this definition:
- Unless the encoded type can be trivially transmuted to a slice of bytes (`str`, `[u8]`, `U32<NativeEndian>`, ...), we have to allocate a temporary buffer where the value is written.
- There is no way to get a temporary buffer without going through the global allocator.
- Allocating a temporary buffer incurs an extra copy.
For example, most architectures in use today are little-endian (x86, Arm); however, for lexicographic ordering to sort integers correctly, we have to store them in big-endian. Since this incurs a byte swap on little-endian architectures, we cannot simply return a pointer to the integer itself; instead, we have to allocate a temporary buffer to hold the byte-swapped result. That is one heap allocation per integer, something which could have been done trivially on the stack or in a CPU register.
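A minimal demonstration of why big-endian is required: comparing the raw bytes of two integers lexicographically only agrees with their numeric order when they are stored big-endian.

```rust
fn main() {
    let (a, b) = (1u32, 256u32);

    // Big-endian bytes compare lexicographically in the same order as the integers.
    assert!(a.to_be_bytes() < b.to_be_bytes());

    // Little-endian bytes do not: 1 is [1, 0, 0, 0] but 256 is [0, 1, 0, 0].
    assert!(a.to_le_bytes() > b.to_le_bytes());
}
```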
Optimizations
In most cases we know, a priori, the maximum size of the needed allocation:
- LMDB forces keys to be 511 bytes or less.
- Primitive types and arrays have a compile-time known size.
In some cases it may be preferable to run serialization in two phases:
- First, we serialize to a "null" writer which counts the number of bytes needed for the value.
- With this size in hand, we can have LMDB reserve a memory area in the database which fits the serialized value (using `MDB_RESERVE`).
- We then serialize our value directly into the reserved space.
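The first phase can be a trivial `std::io::Write` implementation that discards the bytes and only tracks the count. A sketch (`CountingWriter` is a hypothetical helper, not part of heed):

```rust
use std::io::{self, Write};

/// A "null" writer: discards the bytes and only counts how many were written.
struct CountingWriter {
    count: usize,
}

impl Write for CountingWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.count += buf.len();
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn main() {
    let mut counter = CountingWriter { count: 0 };
    counter.write_all(&42u64.to_be_bytes()).unwrap();
    counter.write_all(b"suffix").unwrap();
    assert_eq!(counter.count, 14); // 8 bytes + 6 bytes
}
```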
A new trait
With the above in mind, we can make a few changes to the `BytesEncode` trait (naming TBD):
```rust
/// Set to `true` if the value can be trivially cast to `&[u8]` (e.g., `str`, `[u8]`, ...).
/// This is used as a hint by the library to avoid using intermediate buffers.
const ZERO_COPY: bool = false;

/// A hint to the database about the final size of the encoded value.
/// Can be used to reduce allocations and memory copies.
///
/// Must be exactly equal to the final encoded size,
/// otherwise encoding will result in an error.
fn size_hint(item: &Self::EItem) -> Option<usize> {
    None
}

/// Encode the value into a pre-allocated buffer.
/// The buffer could be allocated on the stack if we are encoding keys (511 bytes),
/// or allocated on the heap (with an initial capacity equal to `size_hint`).
fn encode_writer<W: std::io::Write>(item: &Self::EItem, writer: &mut W) -> Result<()> {
    // the default impl forwards to the old method
    writer.write_all(&Self::bytes_encode(item)?)?;
    Ok(())
}
```
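Under this definition, an integer codec needs no allocation at all. A hypothetical implementation might look like this (a sketch only; the trait name `BytesEncode2` and its exact shape are assumptions based on the proposal, not heed's actual API):

```rust
use std::io::Write;

// Assumed trait shape, following the proposal (not heed's actual API).
trait BytesEncode2 {
    type EItem: ?Sized;
    const ZERO_COPY: bool = false;
    fn size_hint(_item: &Self::EItem) -> Option<usize> {
        None
    }
    fn encode_writer<W: Write>(item: &Self::EItem, writer: &mut W) -> std::io::Result<()>;
}

struct BEU32;

impl BytesEncode2 for BEU32 {
    type EItem = u32;

    // The encoded size of a u32 is always known at compile time.
    fn size_hint(_item: &u32) -> Option<usize> {
        Some(4)
    }

    // The byte swap happens in a register; no heap allocation is needed.
    fn encode_writer<W: Write>(item: &u32, writer: &mut W) -> std::io::Result<()> {
        writer.write_all(&item.to_be_bytes())
    }
}

fn main() {
    let mut buf = [0u8; 4]; // stack buffer, sized from `size_hint`
    let mut slice = &mut buf[..];
    BEU32::encode_writer(&0x0102_0304, &mut slice).unwrap();
    assert_eq!(buf, [1, 2, 3, 4]);
}
```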
Based on the value of `ZERO_COPY`, we can decide between calling `bytes_encode` to get our slice directly, or using a pre-allocated buffer with `encode_writer`.

We could also consider making `encode_writer` the required method and defining `bytes_encode` in terms of `encode_writer` with a `Vec<u8>` as our `Write`r. However, this would be a breaking change.
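The breaking variant could default `bytes_encode` roughly as follows, since `Vec<u8>` implements `std::io::Write` (a sketch on an assumed trait shape, not heed's actual API):

```rust
use std::borrow::Cow;
use std::io::{self, Write};

// Assumed trait shape for illustration; not heed's actual API.
trait BytesEncode2 {
    type EItem: ?Sized;

    // `encode_writer` becomes the required method...
    fn encode_writer<W: Write>(item: &Self::EItem, writer: &mut W) -> io::Result<()>;

    // ...and `bytes_encode` gets a default impl that collects into a `Vec<u8>`.
    fn bytes_encode(item: &Self::EItem) -> io::Result<Cow<'_, [u8]>> {
        let mut buffer = Vec::new();
        Self::encode_writer(item, &mut buffer)?;
        Ok(Cow::Owned(buffer))
    }
}

struct BEU32;

impl BytesEncode2 for BEU32 {
    type EItem = u32;
    fn encode_writer<W: Write>(item: &u32, writer: &mut W) -> io::Result<()> {
        writer.write_all(&item.to_be_bytes())
    }
}

fn main() {
    // The default `bytes_encode` still works, at the cost of a heap allocation.
    assert_eq!(BEU32::bytes_encode(&256).unwrap().as_ref(), &[0u8, 0, 1, 0][..]);
}
```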
Example Usage
Let's look at how we might implement `Database::put` with the above definition (pseudocode):
```rust
fn put(db, key: impl BytesEncode, value: impl BytesEncode) {
    if Key::ZERO_COPY {
        key_bytes = key.bytes_encode()?; // should yield &[u8] without allocation
    } else {
        let mut buffer = [0u8; 511];
        let mut array_writer = std::io::Cursor::new(&mut buffer[..]);
        key.encode_writer(&mut array_writer)?;
        key_bytes = &buffer[..array_writer.position() as usize];
    }

    if !Value::ZERO_COPY {
        if let Some(size) = value.size_hint() {
            // allocate space directly in the DB for the value before writing
            return db.put_reserve(key_bytes, size, |space| value.encode_writer(space));
        }
    }

    // Should yield `&[u8]` directly if ZERO_COPY is true.
    // Otherwise, the implementation likely makes better decisions about allocating a `Vec`.
    // Although, we could possibly pre-allocate a small buffer on the stack, something the implementation cannot.
    value_bytes = value.bytes_encode()?;

    mdb_put(db, key_bytes, value_bytes)?;
}
```
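The key branch above relies on `std::io::Cursor` writing into a fixed stack array, which already works in today's standard library (511 bytes is LMDB's default maximum key size):

```rust
use std::io::{Cursor, Write};

fn main() {
    // 511 bytes is LMDB's default maximum key size, so keys always fit on the stack.
    let mut buffer = [0u8; 511];
    let mut writer = Cursor::new(&mut buffer[..]);

    // Encode a big-endian u32 key without touching the heap.
    writer.write_all(&42u32.to_be_bytes()).unwrap();

    let len = writer.position() as usize;
    let key_bytes = &buffer[..len];
    assert_eq!(key_bytes, &[0u8, 0, 0, 42][..]);
}
```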
Note that we can avoid all temporary allocations and intermediate memory copies in every case except one: when the resulting size of `value` is unknown.
Thank you for this proposal @nolanderc, very appreciated 😃
However, there is one minor issue with the `ZERO_COPY` parameter: it should be possible to determine it dynamically from the given `Self::EItem`. I advise you to look at the Meilisearch use case, which is probably the project that uses heed the most. We use a `CboRoaringBitmap` codec (built on `RoaringBitmap`s) that conditionally stores integers one after the other (when there are not many) or falls back to the default serialization method.
I am pretty sure we can apply this optimization to the Meilisearch codebase and see the performance gains 🤩