near / borsh

Binary Object Representation Serializer for Hashing

Home Page: https://borsh.io/


Capacity allocation for vector deserialization

ilblackdragon opened this issue · comments

Currently, because we don't know the size of the buffer inside the deserialize method, we can't tell whether a length prefix is far too large and should error out (right now it would segfault due to a memory allocation failure).

See test_invalid_length for example.

> Currently because we don't know the size of the buffer inside the deserialize method we can't predict if the length is way too large and should error out

One way is to wrap the cursor with something like this:

use std::io;

/// A reader that errors out once more than `limit` bytes have been requested,
/// so a corrupted length prefix cannot drive an unbounded allocation.
pub struct LimitingReader<R> {
    reader: R,
    limit: u64,
}

impl<R> LimitingReader<R> {
    pub fn new(reader: R, limit: u64) -> Self {
        LimitingReader { reader, limit }
    }
}

impl<R> io::Read for LimitingReader<R>
where
    R: io::Read,
{
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Charge the requested size against the remaining budget up front;
        // if this request alone would exceed it, refuse to read.
        match self.limit.checked_sub(buf.len() as u64) {
            Some(new_limit) => {
                self.limit = new_limit;
                self.reader.read(buf)
            }
            None => Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "Read limit exceeded",
            )),
        }
    }
}

We should probably rewrite borsh to use slices instead of Read/Write. It might give us a significant performance boost. In that case, we could wrap the slice into LimitingReader instead.

So when LimitingReader was suggested, it might have been referring to two different use cases:

  1. Making sure that as we deserialize a data structure, the deserialized object is less than X bytes, even if the serialized form is valid, i.e. not corrupted;
  2. Making sure that as we deserialize a data structure, an invalid serialized form (e.g. due to corruption) does not cause us to allocate a Vec<> longer than the actual number of elements available in the serialized form.

As now discussed with @ilblackdragon, protecting from (1) does not make sense. In most cases the data structure we are deserializing into is not many orders of magnitude larger than its serialized form, so if we were willing to hold the serialized struct in memory (e.g. when receiving it over RPC or the network), then it is safe to allow its deserialization. The deserializers are implemented either: a) in borsh, for primitive types; or b) in nearcore, for our own types; so we know they are not doing something crazy. It is, however, up to a third-party developer using borsh not to implement a deserializer that decodes 10 bytes into 10 GiB under some condition.

Protecting from (2) seems to be already done through fixes like #28.

We still need to go over the deserializers implemented in borsh and nearcore one more time to make sure we are protected from (2). Action item for @fckt.

I just bumped into this. 42% of my application runtime is consumed by __rust_realloc due to the conservative capacity passed to Vec::with_capacity.

In my use case there's no risk of either (1) or (2) in @nearmax's comment. Perhaps the conservative behaviour could be configurable?