rust-fuzz / arbitrary

Generating structured data from arbitrary, unstructured input.

Home Page: https://docs.rs/arbitrary/

size_hint() has unclear semantics

Xiretza opened this issue:

The exact semantics of size_hint() are not clear from looking at the documentation and the existing impls. The lower bound specifically could have two different meanings:

  • A hard limit. If less data is available than specified, arbitrary() is expected to fail, so callers should not even try to call it.
  • A soft limit. All unique outputs can be achieved using larger inputs, so there is no advantage to using inputs shorter than the lower bound, though shorter inputs may still work.

Additionally, it's unclear whether the bounds apply only to arbitrary() or also to arbitrary_take_rest(). For more complex types, the two implementations can differ significantly: arbitrary() often needs some additional metadata bytes (increasing the lower bound), whereas arbitrary_take_rest() just passes the entire data on to its inner types.
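To make the difference concrete, here is a toy model of the two decoding styles for Vec<u8>. It is a sketch, not the crate's code: it assumes the continuation-byte scheme mentioned in the Vec<T> example of this issue (a bool read before each element, with the low bit meaning "continue") and assumes that exhausted input reads as zero. The function names are hypothetical.

```rust
/// Toy model of an `arbitrary()`-style decoder for `Vec<u8>`: before each
/// element, one "continuation" byte decides whether another element follows.
/// Missing bytes read as zero, so running out of input ends the vector.
fn decode_with_continuation(mut data: &[u8]) -> Vec<u8> {
    // Pop one byte, or yield 0 once the input is exhausted (assumed padding).
    fn next_byte(data: &mut &[u8]) -> u8 {
        match data.split_first() {
            Some((&b, rest)) => {
                *data = rest;
                b
            }
            None => 0,
        }
    }

    let mut out = Vec::new();
    // Low bit set means "another element follows" (assumed bool encoding).
    while (next_byte(&mut data) & 1) == 1 {
        out.push(next_byte(&mut data));
    }
    out
}

/// Toy model of `arbitrary_take_rest()`: every remaining byte is an element,
/// so no metadata bytes are needed.
fn decode_take_rest(data: &[u8]) -> Vec<u8> {
    data.to_vec()
}
```

In this model both the empty input and the single byte 0 decode to an empty vector through decode_with_continuation, while decode_take_rest yields an empty vector only for the empty input.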

Here are a few examples:

  • Integer types report (size_of::<T>(), Some(size_of::<T>())). However, they are constructed using Unstructured::fill_buffer(), which, if it "does not have enough underlying data to fill the whole buffer, [...] pads the buffer out with zeros". Thus, it's possible to construct integers even from completely empty buffers, making this a soft limit.
  • Vec<T> reports (size_of::<usize>(), None). This is completely incorrect, but assuming usize should be bool, it does describe a soft limit for arbitrary(): the empty vector can be produced both from the empty input and from a single false continuation byte, so the empty input is redundant. For arbitrary_take_rest(), however, this does not hold: the only way to get an empty Vec<u8> is from the empty input, so the size hint has to be (0, None), which is also the hard limit for arbitrary().
  • &[u8] reports (size_of::<usize>(), Some(size_of::<usize>())), which is even more wrong than Vec. Again, (size_of::<u8>(), None) would be a correct soft limit for arbitrary() (the empty input is equivalent to a single length byte), but (0, None) is required for arbitrary_take_rest() or as a hard limit for arbitrary().
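The integer case can be illustrated with a small stand-in for the zero-padding behaviour quoted in the first bullet. This is a simplified model rather than the crate's implementation: fill_buffer_padded and decode_u32 are hypothetical names, and little-endian decoding is an assumption made only for the example.

```rust
/// Simplified stand-in for the quoted zero-padding behaviour: copy as many
/// bytes as the input has, zero the remainder of `buf`, and advance the
/// input past the bytes that were consumed.
fn fill_buffer_padded(data: &mut &[u8], buf: &mut [u8]) {
    let n = data.len().min(buf.len());
    let (head, tail) = data.split_at(n);
    buf[..n].copy_from_slice(head);
    buf[n..].fill(0);
    *data = tail;
}

/// Decode a `u32` from whatever bytes are available.
/// Little-endian order is an assumption of this sketch.
fn decode_u32(data: &mut &[u8]) -> u32 {
    let mut buf = [0u8; 4];
    fill_buffer_padded(data, &mut buf);
    u32::from_le_bytes(buf)
}
```

Under this model an empty input still decodes successfully (to 0), which is why a lower bound of size_of::<T>() reads as a soft limit rather than a hard one.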

I think the implementations of some of these things have gotten out of sync with their size_hints.

The intended semantics are what you describe as a "hard limit": if there is less data than this, it isn't worth even trying to call T::arbitrary or T::arbitrary_take_rest. This is how libfuzzer-sys uses the size_hint: https://github.com/rust-fuzz/libfuzzer/blob/master/src/lib.rs#L170-L178
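A minimal sketch of what such a hard-limit gate looks like, under the semantics stated above. The names SizeHint, worth_trying, and run_gated are hypothetical and are not copied from libfuzzer-sys; the linked source is the authority on what that crate actually does.

```rust
/// A size hint in the shape used in this thread:
/// (lower bound, optional upper bound), both in bytes.
type SizeHint = (usize, Option<usize>);

/// Hard-limit reading of the lower bound: decoding is only worth
/// attempting when the input is at least that long.
fn worth_trying(input_len: usize, hint: SizeHint) -> bool {
    input_len >= hint.0
}

/// Hypothetical driver: runs `decode` only when the input passes the gate,
/// and returns `None` for inputs that are rejected up front.
fn run_gated<T>(data: &[u8], hint: SizeHint, decode: impl FnOnce(&[u8]) -> T) -> Option<T> {
    if worth_trying(data.len(), hint) {
        Some(decode(data))
    } else {
        None
    }
}
```

With a gate like this, a lower bound that is too high does more than waste inputs: values that are only reachable from shorter inputs are never generated at all.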

> This is how libfuzzer-sys uses the size_hint

Yep, that's exactly how I found this. If fuzz_target!() is used to generate (T, U, ..., &[u8]) inputs, the byte slice will never be shorter than 8 elements (size_of::<usize>() on a 64-bit target). I'll put together a PR.
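The arithmetic behind that observation can be sketched for a two-element tuple (u32, &[u8]). This is an illustration under assumptions: the and helper below is a hypothetical local function that adds the bounds of values decoded one after another, the u32 is assumed to consume exactly four bytes, and the trailing slice is assumed to take whatever remains.

```rust
use std::mem::size_of;

/// (lower bound, optional upper bound), both in bytes.
type SizeHint = (usize, Option<usize>);

/// Combine the hints of two values decoded one after the other:
/// lower bounds add, and upper bounds add when both are known.
fn and(lhs: SizeHint, rhs: SizeHint) -> SizeHint {
    let upper = match (lhs.1, rhs.1) {
        (Some(a), Some(b)) => a.checked_add(b),
        _ => None,
    };
    (lhs.0 + rhs.0, upper)
}

/// Shortest trailing slice a gated `(u32, &[u8])` target can observe,
/// given the hint reported for the slice. Inputs below the combined
/// lower bound are assumed to be rejected before decoding.
fn min_trailing_slice_len(slice_hint: SizeHint) -> usize {
    let u32_hint = (size_of::<u32>(), Some(size_of::<u32>()));
    let min_input = and(u32_hint, slice_hint).0;
    // The u32 consumes its four bytes; the slice gets the remainder.
    min_input - size_of::<u32>()
}
```

With the slice hint described in this issue, (size_of::<usize>(), Some(size_of::<usize>())), the result is size_of::<usize>(), which is 8 on a 64-bit target; with (0, None) it drops to 0, so empty slices become reachable.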