size_hint() has unclear semantics
Xiretza opened this issue
The exact semantics of size_hint() are not clear from the documentation or the existing impls. The lower bound in particular could have two different meanings:
- A hard limit. If less data is available than specified, arbitrary() is expected to fail, so don't even try to call it.
- A soft limit. All unique outputs can be achieved using larger inputs, so there is no advantage to using inputs shorter than the lower bound, though it may still work.
Additionally, it's unclear whether the bounds apply only to arbitrary() or also to arbitrary_take_rest(). For more complex types, these implementations can differ significantly: arbitrary() often needs some additional metadata bytes (increasing the lower bound), while arbitrary_take_rest() just passes the entire data on to its inner types.
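To make the difference concrete, here is a minimal, self-contained sketch of the two decoding strategies for a Vec<u8>. It is a simplified model, not the crate's code: the function names vec_arbitrary and vec_take_rest are hypothetical, and the encoding (one continuation byte before each element, with missing bytes treated as zero) is an assumption chosen to match the behaviour discussed in this issue.

```rust
/// Hypothetical model of `arbitrary()` for `Vec<u8>`.
/// Assumption: every element is preceded by one continuation byte whose
/// lowest bit says "keep going"; a missing continuation byte means "stop".
fn vec_arbitrary(data: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut pos = 0;
    loop {
        // The metadata byte: read it if present, otherwise stop.
        let keep_going = data.get(pos).map_or(false, |b| b & 1 == 1);
        if !keep_going {
            break;
        }
        // Assumption: a missing element byte is zero-padded.
        out.push(data.get(pos + 1).copied().unwrap_or(0));
        pos += 2;
    }
    out
}

/// Hypothetical model of `arbitrary_take_rest()` for `Vec<u8>`:
/// no metadata at all, every remaining byte becomes an element.
fn vec_take_rest(data: &[u8]) -> Vec<u8> {
    data.to_vec()
}
```

In this model, vec_arbitrary yields the empty vector for both the empty input and the single byte 0, whereas vec_take_rest yields the empty vector only for the empty input. A bound that is reasonable for one function is therefore not automatically valid for the other.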
Here are a few examples:
- Integer types are (size_of::<T>(), Some(size_of::<T>())). However, they are constructed using Unstructured::fill_buffer(), which, if it "does not have enough underlying data to fill the whole buffer, it pads the buffer out with zeros." Thus, it's possible to construct integers even from completely empty buffers, making this a soft limit.
- Vec<T> is (size_of::<usize>(), None). This is completely incorrect, but assuming usize should be bool, it does describe a soft limit for arbitrary(): the empty vector can be achieved both using the empty input as well as a single false continuation byte, so the empty input is redundant. However, for arbitrary_take_rest(), this does not hold: the only way to get an empty Vec<u8> is from the empty input, so the size hint has to be (0, None), which is also the hard limit for arbitrary().
- &[u8] is (size_of::<usize>(), Some(size_of::<usize>())), which is even more wrong than Vec, but again (size_of::<u8>(), None) would be a correct soft limit for arbitrary() (the empty input is equivalent to a single length byte), but (0, None) is required for arbitrary_take_rest() or as an arbitrary() hard limit.
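The integer case can be illustrated with a small stand-in for the zero-padding behaviour quoted above. This is only a sketch: fill_buffer_padded and u32_from_input are hypothetical names, not the crate's API, and the little-endian byte order is an assumption made for the example.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for the zero-padding behaviour quoted from the
/// `Unstructured::fill_buffer()` documentation: copy what is available,
/// then pad the remainder of `buf` with zeros.
fn fill_buffer_padded(data: &[u8], buf: &mut [u8]) {
    let n = data.len().min(buf.len());
    buf[..n].copy_from_slice(&data[..n]);
    for b in &mut buf[n..] {
        *b = 0;
    }
}

/// Build a `u32` from however many bytes the input has.
/// Assumption: little-endian byte order, chosen only for illustration.
fn u32_from_input(data: &[u8]) -> u32 {
    let mut buf = [0u8; size_of::<u32>()];
    fill_buffer_padded(data, &mut buf);
    u32::from_le_bytes(buf)
}
```

Under these assumptions the empty input, the one-byte input [0], and the four-byte input [0, 0, 0, 0] all produce the same value 0, so a lower bound of size_of::<u32>() excludes no value that shorter inputs could not also be mapped to. That is the sense in which the integer bound behaves as a soft limit.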
I think the implementations of some of these things have gotten out of sync with their size_hints.
The intended semantics are what you describe as a "hard limit": if there is less data than this, it isn't worth even trying to call T::arbitrary or T::arbitrary_take_rest. This is how libfuzzer-sys uses the size_hint: https://github.com/rust-fuzz/libfuzzer/blob/master/src/lib.rs#L170-L178
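The kind of check a harness can build on a hard lower bound might look roughly like the following. This is a sketch and not the libfuzzer-sys source: worth_trying, reported_byte_slice_hint, and hard_byte_slice_hint are hypothetical helpers, and the two hint values are the ones discussed for &[u8] earlier in this issue.

```rust
use std::mem::size_of;

/// Hypothetical harness check: with hard-limit semantics, an input shorter
/// than the lower bound is skipped without calling `arbitrary()` at all.
fn worth_trying(input_len: usize, hint: (usize, Option<usize>)) -> bool {
    input_len >= hint.0
}

/// The `&[u8]` hint as reported in this issue.
fn reported_byte_slice_hint() -> (usize, Option<usize>) {
    (size_of::<usize>(), Some(size_of::<usize>()))
}

/// The hint this issue argues is the correct hard limit for `&[u8]`.
fn hard_byte_slice_hint() -> (usize, Option<usize>) {
    (0, None)
}
```

With the reported hint, every input shorter than size_of::<usize>() bytes is rejected before decoding, so short byte slices can never be generated. With (0, None), even the empty input is passed through.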
> This is how libfuzzer-sys uses the size_hint
Yep, that's exactly how I found this. If fuzz_target!() is used to generate (T, U, ..., &[u8]) inputs, the byte slice will never be shorter than 8 elements. I'll craft up a PR.