rust-fuzz / arbitrary

Generating structured data from arbitrary, unstructured input.

Home Page: https://docs.rs/arbitrary/

Arbitrary generates mostly empty strings

matusf opened this issue

Hello, I'd like to use Arbitrary to create some Strings. However, I noticed that Arbitrary produces a lot of empty strings. Am I missing something? Thank you.

Test program:

use arbitrary::{Arbitrary, Unstructured};
use rand::{thread_rng, Rng};

fn main() {
    // 4 MiB of random bytes; heap-allocated so the buffer cannot overflow the stack.
    let mut seed = vec![0u8; 2048 * 2048];
    thread_rng().fill(&mut seed[..]);
    let mut generator = Unstructured::new(&seed);

    // Draw 49 strings from the same buffer, skipping any that fail to generate.
    let t = (1..50)
        .filter_map(|_| String::arbitrary(&mut generator).ok())
        .collect::<Vec<String>>();

    println!("{:?}", t);
}

Output from several runs (mostly empty strings):

matus@pine arbit_test (master)> cargo -q r
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/arbit_test`
["", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]
matus@pine arbit_test (master)> cargo -q r
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/arbit_test`
["", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]
matus@pine arbit_test (master)> cargo -q r
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/arbit_test`
["u\r", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]
matus@pine arbit_test (master)> cargo -q r
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/arbit_test`
["", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]
matus@pine arbit_test (master)> cargo -q r
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/arbit_test`
["١ȣ\u{5}", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]
matus@pine arbit_test (master)> cargo -q r
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/arbit_test`
["", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]
matus@pine arbit_test (master)> cargo -q r
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/arbit_test`
["X", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]
matus@pine arbit_test (master)> cargo -q r
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/arbit_test`
["", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]
matus@pine arbit_test (master)> cargo -q r
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/arbit_test`
["", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]
matus@pine arbit_test (master)> cargo -q r
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/arbit_test`
["v", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]

So, I think that's probably intended, if unfortunate. The current implementation interprets the unstructured data as UTF-8 only for as long as the bytes form valid UTF-8.

When you generate the underlying data with a PRNG, most of it will not be valid UTF-8, so most offsets in a given unstructured buffer will point at invalid UTF-8.
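To make the "valid only up to the first bad byte" behaviour concrete, here is a minimal standard-library sketch. It is not the crate's code; the helper name `valid_utf8_prefix` is made up for illustration. It only shows how the longest valid UTF-8 prefix of a byte slice can be taken, and why a buffer that starts with an invalid byte yields an empty string.

```rust
use std::str;

/// Hypothetical helper: returns the longest prefix of `bytes` that is valid UTF-8.
fn valid_utf8_prefix(bytes: &[u8]) -> &str {
    match str::from_utf8(bytes) {
        Ok(s) => s,
        Err(e) => {
            // `valid_up_to` is the index of the first invalid byte, so the
            // slice before it is valid UTF-8 and this second parse succeeds.
            str::from_utf8(&bytes[..e.valid_up_to()]).unwrap_or("")
        }
    }
}

fn main() {
    // Two ASCII bytes, then a stray continuation byte (0x80).
    let mixed = [b'a', b'b', 0x80, b'c'];
    println!("{:?}", valid_utf8_prefix(&mixed));

    // 0xFF can never appear in UTF-8, so the valid prefix is empty.
    let bad = [0xFF, b'a'];
    println!("{:?}", valid_utf8_prefix(&bad));
}
```

Any byte in the range 0x80..=0xBF, or 0xFF, cannot start a UTF-8 sequence, so a uniformly random buffer hits an invalid position very quickly.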

This works fine for fuzzing: a fuzzer can make informed decisions and keep the unstructured buffer mostly valid UTF-8. Not applying complicated transformations to the data also makes it easier for the fuzzer to figure out how to mutate it effectively.

I'm not exactly sure how best to handle this kind of use case.

Yeah, the goal of this is fuzzing, so I'd rather not support the non-fuzz use case more if we lose out on fuzzing efficiency.

Agreed. Main use case is making fuzzers easy and efficient to work with. Don't want to hurt that use case.

OK, thank you, I understand. I solved it by generating valid UTF-8.
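The thread does not show how the valid UTF-8 was generated, so the following is only one possible sketch, not the reporter's actual code. It builds a seed buffer from random `char` values instead of random bytes, so every offset that falls on a character boundary is valid UTF-8. The `XorShift64` generator and the `valid_utf8_seed` function are hypothetical names, and the hand-rolled PRNG is there only to keep the example free of dependencies; the `rand` crate from the test program would work as well.

```rust
/// Tiny xorshift64 generator, used only to keep the sketch dependency-free.
/// The state must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u32(&mut self) -> u32 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        // Use the upper 32 bits of the new state.
        (x >> 32) as u32
    }
}

/// Hypothetical helper: builds at least `min_len` bytes of valid UTF-8.
fn valid_utf8_seed(rng: &mut XorShift64, min_len: usize) -> Vec<u8> {
    let mut out = String::with_capacity(min_len + 4);
    while out.len() < min_len {
        // Candidate code point in 0..=0x10FFFF. `char::from_u32` returns
        // `None` for surrogates, which are simply skipped.
        let candidate = rng.next_u32() % 0x11_0000;
        if let Some(c) = char::from_u32(candidate) {
            out.push(c);
        }
    }
    out.into_bytes()
}

fn main() {
    let mut rng = XorShift64(0x9E37_79B9_7F4A_7C15);
    let seed = valid_utf8_seed(&mut rng, 64);
    assert!(std::str::from_utf8(&seed).is_ok());
    println!("{} bytes of valid UTF-8", seed.len());
}
```

A buffer like this could be passed to `Unstructured::new` in place of the random bytes. How long the resulting strings are still depends on how the crate derives lengths from the buffer, so treat this as a starting point, not a guarantee.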