google / fuzzbench

FuzzBench - Fuzzer benchmarking as a service.

Home Page: https://google.github.io/fuzzbench/

Sampling Initial Seed Corpus and Analysis

dylanjwolff opened this issue

TO @jonathanmetzman @lszekeres
CC @mboehme @inferno-chromium

We have two related features, implemented on a private fork, that we'd like to integrate into FuzzBench. The first is the ability to sample from a larger pool of seeds so that each fuzzer receives a unique initial corpus per trial during a benchmarking run. The second is additional data analysis that gives some insight into how various aspects of the initial corpora and the programs under test might be affecting benchmarking outcomes.

The purpose of this issue is to establish the following:

  1. [Sampling] We currently have a script we've been using for local experiments that samples from e.g. a project's OSS-Fuzz corpus to generate random initial corpora. We then mount those in the Docker containers of the runners. We also kick off the first measurer cycle before launching the fuzzer process, to capture the initial coverage of the corpus. Are there other considerations or another approach we should take for adding this feature? (A rough sketch of the sampling step follows this list.)
  2. [Properties] Which properties would you consider to be interesting? We currently have
    • seed-corpus: initial coverage, number of seeds, average seed exec time, average seed size
    • program: size (and others). Anything else that you would like to look at?
  3. [UI/UX] What is the interface that you want to present to the users? For the seed sampling, probably additional field(s) in the YAML configuration file to select a sampling level / strategy? For the data presentation, we have produced several visualizations which show the relative impact of a particular property on the final ranking of a fuzzer or its coverage. We are happy to share these separately and would welcome any feedback you might have on where and how to present this data in a Fuzzbench report.
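For concreteness, here is a minimal sketch of what the per-trial sampling step could look like. The function names, the `sample_size` parameter, and the stats layout are illustrative assumptions, not existing FuzzBench APIs; initial coverage and average seed execution time would come from the first measurer cycle rather than from this helper.

```python
import os
import random
import shutil


def sample_seed_corpus(pool_dir, trial_corpus_dir, sample_size, rng_seed):
    """Copy a random subset of a larger seed pool into a per-trial corpus dir."""
    rng = random.Random(rng_seed)  # one RNG seed per trial keeps trials reproducible
    pool = [os.path.join(pool_dir, name) for name in os.listdir(pool_dir)
            if os.path.isfile(os.path.join(pool_dir, name))]
    sampled = rng.sample(pool, min(sample_size, len(pool)))

    os.makedirs(trial_corpus_dir, exist_ok=True)
    for path in sampled:
        shutil.copy(path, trial_corpus_dir)
    return sampled


def corpus_stats(corpus_dir):
    """Static corpus properties that can be computed without running the target."""
    sizes = [os.path.getsize(os.path.join(corpus_dir, name))
             for name in os.listdir(corpus_dir)]
    return {
        'num_seeds': len(sizes),
        'avg_seed_size_bytes': sum(sizes) / len(sizes) if sizes else 0,
    }
```

The sampled directory would then be mounted into the runner container, and `sample_size` (or a named sampling strategy) would be read from the new YAML fields proposed in item 3.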

Thanks!

The key idea is essentially: instead of saying fuzzer A is the top fuzzer in general, we could say that fuzzer A is the top fuzzer under these circumstances, while fuzzer B is the top fuzzer under those other circumstances. For any given benchmark run, a user could essentially use a slider on those benchmark properties to see how the fuzzer ranking changes.
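As a rough illustration of the "slider" idea, the snippet below bins trials by one corpus property and re-ranks the fuzzers within each bin. The column names (`fuzzer`, `initial_coverage`, `edges_covered`) are assumptions about what the per-trial data might contain, not the actual report schema.

```python
import pandas as pd

# Hypothetical per-trial data, one row per trial.
trials = pd.read_csv('trials.csv')

# Bin trials by the chosen corpus property (here: initial coverage of the
# sampled corpus).
trials['coverage_bin'] = pd.qcut(trials['initial_coverage'], q=3,
                                 labels=['low', 'mid', 'high']).astype(str)

# Median final coverage per (bin, fuzzer), then rank fuzzers within each bin.
median_cov = trials.groupby(['coverage_bin', 'fuzzer'])['edges_covered'].median()
ranking_per_bin = median_cov.groupby(level='coverage_bin').rank(ascending=False,
                                                                method='min')
print(ranking_per_bin)
```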

Sorry for the delay; I've had a bit of a crazy schedule with my holidays.
I personally think the second might be more interesting, and it seems less of a maintenance burden (the analysis just gets done at the end, right?). But I'm interested in seeing both.

The UI/UX question is tricky; I don't have any answers yet, so let me think about it more. I'm happy to see your samples as well.

> 1. [Properties] Which properties would you consider to be interesting? We currently have
>    • seed-corpus: initial coverage, number of seeds, average seed exec time, average seed size
>    • program: size (and others). Anything else that you would like to look at?

Would it make sense to compare performance after:

  1. tuning the hyper-parameters assumed by the fuzzers (e.g., maximum input length), or
  2. changing the default heuristic used by the fuzzers (e.g., libFuzzer can try to generate small inputs first)?

Also, for fuzzers that can take an input keyword dictionary, maybe we could sample the items in the dictionary in the same way as we sample the initial corpus?
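A minimal sketch of how that could look, assuming the usual AFL/libFuzzer dictionary format (one entry per line, `#` comments); the function name and `fraction` parameter are illustrative only:

```python
import random


def sample_dictionary(dict_path, out_path, fraction, rng_seed):
    """Write a random subset of a fuzzing dictionary to out_path."""
    rng = random.Random(rng_seed)
    with open(dict_path) as f:
        # Keep real entries; drop blank lines and '#' comment lines.
        entries = [line for line in f
                   if line.strip() and not line.lstrip().startswith('#')]
    k = min(len(entries), max(1, int(len(entries) * fraction)))
    with open(out_path, 'w') as f:
        f.writelines(rng.sample(entries, k))
```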

> Would it make sense to compare performance after:
>
>   1. tuning the hyper-parameters assumed by the fuzzers (e.g., maximum input length), or
>   2. changing the default heuristic used by the fuzzers (e.g., libFuzzer can try to generate small inputs first)?
>
> Also, for fuzzers that can take an input keyword dictionary, maybe we could sample the items in the dictionary in the same way as we sample the initial corpus?

Absolutely! However, this might be more difficult to implement: you'll need to expose some API that the fuzzer developer can use to specify what to vary during benchmarking.
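One way to picture such an API, purely as a sketch: each fuzzer's `fuzzer.py` could optionally declare which parameters it is willing to have varied, and the runner would pick one value per parameter per trial. Nothing below exists in FuzzBench today; the hook name, the parameter space format, and the environment-variable handoff are all hypothetical.

```python
import os
import random


# Hypothetical hook in a fuzzer's fuzzer.py -- not an existing FuzzBench API.
def get_tunable_parameters():
    """Declare the parameters this fuzzer is willing to have varied per trial."""
    return {
        'max_len': [1024, 4096, 65536],  # maximum input length
        'prefer_small': [0, 1],          # whether to bias toward small inputs first
    }


# Hypothetical runner side: choose one value per parameter for this trial and
# hand it to the existing fuzz() entry point, e.g. via environment variables.
def choose_parameters(param_space, rng_seed):
    rng = random.Random(rng_seed)
    return {name: rng.choice(values) for name, values in param_space.items()}


if __name__ == '__main__':
    for name, value in choose_parameters(get_tunable_parameters(), rng_seed=0).items():
        os.environ['FUZZ_PARAM_' + name.upper()] = str(value)
```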

> I personally think the second might be more interesting, and it seems less of a maintenance burden (the analysis just gets done at the end, right?). But I'm interested in seeing both.

Yup, the analysis portion is just some post-processing that can be run on something similar to the final report data CSV file. But without corpus sampling, you could only look at the effects of program properties, as the corpus would be constant across trials.
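For example, the kind of post-processing meant here could look like the sketch below, assuming the per-trial CSV has been extended with the sampled-corpus properties (the column names are assumptions, not the actual report schema):

```python
import pandas as pd

# Assumed columns: fuzzer, benchmark, initial_coverage, num_seeds,
# avg_seed_size, edges_covered (final coverage of the trial).
data = pd.read_csv('final_report_data.csv')

properties = ['initial_coverage', 'num_seeds', 'avg_seed_size']

# Spearman correlation between each corpus property and final coverage,
# computed separately for every fuzzer.
corr = (data.groupby('fuzzer')[properties + ['edges_covered']]
            .corr(method='spearman')['edges_covered']
            .unstack()
            .drop(columns='edges_covered'))
print(corr.round(2))
```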

Adding on to @mboehme's and @Alan32Liu's comments about fuzzing parameters: it's a very interesting idea, but I agree the implementation (and maintenance) effort needed to get many different fuzzers to present a similar interface for various parameters is probably quite high. Dictionaries would be more doable, as they are at least already a consistent "interface" across fuzzers.

Please feel free to let me know if there is anything that I could help with : )