Extending bug benchmarks for diverse oracles

addisoncrump opened this issue a year ago

As discussed at FUZZING'23, we are looking to extend FuzzBench to support bug detection with diverse oracles. Namely, we are interested in:

- differential oracles (crash when the outputs of two targets differ)
- property-based oracles (crash when some property of the output with respect to the input is not upheld)
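To make the two oracle types concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: `reference_hex` and `candidate_hex` stand in for two real targets, the missing zero-padding is a planted bug, and `harness` is not FuzzBench code.

```python
def reference_hex(data: bytes) -> str:
    """Stand-in for the first target (hypothetical): trusted hex encoder."""
    return data.hex()


def candidate_hex(data: bytes) -> str:
    """Stand-in for the second target (hypothetical).

    Planted bug: bytes below 0x10 are emitted without zero-padding.
    """
    return "".join(f"{b:x}" for b in data)


def differential_oracle(data: bytes) -> None:
    # Differential oracle: fail when the two targets disagree on the same input.
    assert candidate_hex(data) == reference_hex(data), "outputs differ"


def property_oracle(data: bytes) -> None:
    # Property-based oracle: the output must relate to the input in a known
    # way. Here: hex output is exactly twice as long as the input.
    out = candidate_hex(data)
    assert len(out) == 2 * len(data), "length property violated"


def harness(data: bytes) -> None:
    # Both assertions run only after the targets have finished executing.
    differential_oracle(data)
    property_oracle(data)
```

Calling `harness(b"\xab\xcd")` passes, while `harness(b"\x01")` raises an `AssertionError`, which a fuzzer would report as a crash.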

In both of these cases, we generally have only one or a few assertions that are evaluated after the harness executes, and therefore bug deduplication by stacktrace is not sufficient for these oracles. Worse yet, we can have multiple bugs triggered by a single crashing testcase.
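The following sketch illustrates why stacktrace-based deduplication collapses distinct bugs for such harnesses. It is a toy model under stated assumptions: `target` contains two planted, unrelated bugs, and `crash_signature` is a hypothetical stand-in for a stack-hash deduplicator, not the one FuzzBench uses.

```python
import traceback


def target(data: bytes) -> bytes:
    """Hypothetical target that should return its input unchanged."""
    out = bytearray(data)
    if data.startswith(b"A"):
        out[0] ^= 1   # planted bug 1 (hypothetical)
    if data.endswith(b"Z"):
        out[-1] ^= 1  # planted bug 2 (hypothetical)
    return bytes(out)


def harness(data: bytes) -> None:
    out = target(data)
    # The only failure point: a single assertion after the target returns.
    assert out == data, "identity property violated"


def crash_signature(data: bytes):
    """Return the (function, line) frames of the failure, or None if no failure."""
    try:
        harness(data)
    except AssertionError as exc:
        frames = traceback.extract_tb(exc.__traceback__)
        return tuple((frame.name, frame.lineno) for frame in frames)
    return None


# Different root causes, same signature: both fail at the same assertion.
same = crash_signature(b"Axx") == crash_signature(b"xxZ")
```

Here `same` is `True` even though the two inputs exercise different bugs, and an input such as `b"AxZ"` triggers both bugs while still producing that one signature.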

To accommodate this, we want to extend bug benchmarks to support harnesses with these special oracles, falling back to stacktrace deduplication otherwise. In order of preference, we've considered:

1. bug detection by reverse-applying patches
   - only works for known bugs
   - can be "fixed" after the experiment by manual inspection of unknown bugs
   - longer build times for the evaluator
2. bug detection by git revision bisection
   - works for unknown bugs
   - doesn't detect the presence of multiple bugs in a single crash
   - doesn't work for non-git targets
   - potentially slow
3. bug detection by root-cause clustering, as in IGOR
   - imprecise
   - inconsistent reporting over the run due to re-clustering
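As a rough illustration of the first option, the sketch below shows one possible reading of patch-based detection: build one variant of the target per known bug, with only that bug's fix reverse-applied, and attribute a failing testcase to every variant whose oracle fails. All names here (`make_target`, `oracle_fails`, `attribute`, the bug IDs) are hypothetical, and `make_target` only simulates what would be a real rebuild of the target.

```python
from typing import Callable, FrozenSet, Iterable, Set

Target = Callable[[bytes], bytes]


def make_target(reverted: FrozenSet[str]) -> Target:
    """Simulate building the target with the given fix patches reverse-applied.

    Hypothetical stand-in: a real evaluator would rebuild the target here.
    """
    def target(data: bytes) -> bytes:
        out = bytearray(data)
        if "bug-1" in reverted and data.startswith(b"A"):
            out[0] ^= 1   # bug-1 reintroduced
        if "bug-2" in reverted and data.endswith(b"Z"):
            out[-1] ^= 1  # bug-2 reintroduced
        return bytes(out)
    return target


def oracle_fails(target: Target, data: bytes) -> bool:
    # Property-based oracle: the toy target must return its input unchanged.
    return target(data) != data


def attribute(data: bytes, known_bugs: Iterable[str]) -> Set[str]:
    """Return every known bug that this testcase triggers on its own."""
    return {
        bug
        for bug in known_bugs
        if oracle_fails(make_target(frozenset({bug})), data)
    }
```

With `known_bugs = ["bug-1", "bug-2"]`, `attribute(b"AxZ", known_bugs)` returns both IDs, so a single testcase can be credited with multiple bugs. The sketch also makes the listed trade-offs visible: it needs one build per known bug, and a testcase that fails on the benchmark build but is attributed to no known bug would still need manual inspection.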

There may be other deduplication strategies that we have not yet considered.

Before continuing with these options, we wanted to open this up for feedback and ensure that the changes can be upstreamed for future testing with fuzzers other than the one we're currently evaluating.

jonathanmetzman · Answer 1 · Fri Sep 15 2023 21:24:16 GMT+0800 (China Standard Time)

> In both of these cases, we generally have only one or a few assertions that are evaluated after the harness executes, and therefore bug deduplication by stacktrace is not sufficient for these oracles. Worse yet, we can have multiple bugs triggered by a single crashing testcase.

I think we've basically decided to go with one crash per target as a way of deduping at this point. So deduping shouldn't be a problem.

Addison Crump · Answer 2 · Fri Sep 15 2023 21:41:42 GMT+0800 (China Standard Time)

I see. So it's a race to when the crash is first discovered? What happens if we haven't successfully enumerated all crashes?