google / fuzzbench

FuzzBench - Fuzzer benchmarking as a service.

Home Page: https://google.github.io/fuzzbench/


Coverage measurements invalidated by non-testcase fuzzer outputs

Lukas-Dresel opened this issue

FuzzBench's approach to coverage measurement simply runs coverage collection on every file in the fuzzer's corpus directory. However, most fuzzers (e.g., all of the AFL++ variants) place their entire output directory there, not just testcases. This fundamentally skews the coverage numbers for those fuzzers on benchmarks that parse related file formats.

A simple case of this can be seen in my FuzzBench experiment. The large boost my tools achieve at the beginning on bloaty_fuzz_target occurs because I include a copy of all of the compiled binaries in a hidden .run_info directory so I can reproduce the experiment locally.

The instrumented files are ELF files, and bloaty_fuzz_target parses ELF files. A sneakier variant of the same issue arises when a fuzzer crashes and dumps a corefile into this directory so the maintainers can diagnose the crash (as, for example, symsan and aflplusplus do). The corefile is itself a valid ELF file, so it is counted towards the fuzzer's coverage, and a crashing fuzzer can benefit significantly from it, at least in the case of bloaty. I have observed this at least once for SymSan. Other scenarios are possible too, for example a fuzzer dumping status as JSON files while fuzzing jsoncpp.

At the moment, however, it is simply not possible to get information out of the experiment without having it treated as a testcase (other than dumping it to stdout, which is a bad idea). As a solution I propose one of the following:

  1. Allow for the exclusion of hidden files (prefixed with a .), and modify the fuzzers to run inside a .cwd directory or something of that sort
  2. Add a way for fuzzers to specify globs or paths that contain real testcases, and measure only those
  3. Split out a separate file-collection mechanism reserved for testcases, so that maintainers can retrieve debug information and logs without having them interpreted as testcases
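Option 2 could look something like the sketch below. The glob patterns and the helper name are purely illustrative assumptions on my part, not an existing FuzzBench API; the idea is just that each fuzzer declares which relative paths inside its output directory hold real testcases.

```python
import fnmatch

# Hypothetical per-fuzzer declaration of where real testcases live inside
# the output directory. These patterns are illustrative (AFL-style layout),
# not part of any actual FuzzBench configuration.
TESTCASE_GLOBS = ["queue/*", "crashes/id:*"]

def is_testcase(relpath, globs=TESTCASE_GLOBS):
    """Return True if a path (relative to the corpus directory) matches
    one of the fuzzer-declared testcase globs."""
    return any(fnmatch.fnmatch(relpath, g) for g in globs)
```

With such an allowlist, a corefile, a .run_info binary, or a fuzzer_stats file would simply never match and so would never be measured.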

I implemented 1. in my fork because it is simple to do, and none of the fuzzers I investigated store crashes or testcases in a hidden directory. This also improves the performance of the measurements, since unrelated files such as AFL++'s .redundant_edges and .auto_extras are no longer included (and shouldn't have been anyway).
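For reference, the hidden-file exclusion in option 1 amounts to a filter along these lines (a minimal sketch of the idea, not the actual patch in my fork):

```python
import os

def corpus_files(corpus_dir):
    """Yield paths to measurable testcases under corpus_dir, skipping any
    file or directory whose name starts with '.' (hidden files)."""
    for root, dirs, files in os.walk(corpus_dir):
        # Prune hidden directories in place so os.walk never descends
        # into e.g. .run_info or .cwd.
        dirs[:] = [d for d in dirs if not d.startswith(".")]
        for name in files:
            if not name.startswith("."):
                yield os.path.join(root, name)
```

Because the pruning happens before descent, anything a fuzzer stashes under a hidden directory never even gets stat'ed, which is where the measurement speedup comes from.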

Thank you for writing this very thoughtful and well researched response.

There are some special cases you mention, but I'm not too worried about them. We're testing automated bug discovery, and these quirks can help with that in real life (in ClusterFuzz we don't currently use the whole output directory, but from what you point out we would benefit from doing so). I think keeping the FuzzBench API simple is worth the inaccuracy this may bring. I'm not sure it's worth doing anything except removing bloaty.

@jonathanmetzman to clarify, this does not actually improve fuzzing performance: the fuzzers don't pick those files up. They are quite careful about what they synchronize in order to keep performance optimal; AFL, for example, internally tracks the id of the input it expects to sync next and doesn't look for anything else. The extra files only skew the measurement in the plot. So this won't improve the fuzzing runs, because the resulting inputs aren't picked up by the fuzzers in any way (as they shouldn't be).

My patch is very simple, behaves consistently with common conventions around hidden files, and should not require any modifications for the common fuzzers I'm aware of. IMHO this should be fixed, because the current setup does not measure a fuzzer's exploratory power at all, merely the diversity of the auxiliary and logging outputs it produces (basically, how well it can auto-generate high-quality seeds of common file formats completely by accident).

E.g. if a fuzzer plots an image showing its coverage graph over time, that fuzzer might now get really good coverage on the corresponding image parser, even if it crashed early and never really produced any outputs at all.

That example also shows that removing bloaty does not address the issue, because the fuzzed formats are quite common.