google / fuzzbench

FuzzBench - Fuzzer benchmarking as a service.

Home Page: https://google.github.io/fuzzbench/


Corpus collection times out

Lukas-Dresel opened this issue

@chenju2k6 @jonathanmetzman @alan32liu

In #1827 the experiment reports only NaN for a variety of fuzzers, with a notable bias towards SymSan and my tool.
I was able to reproduce these issues locally by building a set of test fuzzers to check suspicions I had about what could be causing them.

The root cause for both of the tools in this test seems to be the large logs they print to stdout/stderr. Unlike the corpus archives, these results are collected and copied/rsynced around uncompressed, which causes dramatic slowdowns.

However, during my testing I identified another cause that can be problematic for some fuzzers: large log files with append semantics. My fuzzer has a few of these, where I continuously append the current tick's information.

Unfortunately, because FuzzBench performs its differential backups on a file-by-file basis (if a file changed, it is copied in full again), these files also take a long time to copy, and the copies get slower and slower as the experiment goes on. This rules out the simple fix of just adding a 2>/out/stderr 1>/out/corpus/stdout redirect to the offending fuzzers, because the archiving step then hits the same issue. At least in that case the data is compressed before being copied around, however.
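To illustrate the cost, here is a back-of-the-envelope sketch (the numbers are made up for illustration, not measured from FuzzBench): under file-granularity differential backup, an append-only log is re-copied in full on every cycle in which it grew, so the total bytes copied grow quadratically over the experiment.

```shell
# Simulate backup cycles of an append-only log that grows by $chunk
# bytes per cycle. A file-level differential backup sees the file
# changed and re-copies it in full every time.
chunk=10; cycles=5; size=0; total=0
for i in $(seq 1 "$cycles"); do
    size=$((size + chunk))    # the log grew again this cycle
    total=$((total + size))   # the backup re-copies the whole file
done
echo "$total"   # 10+20+30+40+50 = 150 bytes copied, vs 50 for pure deltas
```

With rotation, only the one small file that changed since the last cycle needs to be re-copied, so the per-cycle cost stays bounded.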

As a solution, I have now deployed a multilog-based logging scheme in which logs are rotated to new files once they exceed a certain size. This works much better with the file-by-file differential copying and keeps each file at a manageable size.

I have enabled this for SymSan and my tool; would you consider including it upstream? The script is rather simple: it would only add a daemontools dependency to the base image and copy in a run_with_multilog.sh script that fuzzers can invoke, e.g. /run_with_multilog.sh $OUT/.log/ <cmd>.
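For reference, the core of such a wrapper is small. The following is a pure-shell sketch of the rotation idea only, not the actual run_with_multilog.sh (which uses daemontools' multilog); the function name, the @<timestamp>/current file layout, and the GNU userland (head -n -N, date +%s%N) are assumptions here:

```shell
# Sketch: read log lines from stdin, append them to "$logdir/current",
# and rotate to a timestamped file once it exceeds $maxbytes, keeping
# at most $keep rotated files (mimicking multilog's s/n options).
rotate_logs() {
    logdir=$1
    maxbytes=${2:-1048576}   # rotate once 'current' exceeds this size
    keep=${3:-10}            # number of rotated files to keep
    mkdir -p "$logdir"
    : > "$logdir/current"
    while IFS= read -r line; do
        printf '%s\n' "$line" >> "$logdir/current"
        if [ "$(wc -c < "$logdir/current")" -gt "$maxbytes" ]; then
            # Rename the full file and start a fresh one.
            mv "$logdir/current" "$logdir/@$(date +%s%N)"
            : > "$logdir/current"
            # Drop the oldest rotated files beyond $keep (GNU head).
            ls "$logdir"/@* 2>/dev/null | head -n -"$keep" | xargs -r rm -f
        fi
    done
}
```

A fuzzer would then pipe its output through it, e.g. my_fuzzer 2>&1 | rotate_logs "$OUT/.log" 1048576 10, so the differential backup only ever re-copies the small current file plus at most one newly rotated file per cycle.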

TL;DR: FuzzBench handles large logs and large append-only files very badly, and this affects SymSan's and my results. Either the README should note that this causes problems, or a solution like my run_with_multilog.sh scheme could be deployed in the infrastructure itself.

Lastly, for future insight it might be good to log the time taken for measurement, corpus copying, report generation, etc., so such bottlenecks can be found. At the moment the only way to notice is that the wait period in runner-log-xxxx.txt in the dispatcher container starts going negative. Maybe a Prometheus + Grafana setup showing the time taken by each component of the current experiment would let this be caught early, while the experiment is still running.
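Even before full Prometheus/Grafana plumbing, a tiny wrapper that records per-stage wall-clock time would surface such bottlenecks. A hypothetical sketch (the stage name, helper name, and log path are all made up, not part of FuzzBench):

```shell
# Hypothetical helper: run one pipeline stage and append its wall-clock
# duration to a timing log that a dashboard (or just grep) can inspect.
timed() {
    stage=$1; shift
    start=$(date +%s)
    "$@"; rc=$?
    printf '%s %ss\n' "$stage" "$(( $(date +%s) - start ))" >> ./stage-timings.log
    return $rc
}

# Example: time a stand-in for the corpus-copy step.
timed corpus_copy sleep 1
```

Wrapping the measurement, corpus-copy, and report-generation calls this way would make a stage that starts taking longer than the tick interval immediately visible.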

It should be noted that this causes the affected fuzzers to perform either better or worse depending on the length of the experiment. Time-to-coverage will appear to improve, because corpora from earlier ticks end up being measured at increasingly later times; however, the longer the experiment runs, the higher the chance they won't report results at all.

@Lukas-Dresel Do you mean that SymSan is impacted negatively?

@chenju2k6 Yes, because of this it does not report results for about half the benchmarks at the end of my experiments. But it is possible that it performs better, not in the final coverage numbers but in the shape of the curve on the way there, for the benchmarks where it is still reporting results.

Btw, even when it is reporting results at the end, it is usually only for 1-5 out of 20 instances because of this issue, so it is hard to say whether the results improve; the instances that drop out could be biased towards either the good or the bad ones.