trials not launching in local experiment

Question

dylanjwolff opened this issue a year ago

I'm running into an issue running local experiments: the dispatcher fails to launch any of the individual trials. Even after waiting for hours, no trials are launched, though the logs show the system is still checking for coverage data in the measurement loop.

This happens every time I try to run a local experiment.

I've checked /tmp in the dispatcher container during one of these failed runs, and there are no logs from any of the runner startup scripts, nor do the runner containers appear to ever start. However, if I then manually run the startup script in the dispatcher container (e.g. docker exec -it dispatcher-container /bin/bash, then /tmp/r-startup-script3.sh inside that shell), the trial / runner container starts up with no issues.
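
A minimal sketch of the manual check described above, written as a small Python helper that only builds and prints the docker commands. The container name and the script path come from the example in this post; the helper names and the r-startup-script*.sh glob are assumptions made for illustration and are not part of FuzzBench.

```python
import shlex


def exec_in_container(container, command):
    """Build the `docker exec` argv that runs `command` through bash in `container`."""
    return ["docker", "exec", container, "/bin/bash", "-c", command]


def list_startup_scripts_cmd(container="dispatcher-container"):
    # Assumption: runner startup scripts match /tmp/r-startup-script*.sh,
    # generalized from the single file name mentioned in the post.
    return exec_in_container(container, "ls -l /tmp/r-startup-script*.sh")


def run_startup_script_cmd(script, container="dispatcher-container"):
    # Quote the path so an unusual file name cannot break the bash command.
    return exec_in_container(container, "/bin/bash " + shlex.quote(script))


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    for cmd in (list_startup_scripts_cmd(),
                run_startup_script_cmd("/tmp/r-startup-script3.sh")):
        print(shlex.join(cmd))
```

To actually execute one of the commands, pass the returned list to subprocess.run on a machine where the dispatcher container is up.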

I've tried adding some logging to the Python code that starts the runners, but everything looked fine from that end and I was not able to figure out the cause of the issue.

I've been able to reproduce this issue on multiple machines running recent versions of Ubuntu. The only changes I made to master were to add rsync to the relevant Dockerfiles to work around #1593.

OS: Ubuntu (e.g. 22.04)
Docker version: e.g. 20.10.22
Commit: aa6ddd0
Reproduction:

1. add rsync to Dockerfiles;
2. make;
3. docker build the dispatcher image;
4. run an experiment with a basic config, e.g. experiment-config.yaml.txt, using a command such as exp.sh.txt

Any insight would be much appreciated! And I'd be happy to provide logs / additional information as needed.

Thanks!

jonathanmetzman · Answer 1 · Wed Jan 04 2023 05:56:09 GMT+0800 (China Standard Time)

So I saw an issue similar to #1223 that prevented my experiment from starting. Maybe this is the fix? Anyway trying to repro.

Dylan J. Wolff · Answer 2 · Thu Jan 05 2023 19:57:24 GMT+0800 (China Standard Time)

I tried installing a bunch of Qt-related packages (qt5-default is no longer available in Ubuntu 22.04) and there was no change in behavior. It seems this is a different problem anyway, as I think #1223 was erroring out even during make presubmit, which is not the case for me.

Any luck reproducing the issue? Let me know if there's any other information I can provide that would be helpful.

Dongge Liu · Answer 3 · Mon Jan 09 2023 11:25:13 GMT+0800 (China Standard Time)

Thanks @dylanjwolff, Jonathan's fix #1595 seems to work for me, though.

Is there any chance that you forgot to delete the old local image gcr.io/fuzzbench/dispatcher-image, so that FuzzBench did not automatically pull the latest version?
I was able to reproduce the error with the old image, but could successfully start local experiments after deleting it and letting FuzzBench automatically pull the new one.
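
A minimal sketch of the cleanup suggested above, written as a Python helper that only builds and prints the docker commands. The image and container names are the ones that appear in this thread; the helper name is hypothetical, and removing the old container first is an assumption based on Docker refusing to delete an image that a container still uses.

```python
import shlex

DISPATCHER_IMAGE = "gcr.io/fuzzbench/dispatcher-image"  # image name from this thread


def stale_image_cleanup_cmds(image=DISPATCHER_IMAGE,
                             container="dispatcher-container"):
    """Return docker argv lists for removing a stale local dispatcher image."""
    return [
        # Prints an image ID if a local copy exists, nothing otherwise.
        ["docker", "images", "--quiet", image],
        # Assumption: a leftover container from the old image blocks `docker rmi`.
        ["docker", "rm", "--force", container],
        # Delete the local image so the next experiment pulls a fresh one.
        ["docker", "rmi", image],
    ]


if __name__ == "__main__":
    # Print the commands rather than running them; review before executing.
    for cmd in stale_image_cleanup_cmds():
        print(shlex.join(cmd))
```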

Please let me know if that works, thanks : )

Han Zheng · Answer 4 · Mon Jan 09 2023 13:51:21 GMT+0800 (China Standard Time)

Hello, I have the same issue. I deleted all the old images and started again. The log changed as follows, but the experiment still didn't start:

INFO:root:Starting experiment.
INFO:root:Building measurers.
INFO:root:Concurrent builds: 30.
INFO:root:Building using (<function build_measurer at 0x7f41c684c8b0>): [('bloaty_fuzz_target',)]
INFO:root:Building measurer for benchmark: bloaty_fuzz_target.
INFO:root:Done building measurer for benchmark: bloaty_fuzz_target.
INFO:root:Build successes: [('bloaty_fuzz_target',)]
INFO:root:Done building measurers.
INFO:root:Building all fuzzer benchmarks.
INFO:root:Concurrent builds: 30.
INFO:root:Building using (<function build_fuzzer_benchmark at 0x7f41c684caf0>): [('afl', 'bloaty_fuzz_target')]
INFO:root:Building benchmark: bloaty_fuzz_target, fuzzer: afl.
INFO:root:Done building benchmark: bloaty_fuzz_target, fuzzer: afl.
INFO:root:Build successes: [('afl', 'bloaty_fuzz_target')]
INFO:root:Done building fuzzer benchmarks.
INFO:root:Starting scheduler.
INFO:root:Finding trials to schedule.
INFO:root:Starting trials.
INFO:root:Start trial 1.
INFO:root:Started 1 trials.
INFO:root:Start measuring.
INFO:root:Start measure_loop.
INFO:root:Measuring all trials.
INFO:root:Measuring all trials.
INFO:root:Measuring all trials.
INFO:root:Measuring all trials.
INFO:root:Measuring all trials.
INFO:root:Measuring all trials.
INFO:root:Measuring all trials.
INFO:root:Measuring all trials.
INFO:root:Measuring all trials.
INFO:root:In progress: True.
INFO:root:Is merging with nonprivate: False.
INFO:root:Reading experiment data from db.
INFO:root:Done reading experiment data from db.
WARNING:root:No snapshot data.
INFO:root:Measuring all trials.
INFO:root:Measuring all trials.
INFO:root:Measuring all trials.
INFO:root:Measuring all trials.
INFO:root:Measuring all trials.

The current status of the Docker containers is:

CONTAINER ID   IMAGE                                                   COMMAND                  CREATED         STATUS                     PORTS                                                      NAMES
632bb0fa02f4   gcr.io/fuzzbench/builders/coverage/bloaty_fuzz_target   "/bin/bash -c '(cd /…"   5 minutes ago   Exited (0) 4 minutes ago                                                              cool_montalcini
723b34bae547   gcr.io/fuzzbench/dispatcher-image                       "/bin/bash -c 'rsync…"   7 minutes ago   Up 7 minutes                                                                          dispatcher-container

I didn't find any running afl instance or container. Could anyone give me an idea of what might have gone wrong? Thanks. @alan32liu

My environment: Debian 11, Docker 20.10.5, FuzzBench f7ab64d

Dongge Liu · Answer 5 · Mon Jan 09 2023 14:06:27 GMT+0800 (China Standard Time)

Hi @kdsjZh, I will need a bit more information to debug this:

(1) Did you manage to get any fuzzer logs?
(2) Is that the end of all the output?

Thanks.

Han Zheng · Answer 6 · Mon Jan 09 2023 14:14:05 GMT+0800 (China Standard Time)

Thanks for the quick reply @alan32liu
(1) No. I use FuzzBench's AFL, which doesn't produce any log by default, and it didn't start. (When I run it manually, I can find one afl-fuzz instance and a corresponding container from gcr.io/fuzzbench/runners/afl/bloaty_fuzz_target, but here I found neither.)
(2) It repeatedly outputs something like the following.

INFO:root:Measuring all trials.
INFO:root:Measuring all trials.
INFO:root:Measuring all trials.
INFO:root:Measuring all trials.
INFO:root:Measuring all trials.
INFO:root:Measuring all trials.
INFO:root:Measuring all trials.
INFO:root:In progress: True.
INFO:root:Is merging with nonprivate: False.
INFO:root:Reading experiment data from db.
INFO:root:Done reading experiment data from db.
WARNING:root:No snapshot data.
INFO:root:Measuring all trials.
INFO:root:Finding trials to schedule.
INFO:root:Starting trials.
INFO:root:Started 0 trials.
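
A minimal sketch for summarizing a dispatcher log like the one above, so the "trials reported as started but nothing measured" pattern is easy to spot. It relies only on the message strings quoted in this thread; the function name and the returned dictionary keys are hypothetical.

```python
import re

# Matches scheduler lines such as "INFO:root:Started 1 trials."
STARTED_RE = re.compile(r"Started (\d+) trials\.")


def summarize_dispatcher_log(lines):
    """Count scheduler and measurer events in an iterable of log lines."""
    started_counts = []
    measure_passes = 0
    no_snapshot_warnings = 0
    for line in lines:
        match = STARTED_RE.search(line)
        if match:
            started_counts.append(int(match.group(1)))
        elif "Measuring all trials." in line:
            measure_passes += 1
        elif "No snapshot data." in line:
            no_snapshot_warnings += 1
    return {
        "started_counts": started_counts,
        "measure_passes": measure_passes,
        "no_snapshot_warnings": no_snapshot_warnings,
    }


SAMPLE = """\
INFO:root:Start trial 1.
INFO:root:Started 1 trials.
INFO:root:Measuring all trials.
INFO:root:Measuring all trials.
WARNING:root:No snapshot data.
INFO:root:Finding trials to schedule.
INFO:root:Starting trials.
INFO:root:Started 0 trials.
"""

if __name__ == "__main__":
    print(summarize_dispatcher_log(SAMPLE.splitlines()))
```

A non-zero entry in started_counts combined with repeated no-snapshot warnings matches the symptom reported here: the scheduler believes a trial was started, yet no runner ever produces data.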

Dylan J. Wolff · Answer 7 · Mon Jan 09 2023 14:17:16 GMT+0800 (China Standard Time)

Thanks for looking into this @alan32liu!

I can confirm that -- at least for me -- I am still seeing the same issue, even starting from a clean slate (no docker images locally, fresh venv, latest version of master). And it seems to me @kdsjZh has this problem as well.

Unfortunately, part of the problem is that there are no errors in the dispatcher logs that I can see, and the individual runners don't produce any log files at all, since it appears they are never started.

If I run the scripts in /tmp manually, the runners will launch properly.

Dongge Liu · Answer 8 · Mon Jan 09 2023 14:34:11 GMT+0800 (China Standard Time)

Thanks, both!
I wonder if this is due to a mismatch in the startup script name.
I will investigate this tomorrow (if not tonight) and keep you updated.