google / fuzzbench

FuzzBench - Fuzzer benchmarking as a service.

Home Page: https://google.github.io/fuzzbench/

Build process is killed unexpectedly.

DonggeLiu opened this issue

A process that builds Centipede with benchmark freetype2-2017 was killed unexpectedly.
The build log file did not show any error before being cut off.
The gcloud log did not show any useful info except that the process died with <Signals.SIGKILL: 9>.

The same error also happened on other benchmarks, e.g. harfbuzz.

make test-run-centipede-freetype2-2017 works perfectly.

Other info:
These processes were part of a locally-launched cloud experiment 2023-01-27-test-local-launch, with which we want to test whether the latest framework changes affect experiments launched locally or requested via requests.yaml.

Could this be due to the OOM killer?

Could this be due to the OOM killer?

I do not recall any change in the memory limit or related code since Centipede's last successful experiment on the same benchmark freetype2-2017 on 18-01, though.

BTW, this error also affects other fuzzers, e.g. afl++.

Can we see dmesg logs on these machines, or any memory usage stats via the GCP console for any indication of OOM?

Can we see dmesg logs on these machines, or any memory usage stats via the GCP console for any indication of OOM?

Thanks!
I did not notice any from the gcloud log.
Would you know of any way to access the gcloud build instances? They are built with gcloud builds submit /work/src --config=/tmp/tmpg6woqb8o --timeout=46800s --worker-pool=projects/fuzzbench/locations/us-central1/workerPools/buildpool and are not in the list of VM instances. I did not find a way to access them in the docs either.
Otherwise, we can probably add some debug logging in the code to print memory info.
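
A minimal sketch of the kind of debug logging suggested above, assuming psutil is available in the builder image; the helper name and where it gets called from are hypothetical, not part of FuzzBench:

```python
# Hypothetical helper: log memory usage around a build step to help spot
# OOM pressure before a SIGKILL. Assumes psutil is installed in the image.
import logging
import psutil


def log_memory_usage(stage):
    """Log system memory stats into the build log."""
    mem = psutil.virtual_memory()
    logging.info('[%s] memory: total=%.1f GiB, available=%.1f GiB, used=%.0f%%',
                 stage, mem.total / 2**30, mem.available / 2**30, mem.percent)


# Example: call around the expensive compile step.
log_memory_usage('before-centipede-build')
```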

Looking around, I'm not sure if there's a way to view system logs in a Cloud Build worker pool.

Is this happening consistently? If so, it might be worth temporarily changing the instance type to something with a bit more memory (e.g. from "e2-highcpu-32" to "e2-standard-32") to see if it fixes things.
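
A sketch of what that temporary switch could look like, assuming the builds run in the buildpool private worker pool quoted earlier; the gcloud worker-pools flags should be double-checked against the Cloud Build docs before use:

```python
# Hypothetical: bump the Cloud Build private pool to a machine type with more
# memory per vCPU, then restore it after the experiment. Pool name, project and
# region are taken from the --worker-pool path quoted earlier in this thread.
import subprocess


def set_worker_pool_machine_type(machine_type):
    subprocess.run([
        'gcloud', 'builds', 'worker-pools', 'update', 'buildpool',
        '--project=fuzzbench',
        '--region=us-central1',
        '--worker-machine-type=' + machine_type,
    ], check=True)


set_worker_pool_machine_type('e2-standard-32')  # temporary, for testing
```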

I don't recall seeing this in the past and am unsure if it is flaky.
A silly question: we have always been running these experiments in gcloud and they have always worked fine.
Is there a way to see if some default configuration was changed by gcloud?

At the same time as running this experiment, I also requested another one from GitHub.
In that experiment, the build logs of (centipede, freetype2-2017) terminated at the same step, while (afl++, libxml2) also terminated at similar steps.

Temporarily changing the instance type is a good idea. I will see if I can tweak that and launch another experiment on some selected (fuzzer, benchmark) pairs.

Testing this in #1650.

I see the same issue in my experiment:
https://www.fuzzbench.com/reports/experimental/2023-01-27-aflpp/index.html
80% !! of the experiments are failing to build

In the CI they all went green:
https://github.com/google/fuzzbench/actions/runs/4025274106/jobs/6918292861
(only one failed, and that one is not in the coverage benchmark set)

The build logs of the failing experiments all end in the middle of the build process without any errors being seen (gs://fuzzbench-data/2023-01-27-aflpp/build-logs/benchmark-libjpeg-turbo-07-2017-fuzzer-aflplusplus_at_cm.txt):

c655d92adaf3: Layer already exists
13bbbaf28a73: Layer already exists
546f9db501ca: Layer already exists
31fb99fed15d: Layer already exists
e65a3ff4b09d: Layer already exists
4dc14efe7306: Layer already exists
0002c93bdb37: Layer already exists

:-(

Maybe a feature where a benchmark run is aborted if > 25% of the build targets fail would be good? It would prevent wasting resources.

Maybe a feature where a benchmark run is aborted if > 25% of the build targets fail would be good? It would prevent wasting resources.

Eh... I think there are too many smart features in FB as is.

  1. I'm not sure this is a Centipede issue. Centipede doesn't do anything too exotic in builds since it just uses clang; also, Marc is pointing out issues with AFL++'s builds.
  2. I'm not sure we should switch to the fancier instances. As we saw in OSS-Fuzz, they can cost a lot more.

  1. I'm not sure this is a Centipede issue. Centipede doesn't do anything too exotic in builds since it just uses clang; also, Marc is pointing out issues with AFL++'s builds.

Yep, I don't think it is either.

  2. I'm not sure we should switch to the fancier instances. As we saw in OSS-Fuzz, they can cost a lot more.

We did not have this issue before; did something change that caused this?
I switched to the instance with the highest memory for testing purposes.
Happy to test lower configurations to reduce cost.

The build failures disappeared after using a higher-memory worker pool instance.
More details about the sequence of experiments: #1626 (comment).

Maybe an update in the image (Ubuntu packages) increased the memory footprint? E.g. a Docker update now needing more resources or something...
Are the Docker image builds performed with make -j? Reducing the number of parallel processes would also reduce the memory required, but of course lengthen the build process.
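
A rough sketch of that idea - capping make parallelism by available memory rather than only CPU count; the 2 GiB-per-job budget and the bare make invocation are illustrative assumptions, not FuzzBench's actual build scripts:

```python
# Illustrative only: choose a `make -j` value that fits in available memory
# as well as in CPU count. The per-job memory budget (2 GiB) is a guess.
import os
import subprocess


def memory_capped_jobs(gib_per_job=2):
    avail_bytes = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_AVPHYS_PAGES')
    by_memory = int(avail_bytes / 2**30 // gib_per_job)
    return max(1, min(os.cpu_count() or 1, by_memory))


# Example: build with reduced parallelism instead of a plain `make -j`.
subprocess.run(['make', '-j%d' % memory_capped_jobs()], check=True)
```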

Can someone please kill the experiment 2023-01-29-aflpp? Again, way too many build errors to be useful, so this is just wasting resources. Thank you.

I noticed that an experiment with just 2 fuzzers built & ran cleanly, whereas one with 4 fuzzers had ~20% build failures and one with 5 fuzzers ~70% build failures.

Maybe building fewer fuzzers in parallel on one instance is the easy solution?

2023-01-29-aflpp

Done.

I noticed that an experiment with just 2 fuzzers built & ran cleanly, whereas one with 4 fuzzers had ~20% build failures and one with 5 fuzzers ~70% build failures.

Maybe building fewer fuzzers in parallel on one instance is the easy solution?

I don't think we are building multiple fuzzers on one instance though

I noticed that an experiment with just 2 fuzzers built & ran cleanly, whereas one with 4 fuzzers had ~20% build failures and one with 5 fuzzers ~70% build failures.
Maybe building fewer fuzzers in parallel on one instance is the easy solution?

I don't think we are building multiple fuzzers on one instance though

There is, however, a correlation between the number of fuzzers in an experiment and build failures.
I pushed a run with 2 fuzzers this morning - and it is running fine without any build failures, like the other one.
Whereas the run I pushed with 5 fuzzers again had 70% failures (the one you cancelled for me).

As I am doing lots of tests at the moment, I can tell you this:

If more than 3 fuzzer builds are running (in one request or across several - the total count is what matters), then build failures occur. So if I request 2 experiments at once, both with just 2 variants, or if I request 1 experiment with 4 variants, then there are failures. The more variants, the more failures - exponentially.

If I wait until the first requested experiment has finished building and then request another (with at most 2-3 variants), everything works fine.

What's the most recent experiment where this was an issue?

Well, I'm quite sure this is a quota issue. Quota errors are appearing all over the logs (gs://fuzzbench-data/2023-01-29-aflpp/build-logs/) in experiments with build failures.

Looks like we regularly exceed this:
[image]

We go higher than OSS-Fuzz with respect to this metric. There are two relevant (I think) differences in how we do builds:

  1. OSS-Fuzz (in trial builds) has a 1-second sleep (a throttling sketch follows below).
  2. OSS-Fuzz uses Python libraries for submitting builds, not gcloud.
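
A minimal sketch of throttling along those lines, wrapping the gcloud builds submit command quoted earlier in this thread; the one-second spacing mirrors the OSS-Fuzz trial-build sleep mentioned above, and the helper itself is hypothetical rather than FuzzBench code:

```python
# Hypothetical throttle around the existing `gcloud builds submit` calls so that
# concurrent experiments don't burst past the Cloud Build request quota.
import subprocess
import threading
import time

_submit_lock = threading.Lock()
_last_submit = 0.0


def submit_build(config_path, min_interval=1.0):
    """Submit one build, spacing submissions at least min_interval seconds apart."""
    global _last_submit
    with _submit_lock:
        wait = min_interval - (time.monotonic() - _last_submit)
        if wait > 0:
            time.sleep(wait)
        _last_submit = time.monotonic()
    subprocess.run([
        'gcloud', 'builds', 'submit', '/work/src',
        '--config=' + config_path,
        '--timeout=46800s',
        '--worker-pool=projects/fuzzbench/locations/us-central1/workerPools/buildpool',
    ], check=True)
```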

That could be the reason.
For the competition this will be an issue, as these have to run in parallel?

That could be the reason. For the competition this will be an issue, as these have to run in parallel?

I'm gonna try my best to fix this before the competition.

Thanks!

I didn't see this error again in the past two weeks' experiments; shall we close this?

Let's close. We can always re-open if we see it again. @jonathanmetzman, you had a better longer-term fix in mind; let's track that in a new bug.