Building a process binary for two different platforms with the same target triple causes a race condition.
jrvanwhy opened this issue
Question: If you try to build process binaries with two different layouts concurrently, does it result in a race condition? I'm primarily concerned with the layout file selection in `libtock_runtime`'s `build.rs`.

If the answer is yes, then we should investigate how to fix the race. This question becomes more important if we add functionality to `build.rs` to support building apps for multiple slots.
It's very hard to understand what `cargo` does by reading its code, so I cloned it and added print statements to the file lock and unlock functions. I've determined that:

- When building multiple crates, `cargo check`, `cargo build`, and `cargo run` lock the build directory before they start building the first crate, and do not unlock it until they are finished with the last crate. This means that compiling crates is not racy.
- `cargo run` unlocks the build directory before it executes the binary (or invokes the runner if one is provided), so `cargo run` invocations can race (see the sketch after this list).
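For example, here is a hypothetical reproduction of that window (the platform names and target triple are illustrative; both invocations share the same `target/thumbv7em-none-eabi/release` output path):

```sh
# Two concurrent cargo run invocations for different platforms but the
# same target triple. The builds themselves serialize on cargo's lock,
# but the platform-B build can overwrite the shared ELF after platform
# A's build finishes and before platform A's runner reads it.
LIBTOCK_PLATFORM=hail cargo run --release --target=thumbv7em-none-eabi &
LIBTOCK_PLATFORM=imix cargo run --release --target=thumbv7em-none-eabi &
wait
```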
For the record, I used `cargo` commit `07e9d46e3` and the following diff:
```diff
diff --git a/src/cargo/util/flock.rs b/src/cargo/util/flock.rs
index 755bcdcd9..bb099e6ff 100644
--- a/src/cargo/util/flock.rs
+++ b/src/cargo/util/flock.rs
@@ -349,23 +349,38 @@ mod sys {
     use std::os::unix::io::AsRawFd;

     pub(super) fn lock_shared(file: &File) -> Result<()> {
-        flock(file, libc::LOCK_SH)
+        println!("START: Instance {:?} lock_shared {:?}", std::env::var("INSTANCE"), file);
+        let out = flock(file, libc::LOCK_SH);
+        println!("END: Instance {:?} lock_shared {:?}", std::env::var("INSTANCE"), file);
+        out
     }

     pub(super) fn lock_exclusive(file: &File) -> Result<()> {
-        flock(file, libc::LOCK_EX)
+        println!("START: Instance {:?} lock_exclusive {:?}", std::env::var("INSTANCE"), file);
+        let out = flock(file, libc::LOCK_EX);
+        println!("END: Instance {:?} lock_exclusive {:?}", std::env::var("INSTANCE"), file);
+        out
     }

     pub(super) fn try_lock_shared(file: &File) -> Result<()> {
-        flock(file, libc::LOCK_SH | libc::LOCK_NB)
+        println!("START: Instance {:?} try_lock_shared {:?}", std::env::var("INSTANCE"), file);
+        let out = flock(file, libc::LOCK_SH | libc::LOCK_NB);
+        println!("END: Instance {:?} try_lock_shared {:?}", std::env::var("INSTANCE"), file);
+        out
     }

     pub(super) fn try_lock_exclusive(file: &File) -> Result<()> {
-        flock(file, libc::LOCK_EX | libc::LOCK_NB)
+        println!("START: Instance {:?} try_lock_exclusive {:?}", std::env::var("INSTANCE"), file);
+        let out = flock(file, libc::LOCK_EX | libc::LOCK_NB);
+        println!("END: Instance {:?} try_lock_exclusive {:?}", std::env::var("INSTANCE"), file);
+        out
     }

     pub(super) fn unlock(file: &File) -> Result<()> {
-        flock(file, libc::LOCK_UN)
+        println!("START: Instance {:?} unlock {:?}", std::env::var("INSTANCE"), file);
+        let out = flock(file, libc::LOCK_UN);
+        println!("END: Instance {:?} unlock {:?}", std::env::var("INSTANCE"), file);
+        out
     }

     pub(super) fn error_contended(err: &Error) -> bool {
```
So the current status is that building for different platforms with the same target triple causes races.
I'm trying to think through whether it is possible to solve this race condition, and if so, how.
First off: we only have a race if `cargo run` is executed concurrently with another `cargo` command that builds the same binary for the same `--target`. `cargo build` invocations cannot race with each other because they always hold locks through the entire operation.

If `cargo run` is invoked concurrently with `cargo run` for the same binary and `--target`, the following interleaving can happen:
1. `LIBTOCK_PLATFORM=A cargo run` acquires the lock, builds the ELF for platform A, then releases the lock.
2. `LIBTOCK_PLATFORM=B cargo run` acquires the lock, builds the ELF for platform B, then releases the lock.
3. `LIBTOCK_PLATFORM=A cargo run` invokes the `runner` binary, which invokes `elf2tab` on the ELF file.
This results in the `LIBTOCK_PLATFORM=A cargo run` invocation creating a TAB file built for platform B, which is incorrect. I do not see how we can defend against this, except to avoid running `cargo run` concurrently for different platforms.

So at the minimum, we should document that `cargo run` invocations for different platforms can race, and tell users not to do that.
Additionally, there are a couple of ways we could prevent this race for `libtock-rs`'s `Makefile` (sketched below):

1. Use a mutex (e.g. via GNU Parallel's `sem` binary) to prevent concurrent `cargo run` invocations.
2. Pass a different `--target-dir=` value for each platform.

Of course, option 3 (document + ignore) exists as well.
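A minimal sketch of options 1 and 2 as shell commands (the semaphore name, platform name, and target triple are all illustrative):

```sh
# Option 1: serialize cargo run invocations behind a named semaphore.
# sem is GNU Parallel's semaphore frontend; --fg waits for completion.
sem --id libtock-cargo-run --fg \
    env LIBTOCK_PLATFORM=hail cargo run --release --target=thumbv7em-none-eabi

# Option 2: give each platform its own target directory, so concurrent
# invocations never share an ELF output path.
LIBTOCK_PLATFORM=hail cargo run --release --target=thumbv7em-none-eabi \
    --target-dir=target/hail
```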
> Pass a different `--target-dir=` value for each platform.

This is effectively what the libtock-c build system does. There the requisite granularity is core type (i.e. separate builds for cortex-m0, -m3, -m4, -m7, and each flavor of RISC-V). Most of the compilation for any app is the libtock library, so the ecosystem sees an appreciable benefit from sharing per-platform build directories and re-using libtock library objects.

Reproducible builds are the other important part of the puzzle there, as it doesn't really matter if parallel build processes compile the same library file for a given platform in a racy way: they'll both produce the same thing.*

I very intentionally set up CI to stress test this: the `build_all.sh` script will build as many examples in parallel as there are cores on the machine. It seems to be working pretty well.

*Caveated on built object files being written/updated atomically, such that another build process won't try to read a partially written file, which compilers seem sensible about in practice.
Note that in the context of #482, this means passing a different `--target-dir` value for each location (flash address and RAM address) you build the app for.
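For instance (a hypothetical pair of invocations; the `LIBTOCK_LINKER_*` variable names are assumed from libtock-rs's linker-configuration convention, and the addresses are made up):

```sh
# Build the same app for two slots, keying the target directory on the
# flash/RAM addresses so the per-location ELFs never collide.
LIBTOCK_LINKER_FLASH=0x00040000 LIBTOCK_LINKER_RAM=0x20004000 \
    cargo build --release --target=thumbv7em-none-eabi \
    --target-dir=target/0x00040000.0x20004000
LIBTOCK_LINKER_FLASH=0x00060000 LIBTOCK_LINKER_RAM=0x20008000 \
    cargo build --release --target=thumbv7em-none-eabi \
    --target-dir=target/0x00060000.0x20008000
```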
Sigh, yep, that's how we end up with this list of "targets" https://github.com/tock/libtock-c/blob/432746cd7df195a6122d0d3f0e88a16720d5376d/Configuration.mk#L80-L94
I (optimistically?) view the spurious builds per-location as a transient hack that can be removed when relocations land.
ePIC makes everything better for sure.
sigh
> `cargo build` ...locks the build directory before it starts building the first crate, and does not unlock it until it is finished with the last crate

Which directory? `/target` or something more nested?
> > `cargo build` ...locks the build directory before it starts building the first crate, and does not unlock it until it is finished with the last crate
>
> Which directory? `/target` or something more nested?

I don't recall, and I don't have that `cargo` dev setup around anymore, so it's not trivial to check. I suspect it is per-architecture and per-profile, based on:
```console
jrvanwhy@jrvanwhy:~/libtock-rs/target$ find . -iname '*lock*'
./release/.cargo-lock
./riscv32imc-unknown-none-elf/release/.cargo-lock
./thumbv6m-none-eabi/release/.cargo-lock
./thumbv7em-none-eabi/release/.cargo-lock
```
If it's worth putting more effort into answering your question, let me know and I will recreate my test setup and give you a confident answer.
Well, is it possible that cargo could build two ELFs at once, but for different target triples?
> Well, is it possible that cargo could build two ELFs at once, but for different target triples?

A quick test shows that it is possible to build two ELFs at once if one uses the release profile and one uses the debug profile (whether or not the target triples differ). So far, I haven't been able to get it to build two release ELFs simultaneously, even with different target triples.
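An illustrative version of that quick test (not the exact commands from the issue):

```sh
# A debug build and a release build can proceed concurrently, while two
# release builds serialize -- consistent with the per-profile
# .cargo-lock files listed above, suggesting the lock is keyed on the
# profile rather than on the target triple.
cargo build --target=thumbv7em-none-eabi &
cargo build --release --target=thumbv7em-none-eabi &
wait
```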
At this point, I believe we've fixed all race conditions that are worth fixing. The remaining race conditions I'm aware of are:
1. Invoking multiple `make` commands in parallel can race.
2. Invoking `make` concurrently with `cargo run` can race.
3. `rustup` can potentially race with itself at any time.

Issues 1 and 2 are pretty much baked into the designs of `make` and `cargo run`, while issue 3 is tracked upstream in rust-lang/rustup#988.