roc-lang / roc

A fast, friendly, functional language.

Home Page:https://roc-lang.org

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Surgical linker can not link absolute relocations

Anton-4 opened this issue · comments

cargo run examples/platform-switching/main.roc 
    Finished dev [unoptimized + debuginfo] target(s) in 0.14s
     Running `target/debug/roc examples/platform-switching/main.roc`
🔨 Rebuilding host...
An internal compiler expectation was broken.
This is definitely a compiler bug.
Please file an issue here: https://github.com/rtfeldman/roc/issues/new/choose
thread 'main' panicked at 'Undefined Symbol in relocation, (+49b6, Relocation { kind: Relative, encoding: Generic, size: +20, target: Symbol(SymbolIndex(+1e6)), addend: +fffffffffffffffc, implicit_addend: false }): Ok(Symbol { name: "", address: +0, size: +0, kind: Section, section: Section(SectionIndex(+8)), scope: Compilation, weak: false, flags: Elf { st_info: +3, st_other: +0 } })', crates/linker/src/lib.rs:2713:25
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Cause:

app "rocLovesPlatforms"
    packages { pf: "rust-platform/main.roc" }
    imports []
    provides [main] to pf

main = Num.toStr (Num.sin 0)

This does work with --linker=legacy. Could this be a surgical linker issue @bhansconnect or is the true cause coming from somewhere else?

So we are hitting a few things here and I am trying to untaggle the issue so I can fix it.

The main things I am seeing:

  • .data.rel.ro section not supported by surgical linker (easy fix)
  • Absolute relocations not supported by surgical linker (should be easy fix, but haven't verified it works due to later blockers)
  • zig @sin looks to link to a sin function which I assume is the libc sin function. Which, we obviously don't have access to.
  • The zig builtin for sin looks to still call magic functions that the surgical linker doesn't know about. This means that we don't have enough information to link them. So we need to figure out how to properly add this functions or make zig emit them or inline them. I think they are all compiler-rt functions like __divti3, __umodti3, __udivti3, and __fixunsdfti

So I guess the summary is annoying dependency stuff.
Interestingly, optimized build seems to inline everything and fix all of these problems.
Since we ingest the zig as llvm ir, I am not sure how we can fix these dependencies/get the llvm backend to generate them automatically. With builtins-host.o, it looks like we can just run a release build all the time and it will clean up all of those function calls.

@folkertdev any thoughts on this?

we always compile the builtins with ReleaseFast, so the LLVM IR we get from zig does not change. But, we currently directly use the LLVM intrinsic for sin, (i.e. we don't use zig for that at all). Maybe try something like atan? for it we do use zig.

also, zig defines sin in its compiler_rt https://github.com/ziglang/zig/blob/master/lib/compiler_rt/sin.zig so that should not (have to) rely on libc.

Ok. So I think my comment was slightly off. Let me expand.

The final app.o file in a llvm backend build depends on a sin function that the linker should know about. Maybe it is libc, maybe compiler-rt, maybe libm, maybe llvm/lld magically knows how to find it. So in general, the dependency is there, but the surgical linker has no idea how to satisfy it. atan, on the other hand, works totally fine due to not depending on an intrinsic.

Anyway, due to compiling with ReleaseFast, the dependency can be avoid when compiling builtins-host.o. This call to sin will be inlined and removed from the output object. Hmm... Though it only works if we replace @sin with std.math.sin. With just @sin it will still depend on a sin function.
Compiling in ReleaseFast does not fix this for an llvm build. The llvm build will be given zig llvm ir that still includes the call to the llvm sin intrinsic which will get lowered to needing to link to a sin function. Changing from @sin to std.math.sin does not fix this.


As for compiler-rt, maybe we just aren't adding it correctly. It doesn't seem that zig is automatically adding the compiler-rt functions. Instead it assumes we are linking the compiler-rt functions. Which I think is why we manually had to add __muloti4. On some system it was being used but not linked against.

Oh, actually the compiler-rt issue looks to just be that we don't tell zig to bundle compiler-rt. I will put up a fix PR for that.

EDIT: Can't do this until the surgical linker bug below is fixed. Otherwise it pretty consistently breaks the surgical linker.

Oh, .data.rel.ro is a bigger issue than I realized. It isn't just an absolute relocation. It is an absolute relocation + generation of a dynamic relocation to be dealt with by the dynamic linker. At least that is my current understanding. https://www.airs.com/blog/archives/189 has some general info.

So adding more detail here just to offload it from my memory:

For .data.rel.ro:

  1. The section is stored in a RELRO segment. This segment is edited by the dynamic linker when loading the application and then marked read only. It should theoretically be in it's own virtual page so that it can be modified without effecting anything else.
  2. All of the absolute relocations need to be converted to R_X86_64_RELATIVE relocations. This tells the dynamic linker to add the final virtual offset to the address.
  3. These relocations get added to the the .rela.dyn section. (Some minor update to the Dynamic Section info as well). I am pretty sure there can only be one of this section. So we need to merge the roc .rela.dyn with the host .rela.dyn.
  • This may lead to some issues with the surgical linker. It basically means that when doing surgery to build the final app, we may need to be able to add a bunch of bytes randomly in the middle of the file. This just means that we would lose some performance. Currently we can base off of the entire preprocessed host and do in place updates. Instead we would need to copy part of the preprocesed host, add the rela.dyn information, copy the rest of the data, and complete surgery in place. It may be possible to always put this section at the end of the executable to avoid these issues (need to investigate farther, but looks possible (though, might also have a runtime performance cost due to needing to load the end of the executable to dynamically link, but idk))

Note: this should also enable better constants in roc. Currently we can't have a true const string, but fixing this would fix that.

we suspect #3827 is related too

I am finally looking into this. We shall see how it goes. Currently just looking at moving the .rela.dyn section to the end of the binary. Once that is done, it should be easy to extend the section while appending the application code.

Just have to get past all of the segfaults while loading the modified binary first. I debated just duplicating the section instead of moving it to avoid all of the offset modification pains, but the section can actually get quite large. Don't want to force extra bloat into all binaries.

Made some progress. Got a few apps to compile with a moved dynamic section. Then tried to make it work in the general case and broke things. Probably gonna be a good amount of bit twiddling and minor adjustments before I get this move working for all apps. That said, definitely possible given I had it working for one app for a bit. Hopefully, I can find the right combination quickly to get this all passing and updating each address in the correct way. Once that is done, I think expanding the table should be rather simple.

The main annoyance with this work is that I get a lot of SegEnv where the app doesn't load at all, and that is not easy to debug.

So I'm not sure what zig does differently than rust and c, but currently I have some zig hosts working with the section moved to the end of the file. I think the ones that don't work are broken due to shifting of alignment of some data. Which breaks when zig tries to load it using vector instructions. So theoretically zig host could work properly with the change, but they need some vector data alignment based fixes.

As for rust and c, really not sure. They crash before ever getting to any app code. The dynamic linker fails while trying to load them LD_DEBUG=all doesn't seem to print anything useful for them. It just crashes after initting the shared libraries.

Another update. Zig platforms definitely fully work except for the alignment issue for vector instructions (that said, I am still using zig 9). I was able to get a complex zig platform to work. It runs in debug builds where the alignment just happens to be correct. In release builds, it fails because of alignment issues, but the app runs until hitting a vector instruction.

Not sure if that is the place for it, but:
I saw the message b306c9a when trying to run a different code I wrote. For the code below, it panics similarly as for the original issue:

app "day3"
    packages { pf: "https://github.com/roc-lang/basic-cli/releases/download/0.2.0/8tCohJeXMBUnjo_zdMq0jSaqdYoCWJkWazBd4wa8cQU.tar.br" }
    imports [pf.Stdout]
    provides [main] to pf


main = 
    dbg (Num.round 5)
    dbg (Num.floor 5)
    Stdout.line "hi"

Adding --linker=legacy solves this. With the surgical linker, roc dev prints:

thread 'main' panicked at 'Undefined Symbol in relocation, (+1f2, Relocation { kind: PltRelative, encoding: Generic, size: +20, target: Symbol(SymbolIndex(+14)), addend: +fffffffffffffffc, implicit_addend: false }): Ok(Symbol { name: "round", address: +0, size: +0, kind: Unknown, section: Undefined, scope: Unknown, weak: false, flags: Elf { st_info: +10, st_other: +0 } })', crates/linker/src/elf.rs:1415:33

Can you add these cases to the error message b306c9a for the sake of future users?

Thanks for you suggestion @HajagosNorbert, I made #5492 for it.

Sorry for the late reply btw, I just happened to come across this message now.

Just leaving this info here; the json example now produces this error since the zig-11-llvm-16 merge.

+ ./roc_nightly/roc build ./examples/Json/main.roc
Downloading https://github.com/lukewilliamboswell/roc-json/releases/download/v0.3.0/y2bZ-J_3aq28q0NpZPjw0NC6wghUYFooJpH03XzJ3Ls.tar.br
    into /home/runner/.cache/roc/packages

The surgical linker currently has issue #3609 and would fail linking your app.
Please use `--linker=legacy` to avoid the issue for now.

Since I haven't touched this in a long time and I am not sure when I will attempt it again, I want to just write out a list of what I think would be the best way to deal with this as a whole (with my current understanding). Hopefully this information is correct, but this is from very old memory and a partial understanding of the proper solution. My last branch I was working on ended up dying because I 1) got into a mess of offsets and 2) made some wrong assumption around moving sections that lead to me moving things wrong and a lot of breakage. If I were to work on this today, my goals would be:

  1. Rip out the part of the surgical linker that disassembles the binary to avoid using the PLT. Just accept the PLT jump it is not a big deal and the complexity of the disassembler has other issues. Removing it before working on the rest of this will make life simpler. Just remove it and make a PR before starting anything else.
  2. To simplify work in general. Try to keep everything at the same virtual offset when possible even if the file offset changes. This will avoid a lot of relocation shuffling, but is not always possible
  3. Move the entire rela section (if necessary other dynamic related sections) to the end of the elf binary during preprocessing. Before even considering anything else, probably try and get a PR where everything just still works but the section is at the end of the binary. To do this, we probably need an extra program header to load this section to the correct virtual offset.
  4. Once that is working, switch to finally updating the surgical linking stage. It will need to first add extra absolute relocations to our moved rela sections. This will involve expanding the table. It also might require some form of sorting of the table, but I am not fully sure. I just remember seeing strange behavior that made me think sorting might be needed.
  5. Once that works, it is just a matter of appending and patching up like before. Just also filling in the rela table with absolute addresses as you learn them.