NixOS / ofborg

@ofborg tooling automation https://monitoring.ofborg.org/dashboard/db/ofborg

Home Page: https://ofborg.org

Donate x86_64 compute

jonringer opened this issue

I built a large home server to help with doing reviews (and I just wanted a server... COVID-19 does weird things). I wonder if I could donate some of its power for running ofborg evals; I've noticed they are very slow now.

$ neofetch --off
jon@nixos
---------
OS: NixOS 20.09 (Nightingale) x86_64
Host: TRX40 AORUS MASTER -CF
Kernel: 5.4.59
Uptime: 10 days, 17 hours, 50 mins
Packages: 903 (nix-system), 1516 (nix-user)
Shell: bash 4.4.23
Terminal: /dev/pts/2
CPU: AMD Ryzen Threadripper 3990X (128) @ 2.900GHz
Memory: 104842MiB / 257679MiB

I can provide SSH info if this is acceptable.

The Nix store is on a separate ZFS pool with dedup and compression enabled, which gives me around 5x space savings:

$ zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
nixstore  1.81T   692G  1.14T        -         -    37%    37%  2.66x    ONLINE  -
$ zfs get all nixstore | grep compress
nixstore  compressratio         1.90x                            -
$ zfs list nixstore/store
NAME             USED  AVAIL     REFER  MOUNTPOINT
nixstore/store  1.71T  1.05T     1.71T  legacy
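
For anyone wanting to replicate this, the dataset setup is just two properties (the pool/dataset names here are mine). Note that the two ratios above roughly compound: 2.66x (dedup) × 1.90x (compression) ≈ 5.1x, which is where the ~5x figure comes from.

$ zfs set compression=lz4 nixstore/store
$ zfs set dedup=on nixstore/store
$ zfs get compressratio,dedup nixstore/store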

I'm not sure what a good metric for package-building power would be:

$ time nix-build -A llvm --cores 128 --check
...
/nix/store/ybb1yp0kingbj9k9s3462qsiszmkqnyp-llvm-7.1.0

real	2m58.209s
user	0m0.640s
sys	0m0.954s
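
One caveat when reading those numbers: with a multi-user install the build runs inside nix-daemon, so time only measures this client process; real is the figure to compare, and user/sys will always look tiny. A core-count sweep might be a more telling metric; a sketch reusing the same llvm attribute (--check forces a rebuild of an already-built derivation):

$ for n in 16 32 64 128; do
    echo "cores=$n"
    time nix-build -A llvm --cores "$n" --check > /dev/null
  done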

cc @grahamc

(He's the only one who can add builders, since it requires access to infra secrets.)

I'll comb through the codebase tomorrow-ish and try to compile a list of commands one could run to test evaluation speed, for another point of data. Basically, all of these:

fn evaluation_checks(&self) -> Vec<EvalChecker> {
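
In the meantime, here are a few representative evaluation workloads one could time against a nixpkgs checkout. This is a hedged sketch; the authoritative list is whatever evaluation_checks returns in the ofborg source:

$ cd nixpkgs
$ time nix-env -f . -qaP --json > /dev/null
$ time nix-instantiate ./pkgs/top-level/release.nix -A manual
$ time nix-instantiate ./nixos/release-combined.nix -A tested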

I could be wrong but I think evaluations intentionally only run on managed infrastructure. Stability is much more important for that, and the requirements around memory, disk IO, and inodes are pretty specific when running lots of evaluations.

If evaluation is slow we can scale back up to 3 (or more?) evaluator nodes.

@LnL7 Hmm, you have a point; I don't think the evaluation jobs would ever reach my server. But @ofborg build <pkg> jobs might. This could help with staging-next, at least for x86_64. Anyway, I would like ofborg to be as responsive as it was a year ago, when I felt less need to add a GitHub Action to verify a build completed, because builds were usually done by the time someone reviewed a PR.

If you asked me which package I hate rebuilding most, I would answer «chromium».

From my experience, chromium has large parts of the build process that are single-threaded, so it doesn't benefit as much from large core counts.

I could be wrong but I think evaluations intentionally only run on managed infrastructure. Stability is much more important for that, and the requirements around memory, disk IO, and inodes are pretty specific when running lots of evaluations.

RAM: 256 GB of Trident Z Royal (the most I could get with consumer-grade components)

For disk IO, the nixstore pool is on a 2TB Sabrent M.2 drive (ZFS, compression=lz4, dedup=on).
My root partition is on a 500GB Sabrent M.2 drive (ext4).

Each card is capable of doing around 500k IOPS.
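
If anyone wants to sanity-check that figure, a 4k random-read fio run against a scratch file is the usual approach (the target path and sizes here are illustrative; aim it at the ext4 drive, since O_DIRECT on ZFS can be finicky):

$ fio --name=randread --filename=/var/tmp/fio.test --size=4G \
      --rw=randread --bs=4k --direct=1 --ioengine=libaio \
      --iodepth=64 --numjobs=4 --runtime=30 --group_reporting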

I can't offer the same stability guarantees as a datacenter, but you can see that the server has been running for 11 days. During peak summer I was turning it off during the day to keep it from heating up the room, but it should be able to run constantly for the next 8 months.

The stability aspect was another reason I was pitching it as a build machine :). If the nix-daemon is unable to connect to a build machine, it won't kill the job; the build will just get re-assigned. So it's relatively harmless, beyond the build capacity dropping.
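
For context, this is how remote builders are usually wired up: the daemon reads entries from /etc/nix/machines (URI, platforms, SSH key, max jobs, speed factor, supported features) and simply skips hosts it cannot reach. A hypothetical entry for a box like this, with the hostname and key path made up:

ssh://ofborg@threadripper.example x86_64-linux /etc/nix/keys/builder_ed25519 64 2 big-parallel,kvm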

For cooling, I have the CPU in a custom water-cooling loop with two 360mm rads. Even under full CPU load (~250-300W), I've never seen core temps rise more than 35C above ambient (65C during the summer).

I guess cooling isn't as important for this, but it does matter for stability and avoiding thermal throttling.

Stability is much more important for that, and the requirements around memory, disk IO, and inodes are pretty specific when running lots of evaluations.

I only started hitting a wall around ~24 concurrent evaluations, but that's because of SQLite lock contention (from the nix-daemon).
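
The contention is easy to reproduce by pointing many evaluations at one daemon at once, since registering the resulting .drv files is serialized through the daemon's single SQLite database. A rough sketch (in practice each of my evaluations targeted a different checkout, so the instantiations weren't all cached):

$ for i in $(seq 24); do
    nix-instantiate ./pkgs/top-level/release.nix -A manual > /dev/null &
  done; wait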

I was playing around with evaluation and derivation requirements in a side project: https://github.com/jonringer/basinix

@jonringer you need to send us a picture of that beast of a machine!

[photo: 20200423_225835.jpg]

Not the best photo, I have it hidden now.

This was when I was building it and rinsing the loop.

Unfortunately the mobo shown had a defective RAM slot, so now I have a different mobo, and I've added another rad since then.

I've heard talk of untrusted builders but haven't seen any developments. But maybe, just maybe, it could work for evaluations; for contributing that CPU work to the NixOS cache, though, I doubt it.

Nix hashes, as you know, are computed from the inputs, not the outputs. However, if a build is completely reproducible, byte for byte, then a distributed build and caching process could be implemented.

It would be great if people could contribute compute resources to build software, but there's always the chance that an untrusted builder could introduce code that is not the sum of its inputs. ;-)

However, if you had 10 builders and they each computed output hashes, you could accept a result when 9 of them agree. But that requires builds to be totally reproducible, byte for byte, and not everything is; builds can be non-deterministic. :-(
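
A sketch of what that vote could look like, assuming SSH access to each builder (hostnames hypothetical): query the NAR hash of the same output path everywhere and count agreement, trusting it only if the top hash holds a large majority.

$ out=/nix/store/ybb1yp0kingbj9k9s3462qsiszmkqnyp-llvm-7.1.0
$ for host in builder1 builder2 builder3; do
    ssh "$host" nix-store --query --hash "$out"
  done | sort | uniq -c | sort -rn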

Maybe it's easier to hash the output logs, rather than the resultant compiled code, just for evaluations, so a misbehaving builder (evaluator) can be identified.
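
A hypothetical one-liner for that, using the llvm path from earlier; note that build logs embed timestamps and interleaved output, so they are rarely byte-identical either:

$ nix-store --read-log /nix/store/ybb1yp0kingbj9k9s3462qsiszmkqnyp-llvm-7.1.0 | sha256sum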

It's all very interesting, and Nix probably makes this closer to a reality than any other build system out there.

Maybe when the IPFS Nix caches come online, this might go from 'ideaware' to 'software'. :-)

P.S. That machine is a beast! Just box it up and send it to @grahamc :-), and he can make sure your evals get piped to this machine (just for you!). ;-) lol

ofborg seems to be 1000x more responsive now. I think this can be closed.