Perf tracking
jonjohnsonjr opened this issue
Opening this as a meta-issue to track low-hanging fruit for perf wins.
State of the world before we start optimizing things too much:
$ pwd
/Users/jonjohnson/src/github.com/chainguard-images/images
$ time apko publish --arch amd64 images/go/configs/1.19.apko.yaml localhost:8081/go
...
apko publish --arch amd64 images/go/configs/1.19.apko.yaml localhost:8081/go 27.45s user 4.15s system 242% cpu 13.009 total
After #782
apko publish --arch amd64 images/go/configs/1.19.apko.yaml localhost:8081/go 14.16s user 3.83s system 151% cpu 11.892 total
Mild speedup, but huge reduction in CPU usage.
You can see the relative length of `BuildLayer` shrinking vs `buildImage`.
After chainguard-dev/go-apk#74
apko publish --arch amd64 images/go/configs/1.19.apko.yaml localhost:8081/go 13.42s user 4.45s system 210% cpu 8.477 total
Shaved off another ~3s.
After chainguard-dev/go-apk#75
apko publish --keyring-append --repository-append --arch amd64 13.46s user 4.56s system 227% cpu 7.917 total
Shaved off a little under a second.
Been a while since an update...
Here's a cold cache:
apko publish --keyring-append --repository-append --arch amd64 12.49s user 3.26s system 123% cpu 12.779 total
Here's warm:
apko publish --keyring-append --repository-append --arch amd64 9.03s user 1.56s system 216% cpu 4.882 total
Notably, cold is faster than warm when we started this effort 🎉
I have a branch that gets us down to ~3s on the hot path, but it's a bit of a dead end because it mostly just makes the work we're already doing a little bit more concurrent, which doesn't actually help that much in a build-the-world scenario.
This is HEAD:
![image](https://private-user-images.githubusercontent.com/17863526/253417989-8b74626b-02b3-461b-9c1c-7b7e36e97e2f.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDQxNzAyNTYsIm5iZiI6MTcwNDE2OTk1NiwicGF0aCI6Ii8xNzg2MzUyNi8yNTM0MTc5ODktOGI3NDYyNmItMDJiMy00NjFiLTljMWMtN2I3ZTM2ZTk3ZTJmLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDAxMDIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwMTAyVDA0MzIzNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTE2Y2NmNmQ1NmI4YTNhYzVlZTc3MWE1ODFiYzk0OTYyOGU4NjljZjRiZWMzN2UwMDNhMWYxNmJhNWE5ODU3NWMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.ZpFfmZ6VB0-xAPcqgBEYEbefks67VtoR6ZJt5iCg_yY)
This is my branch:
![image](https://private-user-images.githubusercontent.com/17863526/253418075-40ab0342-d7d3-460e-b612-f3035efb4f64.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDQxNzAyNTYsIm5iZiI6MTcwNDE2OTk1NiwicGF0aCI6Ii8xNzg2MzUyNi8yNTM0MTgwNzUtNDBhYjAzNDItZDdkMy00NjBlLWI2MTItZjMwMzVlZmI0ZjY0LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDAxMDIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwMTAyVDA0MzIzNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWY3ZTJhOTM1OTdmYzM5OThmMGFjM2Y2M2Y2NGI0ZjdkMjNiMjc2MTUwMWRkMjU1ZWZiMGI3YzNjMzBhYzJjNWUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.D1Lhc8KZq8dPn31QuLPyf3CpigLMoXoDXBRchy7TYWE)
At least in these two flamegraphs, the exact same 9.26 seconds of CPU time is getting done.
Looking at where we're spending that time...
About a third of our CPU time is in pgzip compressing the final layer. Since we're doing a parallel compression, this only takes ~850ms, so that's about the speed of light for us on a hot path:
![image](https://private-user-images.githubusercontent.com/17863526/253418460-d2d7c2d7-65bd-4d8c-b04d-a194e3cc3b53.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDQxNzAyNTYsIm5iZiI6MTcwNDE2OTk1NiwicGF0aCI6Ii8xNzg2MzUyNi8yNTM0MTg0NjAtZDJkN2MyZDctNjViZC00ZDhjLWIwNGQtYTE5NGUzY2MzYjUzLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDAxMDIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwMTAyVDA0MzIzNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTgyYWM5YmI2NTczZDBlOWE1MDM4NjkxZjc1YTBmNzgxMDBkOWRkN2MwYjBjNGZkY2VmM2M2Yzc3YWVlNGJhOTEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.SfBfJ-qKAeXrfyCCppu5lsmdlbCQlxX3O0MhSn-FAdY)
We spend ~1.5s serially writing things to disk and then walking the filesystem to read them back from disk:
![image](https://private-user-images.githubusercontent.com/17863526/253418688-3f7bb5fa-561d-4e15-85f9-888c73c07447.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDQxNzAyNTYsIm5iZiI6MTcwNDE2OTk1NiwicGF0aCI6Ii8xNzg2MzUyNi8yNTM0MTg2ODgtM2Y3YmI1ZmEtNTYxZC00ZTE1LTg1ZjktODg4YzczYzA3NDQ3LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDAxMDIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwMTAyVDA0MzIzNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWNjMmU1MDM3MjFhZjRhNTMzM2MwZTUxOWYzNDI0YzAzNDk3ODczY2QwNDY5ZDNiMzk5YTdlNTRhMzc0ZWJkYjkmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.DjaX0Vf_tsRK3AsCDJehHBtW2ue7jMwCks0unf8RbjI)
Meanwhile, we are gunzipping the data section of each APK, so that we are paying that time 2x (just concurrently):
![image](https://private-user-images.githubusercontent.com/17863526/253419056-7e34ae4a-4260-4c73-ba12-bbfddb1656b6.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDQxNzAyNTYsIm5iZiI6MTcwNDE2OTk1NiwicGF0aCI6Ii8xNzg2MzUyNi8yNTM0MTkwNTYtN2UzNGFlNGEtNDI2MC00YzczLWJhMTItYmJmZGRiMTY1NmI2LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDAxMDIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwMTAyVDA0MzIzNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTJlODM5MGFjY2QzOGUzOWNkOTE1ZmFmOTI4NTZkOGVlN2IzM2E1MmY4MTkyMWM2MWQzZDJiMzJkNjIyNzAxZTMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.5sO4pJO-NAMPICK2yiyr4X9QYvm6lsoppzCVYH0wz08)
A bit of a surprising result is that we spent a third of a second just cleaning up the temporary directory we created:
![image](https://private-user-images.githubusercontent.com/17863526/253419172-2a190d3a-b570-4e06-8258-2bf622319fa5.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDQxNzAyNTYsIm5iZiI6MTcwNDE2OTk1NiwicGF0aCI6Ii8xNzg2MzUyNi8yNTM0MTkxNzItMmExOTBkM2EtYjU3MC00ZTA2LTgyNTgtMmJmNjIyMzE5ZmE1LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDAxMDIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwMTAyVDA0MzIzNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWU3YmY1NGUyZjFhM2FhZWE5YzQxNTg3YTlhMWI3NWRjODEwZTE3MjZkOTNiNGMyY2FmNmUxODIwZGQwNTRlMGQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.YRJTebewH71vP74bR01Fh2mODchl8-PYqlgKS44e_1M)
Then we spend a surprising amount of time pushing images, but that's mostly because Docker Desktop won't stop touching my config file 🙄
$ cat ~/.docker/config.json | jq .credsStore
"desktop"
If we drop that, things look a little better here:
The SBOM generation is still pretty slow. I'm going to see if I can shift some of that left, but in the meantime I managed to cut 1/3 of it in #801.
Anyway, looking back at where we're spending our time, very roughly:
- 1s writing a bunch of files to disk
- 0.5s reading a bunch of files from disk
- 0.5s cleaning up a bunch of files from disk
- 1s compressing things and other stuff
I have a plan to index the data section of APKs (really, just extract the tar headers) when we download them for the first time, then use that index to avoid writing everything to disk. Instead, we can figure out what we would have written to disk, which of those files would get overwritten by subsequent APKs, and which files would be affected by the apko config stuff (chmod and whatnot). Then we just compress (in parallel) the subset of files from each APK that would have ended up in the final layer, and at the very end we append all of those gzipped tarballs together with a bunch of metadata we compute along the way.
It should look something like this:
- ~~1s writing a bunch of files to disk~~
- ~~0.5s reading a bunch of files from disk~~
- ~~0.5s cleaning up a bunch of files from disk~~
- 1s compressing things and other stuff
That last bit of compressing stuff will now happen even more concurrently than with pgzip, so I'm guessing that will bring us well under a second (on a hot path).
The next step after that would be to write some fun software that takes advantage of some details in DEFLATE to much more efficiently modify/recompress the existing APK's data section, which would shrink that latency by ~4-5x and get us closer to 250ms, at which point it will make some sense to revisit where we are spending our time.
So I'm not sure that we have the same performance constraints here, but you may find pantsbuild/pex#2175 interesting, especially the medusa-zip tool to rearrange zip files really fast at https://github.com/cosmicexplorer/medusa-zip. It's not quite the same thing as taking advantage of DEFLATE, but one extreme crime I have performed is the hackery to read out the contents of a zip archive into another one without touching the file stream at all: https://github.com/cosmicexplorer/zip/blob/94c21b77b21db4133a210f335e0671f4ea85d6a0/src/read.rs#L331-L392. The zip format was made for messing around like this; I would love to see more crimes against DEFLATE too.
This isn't exactly relevant except that I happened to be working on it at the same time as the above, but in pypa/pip#12184 (comment) I demonstrate the performance impact of creating a local index for pip which gets lazily updated as it crawls dependencies. Since I believe we discussed one result of this being the publication of indices for positions referencing some other compressed targz stream, I wanted to note instead that in a related but different application, I was able to generate local indices for resources as they were crawled, amortizing that transformation per-node. I would recommend trying that approach first here if you haven't solved the problem already by now.
It would also seem very much within the scope of something like medusa-zip to handle the creation of such indices in a streaming manner when a targz is first downloaded.
The following is mostly a note to self:
One additional concern that arose from the zip-merging solution investigated in pantsbuild/pex#2158 and pantsbuild/pex#2175 was that (as initially proposed) merging zip files from a shared cache would also take up more disk space than before (to create the cached zips). While handling that cache is in one sense an application-specific issue (see pantsbuild/pex#2201), if we also expand the medusa-zip archive library's capabilities to cover targz merging via creation of decompressed indices (and therefore hand over responsibility for the lifecycle of filesystem cache entries to that library/service), we could have it handle cache eviction etc. for the local entries it manages.
I'll create a separate issue if I have further thoughts on any of this and stop derailing this thread.
Although, regarding this approach in particular:
> I have a plan to index the data section of APKs (really, just extract the tar headers) when we download them for the first time, then use that index to avoid writing everything to disk. Instead, we can figure out what we would have written to disk, which of those files would get overwritten by subsequent APKs, and which files would be affected by the apko config stuff (chmod and whatnot). Then we just compress (in parallel) the subset of files from each APK that would have ended up in the final layer, and at the very end we append all of those gzipped tarballs together with a bunch of metadata we compute along the way.
In order to execute build processes in isolated chroots that can be cached and executed remotely via the bazel remexec API, pants maintains a virtual filesystem consisting of merkle trees stored in an LMDB content-addressed store, which can be efficiently synced against a remote database (since the db only contains a mapping of `(checksum) -> (byte string)`, and entries are stored as encoded protobufs). It exposes this to build tasks with a pretty novel API.
Your problem here can be solved without the global deduplication that pants performs, but I wanted to mention how encoding directory contents into merkle trees is a useful general approach for performing (as you said) "figure out what we would write to disk, what files would get overwritten by what would be written to disk by subsequent APKs, and also what files would be affected by the apko config stuff (chmod and whatnot)...". This act of normalization into a db-friendly format (in pants's case, converting directory trees into protobufs referencing other entities by checksum) may be the link that lets us meaningfully generalize this into a library, one which:
- efficiently reads/normalizes zips/targz into a local LMDB store
  - encoded into protobufs like pants
- efficiently computes the result of superimposing/transforming a sequence of normalized directory trees
  - without filesystem operations; this is what pants does with `Digest` and `Snapshot`
- has methods to efficiently export a normalized directory tree into zip/targz
  - not done in pants, but see e.g. pantsbuild/pants#19049 for similar optimizations
After chainguard-dev/go-apk#98
Cold: ~11s -> 4.9s
This came mostly from being able to fetch and decompress in parallel, which speeds up the installation phase.
Hot: ~4.2s -> 2.6s
We still have that faster install phase but we get to skip the fetch phase entirely.
After #860
Using the --offline flag (can't do this cold) on the hot path saves ~200ms, mostly from avoiding the TLS handshake at the beginning.
With #867
Building cgr.dev/chainguard/go for amd64.
Cold: 4.9s -> 3.7s
We are mostly limited here by how quickly we can fetch and decompress each APK. We definitely leave some performance on the table by limiting our concurrency during that phase; maybe worth looking into.
Hot: 2.6s -> 1.5s
We spend most of our time now in pgzip with a bit of time burned doing TLS handshakes at the beginning and SBOM generation (giant JSON document rendering) at the end.
Offline: 2.4s -> 1.3s
The next phase is to take this (CPU) hungry hungry pgzippopotamus and replace it with something that can go faster with less CPU. I'd even be fine with a slightly slower implementation that used much less CPU.
There is a particularly ambitious optimization we can perform where we could stitch together pre-existing DEFLATE streams when we know that their decompressed contents are identical, which would let us reuse the CPU-intensive parts of compressing all these files.
Ensuring that the decompressed contents are identical is very difficult in the general case, but we can skip that difficulty by taking advantage of APK checksums where we already do know that the contents are identical. This requires writing a custom DEFLATE encoder, which might be out of reach for the amount of time I have here, but I want to write it here for posterity in case I come back to it.