Set up periodic Chromium indexing job
varungandhi-src opened this issue
We have a Buildkite runner provisioned which is powerful enough to index Chromium in reasonable time. https://github.com/sourcegraph/infrastructure/pull/4910
The build machine is stateful, which is both good and bad. The good is that we:
- Only need to incrementally clone newer changes in Chromium
- Only need to incrementally build the code
- Only pay for disk storage (not compute) while the VM is stopped.
The bad is that we may run into build dependency issues over time, as Chromium's build scripts try to install system dependencies.
The basic workflow will look like this:

- One-time setup:
  - Clone `depot_tools`.
  - Clone Chromium.
  - Make sure `depot_tools` is available on `PATH`.
  - Install system dependencies, including Python 3.8 (there were issues with Python 3.10).
- Pipeline setup (i.e. every run):
  - Update `depot_tools` (`git pull origin main --ff-only`).
  - Update the checkout.
  - Re-run the build. Q: Does `gn` need to be reinvoked here if any of the build files have changed? Or is re-running `ninja` sufficient?
    - When running `ninja`, use `-k 0` to keep going in the presence of errors, and send a Slack message if `ninja` runs into errors.
    - Collect statistics about memory usage when running this.
  - Delete useless artifacts: `find out/X -regextype egrep -regex '.*\.(apk|apks|so|jar|zip|o)' -type f -delete`
  - Download the latest release of `scip-clang`.
  - Run the indexer.
    - If there are any warnings or errors printed while running the indexer, send a Slack message.
    - Collect statistics about memory usage when running this.
  - Download the latest release of `src-cli`.
  - Upload the index.
  - Delete the index.
  - Print statistics related to memory usage.
  - Delete `src-cli` and `scip-clang`.
If there is a failure at any step, we should send a Slack message to an internal channel with a link to the Buildkite job log.
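The per-run steps above can be sketched as a small set of shell functions. This is only a sketch: the checkout paths, the output directory, and the Slack notification mechanism are hypothetical placeholders, and `ninja` is assumed to be reachable via `depot_tools` on `PATH`.

```shell
#!/usr/bin/env bash
# Hedged sketch of the per-run pipeline; assumes the one-time setup has
# already cloned depot_tools and Chromium. Paths below are placeholders.
set -euo pipefail

CHROMIUM_SRC="${CHROMIUM_SRC:-$HOME/chromium/src}"
OUT_DIR="${OUT_DIR:-out/Default}"

update_checkout() {
  git -C "$HOME/depot_tools" pull origin main --ff-only
  git -C "$CHROMIUM_SRC" pull origin main --ff-only
  (cd "$CHROMIUM_SRC" && gclient sync)
}

run_build() {
  # -k 0: keep going past individual compile errors, so the indexer still
  # has most generated artifacts to work with; report errors to Slack.
  (cd "$CHROMIUM_SRC" && ninja -C "$OUT_DIR" -k 0) \
    || notify_slack "ninja reported errors"
}

delete_useless_artifacts() {
  # Drop large binary outputs the indexer never reads (generated headers
  # are kept, since those are needed for type-checking).
  find "$1" -regextype egrep -regex '.*\.(apk|apks|so|jar|zip|o)' -type f -delete
}

notify_slack() {
  # Hypothetical: post $1 to an internal channel with the Buildkite log URL.
  echo "SLACK: $1 (log: ${BUILDKITE_BUILD_URL:-unknown})" >&2
}
```

Each function maps to one bullet in the workflow above; a failure in any step can be caught by the pipeline wrapper and routed through `notify_slack` together with the Buildkite job log link.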
Some notes based on my convo with William:

- We can break the job into 3 steps:
  1. Start the GCP instance (runs on a stateless agent; the stateless agent has the GCP CLI pre-installed).
  2. Run the indexing job (on the stateful/powerful agent). The main caveat here is that we need to pass in a secret which lets us upload the index to Sourcegraph.com, but we can figure out how to resolve that once we get to that stage.
  3. Stop the GCP instance (runs on a stateless agent).
- The Buildkite UI has an 'Edit Steps' option that lets us modify the main Buildkite command to point it at another pipeline file.
![image](https://user-images.githubusercontent.com/93103176/239529984-f910d6d9-cf3b-4c15-8afb-71cdae1e928b.png)
Example of non-trivial pipeline magickery: https://github.com/sourcegraph/sourcegraph/tree/wb/app/aws-macos
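The stateless wrapper steps (1 and 3 above) boil down to starting and stopping the VM with the GCP CLI. A minimal sketch, assuming `gcloud` is pre-installed on the stateless agent; the instance name and zone are hypothetical placeholders:

```shell
#!/usr/bin/env bash
# Sketch of the stateless start/stop steps; INSTANCE and ZONE are
# placeholders for the real Chromium indexer VM.
set -euo pipefail

INSTANCE="${INSTANCE:-chromium-indexer}"
ZONE="${ZONE:-us-central1-a}"

start_indexer_vm() {
  gcloud compute instances start "$INSTANCE" --zone "$ZONE"
}

stop_indexer_vm() {
  # Stopping (not deleting) keeps the stateful disk around, so we only
  # pay for storage while the VM is down.
  gcloud compute instances stop "$INSTANCE" --zone "$ZONE"
}
```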
> Update depot_tools

`gclient` does this, and you should run `gclient sync` to pull updated dependencies anyway. IIRC the `depot_tools` or `gclient` update (I forget exactly which) also fetches some Python environments from something called CIPD. Its infra can be a bit flaky, but working around that flakiness is a lot of work.
> Q: Does `gn` need to be reinvoked here if any of the build files have changed? Or is re-running `ninja` sufficient?

In general, `gn` does not need to be reinvoked. That said, Chromium has a system called landmines for clobbering certain bots. So... YMMV? In case of repeated failures you might like to start from "scratch" (you probably don't need to re-clone, but you could blow away your build directory and run `git reset --hard HEAD && git clean -ffxd && gclient sync --force`, something like that).
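That "start from scratch" recovery could be wrapped as a single step in the pipeline. A hedged sketch, assuming the build directory is `out/Default` (a placeholder) inside the checkout passed as `$1`:

```shell
#!/usr/bin/env bash
# Sketch of a clobber-and-resync recovery step for repeated build
# failures; the out/Default path is a placeholder.
set -euo pipefail

clobber_and_resync() {
  rm -rf "$1/out/Default"          # blow away the build directory
  git -C "$1" reset --hard HEAD    # discard local modifications
  git -C "$1" clean -ffxd          # remove untracked/ignored files
  (cd "$1" && gclient sync --force)
}
```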
Do you need to build trunk to have the index in a good state?

> Delete useless artifacts ...

What are the useful artifacts? I'm wondering if you can get away with building a lot less.
> Do you need to build trunk to have the index in a good state?

From the indexer's perspective, it doesn't matter which exact commit it is, but we'd like to regularly index newer commits rather than purely regression testing against a pinned commit.

> What are the useful artifacts?

Anything that's needed to type-check in-project C++ files. Largely this would be generated headers, but not generated C++ files (or files in other languages).
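Given that answer, a more aggressive cleanup could keep only header-like files under the build directory and drop everything else. This is a deliberately crude sketch; the exact set of extensions worth keeping (e.g. whether `.inc` files matter) would need verification against a real index run:

```shell
#!/usr/bin/env bash
# Hypothetical aggressive cleanup: keep only files needed for
# type-checking (headers), delete all other build outputs.
set -euo pipefail

prune_to_headers() {
  find "$1" -type f ! -name '*.h' ! -name '*.hpp' ! -name '*.inc' -delete
}
```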