Set up periodic Chromium indexing job
varungandhi-src opened this issue
We have a Buildkite runner provisioned which is powerful enough to index Chromium in reasonable time. https://github.com/sourcegraph/infrastructure/pull/4910
The build machine is stateful, which is both good and bad. The good is that we:
- Only need to incrementally clone newer changes in Chromium
- Only need to incrementally build the code
- Only pay for disk storage (not compute) while the VM is stopped.
The bad is that we may run into build dependency issues over time, as Chromium's build scripts try to install system dependencies.
The basic workflow will look like this:

- One-time setup:
  - Clone `depot_tools`.
  - Clone Chromium.
  - Make sure `depot_tools` is available on `PATH`.
  - Install system dependencies, including Python 3.8 (there were issues with Python 3.10).
- Pipeline setup (i.e. every run):
  - Update `depot_tools` (`git pull origin main --ff-only`).
  - Update the checkout.
  - Re-run the build. Q: Does `gn` need to be reinvoked here if any of the build files have changed? Or is re-running `ninja` sufficient?
    - When running `ninja`, use `-k 0` to keep going in the presence of errors, and send a Slack message if `ninja` runs into errors.
    - Collect statistics about memory usage when running this.
  - Delete useless artifacts: `find out/X -regextype egrep -regex '.*\.(apk|apks|so|jar|zip|o)' -type f -delete`
  - Download the latest release of `scip-clang`.
  - Run the indexer.
    - If there are any warnings or errors printed while running the indexer, send a Slack message.
    - Collect statistics about memory usage when running this.
  - Download the latest release of `src-cli`.
  - Upload the index.
  - Delete the index.
  - Print statistics related to memory usage.
  - Delete `src-cli` and `scip-clang`.
If there is a failure at any step, we should send a Slack message to an internal channel with a link to the Buildkite job log.
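The per-run steps above can be sketched as a small set of shell functions. This is only a sketch: the checkout paths, the output directory, and the Slack notification mechanism are hypothetical placeholders, and `ninja` is assumed to be reachable via `depot_tools` on `PATH`.

```shell
#!/usr/bin/env bash
# Hedged sketch of the per-run pipeline; assumes the one-time setup has
# already cloned depot_tools and Chromium. Paths below are placeholders.
set -euo pipefail

CHROMIUM_SRC="${CHROMIUM_SRC:-$HOME/chromium/src}"
OUT_DIR="${OUT_DIR:-out/Default}"

update_checkout() {
  git -C "$HOME/depot_tools" pull origin main --ff-only
  git -C "$CHROMIUM_SRC" pull origin main --ff-only
  (cd "$CHROMIUM_SRC" && gclient sync)
}

run_build() {
  # -k 0: keep going past individual compile errors, so the indexer still
  # has most generated artifacts to work with; report errors to Slack.
  (cd "$CHROMIUM_SRC" && ninja -C "$OUT_DIR" -k 0) \
    || notify_slack "ninja reported errors"
}

delete_useless_artifacts() {
  # Drop large binary outputs the indexer never reads (generated headers
  # are kept, since those are needed for type-checking).
  find "$1" -regextype egrep -regex '.*\.(apk|apks|so|jar|zip|o)' -type f -delete
}

notify_slack() {
  # Hypothetical: post $1 to an internal channel with the Buildkite log URL.
  echo "SLACK: $1 (log: ${BUILDKITE_BUILD_URL:-unknown})" >&2
}
```

Each function maps to one bullet in the workflow above; a failure in any step can be caught by the pipeline wrapper and routed through `notify_slack` together with the Buildkite job log link.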
Some notes based on my convo with William:

- We can break the job into 3 steps:
  1. Start the GCP instance (runs on a stateless agent; the stateless agent has the GCP CLI pre-installed).
  2. Run the indexing job (on the stateful/powerful agent). The main caveat here is that we need to pass in a secret which lets us upload the index to Sourcegraph.com, but we can figure out how to resolve that once we get to that stage.
  3. Stop the GCP instance (runs on a stateless agent).
- The Buildkite UI has an 'Edit Steps' option that lets us modify the main Buildkite command to point it at another pipeline file.
![image](https://user-images.githubusercontent.com/93103176/239529984-f910d6d9-cf3b-4c15-8afb-71cdae1e928b.png)
Example of non-trivial pipeline magickery: https://github.com/sourcegraph/sourcegraph/tree/wb/app/aws-macos
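The stateless wrapper steps (1 and 3 above) boil down to starting and stopping the VM with the GCP CLI. A minimal sketch, assuming `gcloud` is pre-installed on the stateless agent; the instance name and zone are hypothetical placeholders:

```shell
#!/usr/bin/env bash
# Sketch of the stateless start/stop steps; INSTANCE and ZONE are
# placeholders for the real Chromium indexer VM.
set -euo pipefail

INSTANCE="${INSTANCE:-chromium-indexer}"
ZONE="${ZONE:-us-central1-a}"

start_indexer_vm() {
  gcloud compute instances start "$INSTANCE" --zone "$ZONE"
}

stop_indexer_vm() {
  # Stopping (not deleting) keeps the stateful disk around, so we only
  # pay for storage while the VM is down.
  gcloud compute instances stop "$INSTANCE" --zone "$ZONE"
}
```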
> Update depot_tools

`gclient` does this, and you should run `gclient sync` to pull updated dependencies anyway. IIRC the `depot_tools` or `gclient` update (I forget exactly which) also fetches some Python environments from something called CIPD. Its infra can be a bit flaky, but working around that flakiness is a lot of work.
> Q: Does `gn` need to be reinvoked here if any of the build files have changed? Or is re-running `ninja` sufficient?

In general, `gn` does not need to be reinvoked. That said, Chromium has a system called landmines for clobbering certain bots. So... YMMV? In case of repeated failures you might like to start from "scratch" (you probably don't need to re-clone, but you could blow away your build directory and run `git reset --hard HEAD && git clean -ffxd && gclient sync --force`, something like that).
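That "start from scratch" recovery could be wrapped as a single step in the pipeline. A hedged sketch, assuming the build directory is `out/Default` (a placeholder) inside the checkout passed as `$1`:

```shell
#!/usr/bin/env bash
# Sketch of a clobber-and-resync recovery step for repeated build
# failures; the out/Default path is a placeholder.
set -euo pipefail

clobber_and_resync() {
  rm -rf "$1/out/Default"          # blow away the build directory
  git -C "$1" reset --hard HEAD    # discard local modifications
  git -C "$1" clean -ffxd          # remove untracked/ignored files
  (cd "$1" && gclient sync --force)
}
```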
Do you need to build trunk to have the index in a good state?

> Delete useless artifacts ...

What are the useful artifacts? I'm wondering if you can get away with building a lot less.
> Do you need to build trunk to have the index in a good state?

From the indexer's perspective, it doesn't matter which exact commit it is, but we'd like to regularly index newer commits rather than purely regression testing against a pinned commit.

> What are the useful artifacts?

Anything that's needed to type-check in-project C++ files. Largely this would be generated headers, but not generated C++ files (or files in other languages).
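Given that answer, a more aggressive cleanup could keep only header-like files under the build directory and drop everything else. This is a deliberately crude sketch; the exact set of extensions worth keeping (e.g. whether `.inc` files matter) would need verification against a real index run:

```shell
#!/usr/bin/env bash
# Hypothetical aggressive cleanup: keep only files needed for
# type-checking (headers), delete all other build outputs.
set -euo pipefail

prune_to_headers() {
  find "$1" -type f ! -name '*.h' ! -name '*.hpp' ! -name '*.inc' -delete
}
```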