servo / project

A repo for the Servo Project

Decide strategy for CI

asajeffrey opened this issue · comments

We shouldn't be on taskcluster any more.

@SimonSapin says that each mac CI run of Servo would cost $17 on GitHub actions?

https://docs.github.com/en/github/setting-up-and-managing-billing-and-payments-on-github/about-billing-for-github-actions says 0.08 USD / macOS minute.

Assuming GitHub’s workers are equivalent to one mac mini from Macstadium, we have a ~40 min build on one worker followed by ~20 minutes of WPT on 9 workers, which is 40 + 9 × 20 = 220 macOS-minutes.

This is a very rough estimate, but it gives an order of magnitude. We could consider running much less of WPT on mac, perhaps only WebGL tests, since we also run all of WPT on Linux.
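
As a sanity check, that back-of-envelope estimate can be written out (a sketch only; the rate and timings are the rough figures quoted above):

```python
# Rough cost estimate for one macOS CI run of Servo on GitHub-hosted
# runners, using the figures quoted in this thread.
MACOS_RATE_USD_PER_MIN = 0.08  # GitHub's documented per-minute macOS rate

build_minutes = 40       # ~40 min build on one worker
wpt_minutes = 20 * 9     # ~20 min of WPT on each of 9 workers

total_minutes = build_minutes + wpt_minutes
cost_usd = total_minutes * MACOS_RATE_USD_PER_MIN

print(total_minutes, round(cost_usd, 2))  # 220 minutes, about 17.6 USD
```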

Here are some of the options with some thoughts on each. There are likely others, but I don't know as much about them.

TaskCluster on community-tc

This is the status quo.

Pros:

  • Already works
  • Has a team that maintains it / keeps it running

Cons:

  • Even estimating the cost of Servo’s resource usage is difficult, since everything runs in cloud accounts shared with other projects
  • Mozilla won’t support that cost long-term, so Servo will need to move

Self-hosted TaskCluster

Current-generation TaskCluster is designed so that multiple independent instances/deployments can exist, and there is some documentation about deploying a new one. Mozilla has two, and apparently at least one other company has one. The Servo Project could host one.

Pros:

  • Most similar to the status quo, so wouldn’t involve for example fighting build systems to make everything compile for multiple platforms in a different environment
  • Has built-in auto-scaling of cloud VMs based on demand. (Those cloud providers don’t offer macOS, though.)
  • Fully open-source, so in the worst case any specific need can be met with "only" engineering effort without being subject to the whims of a vendor
  • Very flexible, can potentially support any scenario
  • The Taskcluster team has been very nice and helpful so far

Cons:

  • Takes some effort to deploy: AWS S3 and Azure Tables and something that can run Kubernetes are required at minimum
  • Takes some effort to maintain: fixing things when they break, applying updates from upstream, etc.
  • No commercial support
  • The Taskcluster team is small and may not have a lot of availability to help Servo, especially over the long-term
  • Treeherder is a separate project. Would we want to also self-host it?

GitHub Actions with GitHub-hosted runners

Pros:

  • Turn-key solution
  • Has a team that maintains it / keeps it running

Cons:

  • Price can add up quickly, especially for macOS runners which have 10× cost
  • Runners have low hardware specs (2 CPU cores, 7 GB RAM, 14 GB storage). Compilation would be slow, if it can finish at all, and WPT would be very slow unless it's parallelized aggressively. Higher-spec options are planned but not (publicly) available yet.

GitHub Actions with self-hosted runners

Pros:

  • Can use hardware with any specs
  • Could be cheaper, especially on macOS, if utilization is high enough. One month of a 2012-generation quad-core mac mini is 99 USD at Macstadium, which buys only ~20 hours of GitHub-hosted dual-core macOS minutes.
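
The break-even point implied by those numbers can be checked with a quick calculation (a sketch; the prices are the ones quoted above):

```python
# Break-even utilization for a dedicated Macstadium mac mini versus
# paying per-minute for GitHub-hosted macOS runners.
MACSTADIUM_USD_PER_MONTH = 99.0   # quad-core mac mini, per month
GITHUB_MACOS_USD_PER_MIN = 0.08   # GitHub-hosted macOS rate

breakeven_minutes = MACSTADIUM_USD_PER_MONTH / GITHUB_MACOS_USD_PER_MIN
breakeven_hours = breakeven_minutes / 60

print(round(breakeven_hours, 1))  # ~20.6 hours of GitHub minutes per month
```

If Servo's monthly macOS CI usage exceeds that, the dedicated machine is the cheaper option (ignoring the differing hardware specs).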

Cons:

  • When utilization is low (when Homu’s queue is empty), long-lived runners would sit idle but still cost money. GitHub does not provide anything for automatic scaling based on demand; some folks build complex DIY systems to do it.

  • GitHub documents:

    We recommend that you do not use self-hosted runners with public repositories.

    Forks of your public repository can potentially run dangerous code on your self-hosted runner machine by creating a pull request that executes the code in a workflow.

    This is not an issue with GitHub-hosted runners because each GitHub-hosted runner is always a clean isolated virtual machine, and it is destroyed at the end of the job execution.

    Untrusted workflows running on your self-hosted runner poses significant security risks for your machine and network environment, especially if your machine persists its environment between jobs.

    Currently with TaskCluster we have a similar situation, with tasks that run as soon as a PR is opened, before any review. We run those in a separate pool of short-lived workers. But GitHub doesn’t appear to have a recommendation besides “don’t”.

We chatted with Pietro about rust-lang's experience with GitHub actions, and it sounds like our concerns about minutes do not apply to non-private repositories. That suggests that we should try creating some Actions workflows for linux/mac/windows builds and see what the cycle time looks like.

Summary of the conversation:

  • the biggest pain point for GitHub actions is the security of custom runners. rust-lang runs a fork which rejects builds for PRs to avoid this, and also creates ephemeral VMs as a backup security measure.
  • the fork causes problems with github's desire to auto-upgrade runners; there's a 4 hour window between new upstream releases and when the non-upgraded runners stop accepting jobs.
  • RAM limitations on workers are likely to be the biggest concern for migrating Servo
  • github actions don't have templates for configuration; rust-lang wrote a preprocessor to generate workflows programmatically
  • rust-lang is very happy with GitHub actions modulo the pain points; they're on a super-special cloud enterprise setup that gives them a lot of builders.
  • mac builders can sometimes exceed the documented specs if you get lucky
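
The “preprocessor to generate workflows” point can be illustrated with a minimal sketch. The template, file names, and build command below are invented for illustration and are not rust-lang's actual tooling:

```python
# Hypothetical sketch of a workflow "preprocessor": expand one job
# template into per-platform GitHub Actions workflow files, since
# Actions has no built-in templating for this.
TEMPLATE = """\
name: build-{platform}
on: [push, pull_request]
jobs:
  build:
    runs-on: {runner}
    steps:
      - uses: actions/checkout@v4
      - run: ./mach build --release
"""

PLATFORMS = {
    "linux": "ubuntu-latest",
    "mac": "macos-latest",
    "windows": "windows-latest",
}

def generate_workflows():
    """Return a mapping of output path -> rendered workflow YAML."""
    return {
        f".github/workflows/{platform}.yml": TEMPLATE.format(
            platform=platform, runner=runner
        )
        for platform, runner in PLATFORMS.items()
    }

for path in generate_workflows():
    print(path)
```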

Ah, interesting... https://about.gitlab.com/solutions/open-source/ provides free CI time for open source projects.

Hi All! I'm the one who runs the GitLab for Open Source program that @asajeffrey mentioned. Through the program, you'd get 50K CI minutes per month for free.

One of the benefits of GitLab CI is that we're really focused on security, so that's baked in as well. Here's an example:

GitLab Security Scanning

Happy to answer any questions! Just ping me :)

on the GitHub Actions to GitLab CI comparison:

the biggest pain point for GitHub actions is the security of custom runners. rust-lang runs a fork which rejects builds for PRs to avoid this, and also creates ephemeral VMs as a backup security measure.

Considering GitLab CI for external resources (when using it with GitHub), this is the default behavior, so you don't need to keep a fork of the runner. By default (as a security model) we never run external code (from a fork) within the same context as the project. So an external user will never be able to arbitrarily execute anything using your project's own resources, own runners etc.

When using GitLab CI as part of a GitLab project (which is not the case here), MRs (what we call PRs) always execute in the context of where they originated. If from a fork, they use that fork's context.

(relevant documentation: https://docs.gitlab.com/ee/ci/ci_cd_for_external_repos/)

the fork causes problems with github's desire to auto-upgrade runners; there's a 4 hour window between new upstream releases and when the non-upgraded runners stop accepting jobs.

In our release model, IIRC you can still use an older runner for as long as you want; the side effect is that it may not support a feature introduced in a new version. We release new minor versions on the 22nd of each month as part of GitLab's release cycle (patch releases can happen during the month, but again, upgrading isn't necessary as long as whatever is being fixed doesn't affect you).

RAM limitations on workers are likely to be the biggest concern for migrating Servo

what are your requirements?

One rustc process consumes up to 6 GB at its peak when compiling Servo, I believe.

https://coopcoopbware.tumblr.com/post/636411382111272960/taskcluster-ci-for-engineers talks about Taskcluster outside of Mozilla.

Taskcluster required at least two separate cloud providers (AWS and Azure) and a Heroku account to launch.

Over the past year, we removed the need for Azure as a back-end data store and removed the need for Heroku for deployments. Now if you have a Kubernetes environment setup, you’re ready to install Taskcluster. You’ll still need AWS S3 access for artifact storage, but we’re working to make that configurable too.

So this reduces one of the concerns I had.

However later in the post:

Here are some examples of use cases where Taskcluster might make sense for you:

  • You already have a person or team of people dedicated to your CI pipeline.
  • […]

This is the assessment of someone working on TC at Mozilla. I feel this is probably the serious downside that makes self-hosted TC not a good fit for Servo.

@mrobinson we have been making lots of improvements to the CI this year; do we still need this issue, or can we close it?

I think we can close this as we've standardized on GitHub Actions for the moment. Maybe that's not the best thing forever, and I could see a future where we want to manage our own runners, but for now I think we can move on.