servo / project

A repo for the Servo Project

Decide strategy for CI

asajeffrey opened this issue · comments

We shouldn't be on taskcluster any more.

@SimonSapin says that each mac CI run of Servo would cost $17 on GitHub actions?

https://docs.github.com/en/github/setting-up-and-managing-billing-and-payments-on-github/about-billing-for-github-actions says 0.08 USD / macOS minute.

Assuming GitHub’s workers are equivalent to one mac mini from Macstadium, we have a ~40 min build on one worker followed by ~20 minutes of WPT on 9 workers, which is 40 + 9 × 20 = 220 macOS-minutes.

This is a very rough estimate, but it gives an order of magnitude. We could consider running much less of WPT on mac, perhaps only WebGL tests, since we also run all of WPT on Linux.
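
As a sanity check, that back-of-envelope estimate can be written out (a sketch only; the rate and timings are the rough figures quoted above):

```python
# Rough cost estimate for one macOS CI run of Servo on GitHub-hosted
# runners, using the figures quoted in this thread.
MACOS_RATE_USD_PER_MIN = 0.08  # GitHub's documented per-minute macOS rate

build_minutes = 40       # ~40 min build on one worker
wpt_minutes = 20 * 9     # ~20 min of WPT on each of 9 workers

total_minutes = build_minutes + wpt_minutes
cost_usd = total_minutes * MACOS_RATE_USD_PER_MIN

print(total_minutes, round(cost_usd, 2))  # 220 minutes, about 17.6 USD
```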

Here are some of the options with some thoughts on each. There are likely others, but I don't know as much about them.

TaskCluster on community-tc

This is the status quo.

Pros:

  • Already works
  • Has a team that maintains it / keeps it running

Cons:

  • Even estimating the cost of Servo’s resource usage is difficult, since everything runs in cloud accounts shared with other projects
  • Mozilla won’t support that cost long-term, so Servo will need to move

Self-hosted TaskCluster

Current-generation TaskCluster is designed so that multiple independent instances/deployments can exist, and there is some documentation about deploying a new one. Mozilla has two, and apparently at least one other company has one. The Servo Project could host one.

Pros:

  • Most similar to the status quo, so wouldn’t involve for example fighting build systems to make everything compile for multiple platforms in a different environment
  • Has built-in auto-scaling of cloud VMs based on demand. (Those cloud providers don’t offer macOS, though.)
  • Fully open-source, so in the worst case any specific need can be met with "only" engineering effort without being subject to the whims of a vendor
  • Very flexible, can potentially support any scenario
  • The Taskcluster team has been very nice and helpful so far

Cons:

  • Takes some effort to deploy: AWS S3 and Azure Tables and something that can run Kubernetes are required at minimum
  • Takes some effort to maintain: fixing things when they break, applying updates from upstream, etc.
  • No commercial support
  • The Taskcluster team is small and may not have a lot of availability to help Servo, especially over the long-term
  • Treeherder is a separate project. Would we want to also self-host it?

GitHub Actions with GitHub-hosted runners

Pros:

  • Turn-key solution
  • Has a team that maintains it / keeps it running

Cons:

  • Price can add up quickly, especially for macOS runners which have 10× cost
  • Runners have low hardware specs (2 CPU cores, 7 GB RAM, 14 GB storage). Compilation would be slow, if it can finish at all, and WPT would be very slow unless it's parallelized aggressively. Higher-spec options are planned but not (publicly) available yet.

GitHub Actions with self-hosted runners

Pros:

  • Can use hardware with any specs
  • Could be cheaper, especially on macOS, if utilization is high enough. One month of a 2012-generation quad-core mac mini is 99 USD at Macstadium, which buys only ~20 hours of GitHub-hosted dual-core macOS minutes.
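
The break-even point implied by those numbers can be checked with a quick calculation (a sketch; the prices are the ones quoted above):

```python
# Break-even utilization for a dedicated Macstadium mac mini versus
# paying per-minute for GitHub-hosted macOS runners.
MACSTADIUM_USD_PER_MONTH = 99.0   # quad-core mac mini, per month
GITHUB_MACOS_USD_PER_MIN = 0.08   # GitHub-hosted macOS rate

breakeven_minutes = MACSTADIUM_USD_PER_MONTH / GITHUB_MACOS_USD_PER_MIN
breakeven_hours = breakeven_minutes / 60

print(round(breakeven_hours, 1))  # ~20.6 hours of GitHub minutes per month
```

If Servo's monthly macOS CI usage exceeds that, the dedicated machine is the cheaper option (ignoring the differing hardware specs).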

Cons:

  • When utilization is low (when Homu’s queue is empty), long-lived runners would sit idle but still cost money. GitHub does not provide anything for automatic scaling based on demand; some folks build complex DIY systems to do it.

  • GitHub documents:

    We recommend that you do not use self-hosted runners with public repositories.

    Forks of your public repository can potentially run dangerous code on your self-hosted runner machine by creating a pull request that executes the code in a workflow.

    This is not an issue with GitHub-hosted runners because each GitHub-hosted runner is always a clean isolated virtual machine, and it is destroyed at the end of the job execution.

    Untrusted workflows running on your self-hosted runner poses significant security risks for your machine and network environment, especially if your machine persists its environment between jobs.

    Currently with TaskCluster we have a similar situation, with tasks that run as soon as a PR is opened, before any review. We run those in a separate pool of short-lived workers. But GitHub doesn’t appear to have a recommendation besides “don’t”.

We chatted with Pietro about rust-lang's experience with GitHub actions, and it sounds like our concerns about minutes do not apply to non-private repositories. That suggests that we should try creating some Actions workflows for linux/mac/windows builds and see what the cycle time looks like.

Summary of the conversation:

  • the biggest pain point for GitHub actions is the security of custom runners. rust-lang runs a fork which rejects builds for PRs to avoid this, and also creates ephemeral VMs as a backup security measure.
  • the fork causes problems with github's desire to auto-upgrade runners; there's a 4 hour window between new upstream releases and when the non-upgraded runners stop accepting jobs.
  • RAM limitations on workers are likely to be the biggest concern for migrating Servo
  • github actions don't have templates for configuration; rust-lang wrote a preprocessor to generate workflows programmatically
  • rust-lang is very happy with GitHub actions modulo the pain points; they're on a super-special cloud enterprise setup that gives them a lot of builders.
  • mac builders can sometimes exceed the documented specs if you get lucky
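
The “preprocessor to generate workflows” point can be illustrated with a minimal sketch. The template, file names, and build command below are invented for illustration and are not rust-lang's actual tooling:

```python
# Hypothetical sketch of a workflow "preprocessor": expand one job
# template into per-platform GitHub Actions workflow files, since
# Actions has no built-in templating for this.
TEMPLATE = """\
name: build-{platform}
on: [push, pull_request]
jobs:
  build:
    runs-on: {runner}
    steps:
      - uses: actions/checkout@v4
      - run: ./mach build --release
"""

PLATFORMS = {
    "linux": "ubuntu-latest",
    "mac": "macos-latest",
    "windows": "windows-latest",
}

def generate_workflows():
    """Return a mapping of output path -> rendered workflow YAML."""
    return {
        f".github/workflows/{platform}.yml": TEMPLATE.format(
            platform=platform, runner=runner
        )
        for platform, runner in PLATFORMS.items()
    }

for path in generate_workflows():
    print(path)
```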

Ah, interesting... https://about.gitlab.com/solutions/open-source/ provides free CI time for open source projects.

Hi All! I'm the one who runs the GitLab for Open Source program that @asajeffrey mentioned. Through the program, you'd get 50K CI minutes per month for free.

One of the benefits of GitLab CI is that we're really focused on security, so that's baked in as well. Here's an example:

GitLab Security Scanning

Happy to answer any questions! Just ping me :)

on the GitHub Actions to GitLab CI comparison:

the biggest pain point for GitHub actions is the security of custom runners. rust-lang runs a fork which rejects builds for PRs to avoid this, and also creates ephemeral VMs as a backup security measure.

Considering GitLab CI for external resources (when using it with GitHub), this is the default behavior, so you don't need to keep a fork of the runner. By default (as a security model) we never run external code (from a fork) within the same context as the project. So an external user will never be able to arbitrarily execute anything using your project's own resources, own runners etc.

When using GitLab CI as part of a GitLab project (which is not the case here), MRs (what we call PRs) always execute in the context of where they originated. If from a fork, they use that fork's context.

(relevant documentation: https://docs.gitlab.com/ee/ci/ci_cd_for_external_repos/)

the fork causes problems with github's desire to auto-upgrade runners; there's a 4 hour window between new upstream releases and when the non-upgraded runners stop accepting jobs.

In our release model, IIRC you can still use an older runner for as long as you want; the side effect is that it may not support a feature introduced in a new version. We release new minor versions on the 22nd of each month as part of GitLab's release cycle (patch releases can happen during the month, but again, upgrading isn't necessary as long as whatever is being fixed doesn't affect you).

RAM limitations on workers are likely to be the biggest concern for migrating Servo

what are your requirements?

One rustc process consumes up to 6 GB at its peak when compiling Servo, I believe.

https://coopcoopbware.tumblr.com/post/636411382111272960/taskcluster-ci-for-engineers talks about Taskcluster outside of Mozilla.

Taskcluster required at least two separate cloud providers (AWS and Azure) and a Heroku account to launch.

Over the past year, we removed the need for Azure as a back-end data store and removed the need for Heroku for deployments. Now if you have a Kubernetes environment setup, you’re ready to install Taskcluster. You’ll still need AWS S3 access for artifact storage, but we’re working to make that configurable too.

So this reduces one of the concerns I had.

However later in the post:

Here are some examples of use cases where Taskcluster might make sense for you:

  • You already have a person or team of people dedicated to your CI pipeline.
  • […]

This is the assessment of someone working on TC at Mozilla. I feel this is probably the serious downside that makes self-hosted TC not a good fit for Servo.

@mrobinson we have been making lots of improvements to the CI this year; do we still need this issue, or can we close it?

I think we can close this as we've standardized on GitHub Actions for the moment. Maybe that's not the best thing forever, and I could see a future where we want to manage our own runners, but for now I think we can move on.