ansible / team-devtools

Shared practices, workflows and decisions impacting Ansible devtools projects

Home Page: https://ansible.readthedocs.io/projects/team-devtools/

proposal: limit the number of github actions run (matrix explosion)

ssbarnea opened this issue

In order to reduce the disagreements we have had regarding how GHA CI should be set up for a project, I would like to propose limiting the number of jobs we run by avoiding the exponential matrix explosion problem.

I personally know how having too many jobs can affect a project's development, as both ansible-lint and molecule were affected by this problem in the past. I remember seeing more than 100 jobs running at some point; it was insane to scroll through them to find which one failed.

As a rule of thumb, I think we should aim to limit them to what can be displayed on a single screen, about 15 jobs.

What is test matrix explosion?

Let's look at some matrix dimensions that we may apply to projects:

  • 5 ansible versions: 2.9 through 2.12, plus devel
  • 5 python versions: 3.6 through 3.10
  • 3 node versions: 12, 14, 16
  • 3 platforms: linux, macos and windows
  • 3 vscode versions: oldest, stable and testing

There are also common task types which I did not include in the matrix even though they could be: build, package, test, lint. Usually we focus on test, but I mention them here because there are cases where we might need/want more than one. For example, we still want to allow developers/contributors to perform linting locally, regardless of which platform they run the linting on, even if we only test it on one.

If we want to maximize testing, we would get something like:

  • vscode-ansible or ansible-language-server: 5*5*3*3*3 = 675 jobs
  • linter/molecule: 5*5*2 = 50+ jobs
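
To make the arithmetic concrete, here is a hedged sketch of what an unconstrained GitHub Actions matrix covering all of these dimensions could look like (job name, matrix keys and version lists are illustrative, not taken from any of our workflows); the job count is simply the product of the list lengths:

```yaml
# Illustrative only: every value of every dimension gets combined, so this
# single matrix would expand to 5 * 5 * 3 * 3 * 3 = 675 jobs, well over the
# 256-job cap mentioned later in this thread.
jobs:
  test:
    strategy:
      matrix:
        ansible: ["2.9", "2.10", "2.11", "2.12", "devel"]
        python: ["3.6", "3.7", "3.8", "3.9", "3.10"]
        node: [12, 14, 16]
        os: [ubuntu-latest, macos-latest, windows-latest]
        vscode: [oldest, stable, testing]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v2
      - run: echo "ansible=${{ matrix.ansible }} python=${{ matrix.python }} node=${{ matrix.node }} vscode=${{ matrix.vscode }}"
```

Every extra dimension multiplies the total, which is why pruning even one list has a large effect.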

Tricks to reduce number of jobs

One trick that we already use in many projects is chaining operations into a single job. This works very well if the tests themselves are not long running.

Benefits

  • Chaining can reduce total execution time because the number of parallel workers is limited per repository and platform.
  • Chaining reduces costs considerably: running 100 jobs is roughly 100x more expensive than running one.
  • We already chain many actions: build-package-test is one example, and in some projects we even include linting.
  • ansible-lint chains the ansible version matrix. Basically, you will see test jobs for individual python versions, but inside each one a separate step runs for every supported ansible version. If the job fails, you will see directly which ansible version it failed on. The downside is that you will not get to see whether it passes on the other ansible versions, as the job stops right away. See https://sbarnea.com/ss/Screen-Shot-2021-09-30-10-09-49.99.png - I find this inconvenience smaller than the improvement of displaying only ~13 jobs in the UI instead of 40-50. A sketch of this layout follows this list.
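
As a rough illustration of that layout (the tox environment names below are invented for this sketch, not the linter's real ones), the matrix only spans Python versions, while each supported Ansible version runs as its own step inside the job:

```yaml
# Hypothetical sketch: 3 matrix jobs instead of 3 * 3 = 9, yet a failure
# still points at a specific Ansible version because each one is its own step.
jobs:
  test:
    strategy:
      matrix:
        python: ["3.6", "3.8", "3.10"]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python }}
      - run: python -m pip install tox
      - name: ansible 2.9
        run: tox -e ansible29
      - name: ansible 2.12
        run: tox -e ansible212
      - name: ansible devel
        run: tox -e devel
```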

Downsides

  • If your tests already take 30 minutes to run, chaining is not an option.
  • Chaining more than one dimension is usually not a good idea. It might work when the second dimension has only two possible values, but clearly not more than that.

Solutions

Obviously we do not run all of these, as it would take many hours and a lot of money. The idea is to perform smart/selective matrix testing and pick the most common use cases plus the extremes. The result does not have to be perfect, but it will provide a very good level of confidence.
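
A hedged sketch of what such a selective matrix could look like, listing only the extremes plus one common combination explicitly instead of taking the full product (the exact picks are illustrative):

```yaml
# Illustrative only: 3 hand-picked jobs instead of 5 * 5 * 3 = 75 combinations.
jobs:
  test:
    strategy:
      matrix:
        include:
          - { os: ubuntu-latest, python: "3.6", ansible: "2.9" }     # oldest extreme
          - { os: ubuntu-latest, python: "3.10", ansible: "devel" }  # newest extreme
          - { os: macos-latest, python: "3.9", ansible: "2.12" }     # common non-linux case
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v2
      - run: echo "python=${{ matrix.python }} ansible=${{ matrix.ansible }}"
```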

Another suggestion is to test only the oldest supported version. For example, our colleagues at https://github.com/redhat-developer/ reported that they only run jobs with the oldest supported version of node, as more than that would incur costs they cannot afford.

We should keep in mind that while a project is young, it may seem easy to say that a few more jobs would not hurt, especially when you are not the one paying the bill. The bigger the organization, the longer it takes to spot the waste of CI resources. For that reason alone, it is better to be conservative about increasing resource usage and to avoid doing it until we have proof that the missing coverage allowed costly bugs in. Even when that happens, it should usually be enough to add a single job covering the untested matrix dimension.

A few quick thoughts:

  • Each workflow has a cap of 256 jobs, so your 675-job example would need to be split across several workflows or something.
  • I'm not a fan of arbitrary limits like "15 jobs max". If people are adding jobs needlessly, this can be caught/discussed at review time when it happens. But I'd rather have the discussion than say "we're not taking this change because it violates an arbitrary number we picked."
  • Another thing to consider is ease of finding information in the output. "Chaining", if I understand correctly, would combine a bunch of would-be jobs into just one. As you already mentioned, this makes it harder to know at a glance which versions have failed, and generally makes test output harder to read/debug/find.

In general because of that (especially the last point), I tend to be more in favor of having more jobs.

I guess I'm a little confused, too:

it was insane to scroll through them to find which one failed.

but what you're proposing with chaining would make it even harder, if I understand right, since you'd have to scroll through hundreds or thousands of lines of output (from the chained/combined job), find the failure, scroll up to see which set of versions failed, etc. Wouldn't this make the problem worse?

@relrod Chaining done well does not degrade browsability. In short, to chain 3 tox jobs correctly you need to run each tox env in a separate step; if you do tox -e a,b,c you will get poor output, with likely more scrolling to find the error.

When you run each one as a separate step, GHA will collapse the output from the previous steps and browsing will be as easy as in a normal job. Look at the screenshot below:

That was taken from https://github.com/ansible-community/ansible-lint/pull/1642/checks?check_run_id=3766116634 page.
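
For readers who cannot see the screenshot, a hedged sketch of the difference being described (the tox environment names are invented for this example):

```yaml
# Illustrative only. The commented-out variant dumps all envs into one long log;
# the expanded variant gives each env its own collapsible step in the GHA UI.
jobs:
  tox:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: python -m pip install tox
      # harder to browse, one combined log:
      # - run: tox -e lint,py39,packaging
      # easier to browse, each finished step collapses on its own:
      - run: tox -e lint
      - run: tox -e py39
      - run: tox -e packaging
```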

The chaining approach has been used by the ansible-lint project for X months without any real issues. If we did not use it, each change would likely take >1h to test instead of less than 20min, mainly due to the extra resource usage from growing the number of jobs from 11 to 21 (or even 27). We do not have unlimited resources, and we need to be smart to get the most from the ones we have without impacting our ability to test and merge changes.

In fact, on many projects, including the linter, we also use the fail-fast strategy, where we cancel any running jobs when one of them fails, freeing resources for other changes that may be waiting in the queue.
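
In workflow terms that is the strategy-level fail-fast switch (which, as far as I know, is already the GHA default for matrix jobs); a minimal sketch:

```yaml
# When any job in the matrix fails, GHA cancels the remaining in-progress
# jobs of that matrix, freeing runners for other queued changes.
jobs:
  test:
    strategy:
      fail-fast: true
      matrix:
        python: ["3.6", "3.8", "3.10"]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: python -m pip install tox
      - run: tox -e py
```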

Thanks for raising this and defining the test matrix options.
Personally, I've set up a full mix of oldest and newest (with only minimal testing of the versions in between).

FTR, I typed up my comment a few days ago but forgot to send it, and now it's lost after a browser crash.
I'll send a longer post when I have time, but TL;DR: as I've mentioned on Slack, I fully agree with @relrod.