containerd / nerdctl

contaiNERD CTL - Docker-compatible CLI for containerd, with support for Compose, Rootless, eStargz, OCIcrypt, IPFS, ...

CI failure issues

apostasie opened this issue

Description

Right now it is really hard to get a green build on the first try, as many things fail for reasons unrelated to the submitted code.

This ticket is meant to collect the issues I have been seeing that we should be able to address relatively easily:

  • Docker Hub returns 429 (too many requests)
  • Debian package repositories drop the connection

Steps to reproduce the issue

  1. Submit a PR

Describe the results you received and expected

Fail.

Not fail. :-)

What version of nerdctl are you using?

n.a.

Are you using a variant of nerdctl? (e.g., Rancher Desktop)

None

Host information

No response

Reading through the Dockerfile and experimenting with GitHub Actions, here are the take-away notes:

  • using actions/cache for BuildKit caching is not practical - if we are to do BuildKit-level caching, we should export the cache to a registry - either way, this is unlikely to deliver a significant improvement given the size of the cache we would have to retrieve / store
  • using actions/cache for a local mirror registry would be flimsy and hard to get right - furthermore, it is unlikely to bring a significant speedup for the build phase - whether it would matter more for overall test time is unknown
  • currently, we fan out along a two-dimensional matrix (ubuntu version / containerd version) and build everything from scratch for each item - the issue here is not so much GitHub compute resources (we do not care about that yet), but rather that every build hits both Docker Hub and the Debian repositories - if we reduced the number of times we rebuild everything (other than nerdctl itself), we would indeed reduce the number of builds failing because Docker Hub returns 429 or the Debian repos drop us
  • we could further leverage caching to store the "dependency base image" and cut these builds from "n * k per run" to "k a day" (where k = number of different containerd versions) - see the workflow sketch after this list
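To make the last two points concrete, here is a minimal sketch of a scheduled workflow that builds and pushes the dependency base image once a day, with the BuildKit cache exported to a registry instead of actions/cache. Everything here is hypothetical: the file name, the ghcr.io image refs, the `dependencies` Dockerfile target, the CONTAINERD_VERSION build-arg, the version list, and the schedule are placeholders, not existing project settings.

```yaml
# .github/workflows/build-deps.yml (hypothetical file name)
name: build-dependency-image

on:
  schedule:
    - cron: "0 4 * * *"   # once a day: Hub / Debian get hit "k times a day", not per PR
  workflow_dispatch: {}

jobs:
  build-deps:
    runs-on: ubuntu-24.04
    strategy:
      matrix:
        # k = number of containerd versions we test against (values are examples)
        containerd: ["1.7.23", "2.0.0"]
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          # assumes the Dockerfile exposes a dedicated "dependencies" stage
          target: dependencies
          build-args: |
            CONTAINERD_VERSION=${{ matrix.containerd }}
          push: true
          tags: ghcr.io/containerd/nerdctl-deps:containerd-${{ matrix.containerd }}
          # BuildKit cache exported to the registry rather than actions/cache
          cache-from: type=registry,ref=ghcr.io/containerd/nerdctl-deps:buildcache
          cache-to: type=registry,ref=ghcr.io/containerd/nerdctl-deps:buildcache,mode=max
```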

Current train of thought is to:

  • slightly change the Dockerfile so that we have a clearly separated "dependencies build" stage, which we can either build or replace with a preloaded image
  • change the actions flow so that we load the cache, build if need be, save the cache, and save artifacts for downstream jobs - a sketch of that flow follows this list
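A rough sketch of that load / build / save flow, assuming the hypothetical `dependencies` Dockerfile stage from above; cache keys, image names, and paths are illustrative and do not exist in the repo today:

```yaml
jobs:
  build-dependencies:
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - name: Restore dependency image from cache
        id: deps-cache
        uses: actions/cache@v4
        with:
          path: /tmp/deps.tar
          # keying on the Dockerfile hash invalidates the cache when deps change
          key: deps-image-${{ hashFiles('Dockerfile') }}
      - name: Build dependencies (cache miss only)
        if: steps.deps-cache.outputs.cache-hit != 'true'
        run: |
          docker build --target dependencies -t nerdctl-deps:local .
          docker save nerdctl-deps:local -o /tmp/deps.tar
      - name: Save artifact for downstream jobs
        uses: actions/upload-artifact@v4
        with:
          name: nerdctl-deps
          path: /tmp/deps.tar

  test-integration:
    needs: build-dependencies
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: nerdctl-deps
          path: /tmp
      - run: docker load -i /tmp/deps.tar
      # ... integration tests would then run against nerdctl-deps:local
```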

Current testing shows about 25% time saved per PR once the cache is hot, and of course a significant reduction in failures caused by third-party providers dropping connections on us.

Further notes about researching GH actions and our overall testing:

  • for some reason, uploading artifacts and using them in a subsequent job is routinely "cancelled" - it is not clear whether we are hitting some kind of resource limit - also, uploading artifacts costs about 1 minute and downloading about 15 seconds, which seems like a lot compared to cache - as an alternative, we could rely on cache alone and bring the build back into the actual test-integration job (since cache entries might get evicted), though that would likely cause conflicts between multiple competing builds
  • IPFS tests are the long pole, especially since they fail a lot and we retry when they do. Maybe we could split them out onto their own node?
    • cmd/nerdctl.TestIPFSComposeUp/overlayfs (33.34s)
    • --- FAIL: TestIPFSComposeUp/stargz (91.22s)
    • --- FAIL: TestIPFSComposeUp (136.00s)
    • total about 4 minutes just for these three
  • in that context ^ we could split tests into groups using go test -skip <regexp> - that would alleviate the need to group them explicitly by hand - see the sketch after this list
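A minimal sketch of that split, assuming the integration suite can be driven with plain go test and that the TestIPFS name prefix is a good enough grouping criterion; the package path, Go version, and runner label are illustrative:

```yaml
jobs:
  test-integration:
    runs-on: ubuntu-24.04
    strategy:
      fail-fast: false
      matrix:
        include:
          # IPFS tests get their own node so their failures and retries
          # do not hold up the rest of the suite
          - group: ipfs
            flags: -run 'TestIPFS'
          - group: main
            flags: -skip 'TestIPFS'   # -skip requires go >= 1.20
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: "1.23.x"
      - name: Run integration tests (${{ matrix.group }})
        run: go test -v ./cmd/nerdctl/... ${{ matrix.flags }}
```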