Bizarre Debian Buster DEB pkg test failure

Question

ximon18 opened this issue 3 years ago

The pkg-test (debian:buster) Packaging workflow job is consistently failing with a bizarre error while the Debian Stretch and Debian Bullseye jobs work just fine.

The error message is: (see here)

cannot locate base snap core20: No such file or directory

This appears to have something to do with the Snap daemon, which we don't use or care about.

Ximon Eighteen · Answer 1 · Wed Oct 13 2021 18:06:16 GMT+0800 (China Standard Time)

A quick test of the Debian Buster .deb package (produced by the pkg (debian:buster) job in the same workflow) in a Docker container seems to show that the rtrtr binary works fine; at least it runs and can report its own version number.

For now we'll just disable the pkg-test (debian:buster) job. Hopefully we can work out what is going on here later.

yurtesen · Answer 2 · Sat Feb 05 2022 14:59:37 GMT+0800 (China Standard Time)

When you launch a machine/container, snapd tries to refresh the snaps if the image used was old. If certain commands/apps installed through snap are executed during a core refresh, one might end up with an error message like cannot locate base snap core20: No such file or directory.

Ximon Eighteen · Answer 3 · Tue Feb 08 2022 17:03:16 GMT+0800 (China Standard Time)

Thank you @yurtesen for your explanation, it is much appreciated.

I'm afraid I'm still at a loss to know what this is about. We do not do anything with snap, nor, I assume, does snap care about Docker images. I'm not sure what a "core refresh" is. Is this perhaps an issue with the underlying GitHub Actions hosted runner...?

Are you able to expand on your feedback to indicate how it relates to the workflow that is failing?

Thanks,

Ximon

yurtesen · Answer 4 · Mon Feb 21 2022 03:10:36 GMT+0800 (China Standard Time)

@ximon18 I only found this issue because I had exactly the same error in our GitLab CI environment. I do not know for sure if it applies to your environment as I am not familiar with it. But snap might care about docker images if docker was installed as a snap package in your test VM/container image. Snap core is what makes snap tick ... -> https://snapcraft.io/install/core20/ubuntu <- So programs installed using snap stop working momentarily while core is being updated/refreshed.

We are using nested LXD containers. But I imagine the issue can happen whenever a VM or any other machine is launched. In our case, as LXD is only available as a snap package, the lxc commands we executed were susceptible. If we used docker installed as a snap package, then docker commands would be affected while snap is "refreshing" core.

We solved the issue by adding the entry 127.0.0.1 api.snapcraft.io to /etc/hosts on the test machine to stop snap from refreshing packages, because I found out that snap refresh can't be stopped by design 🤦 If you can't do that, a short delay before you start using docker might be all you need, because a snap refresh usually takes only 10-20 seconds.
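The /etc/hosts workaround described above can be sketched in Python. This is only an illustration, not the script used in the GitLab CI setup mentioned here: the function name block_snap_store is hypothetical, and it only computes the new file contents, so reading and writing /etc/hosts (which needs root) is left to the caller.

```python
def block_snap_store(hosts_text: str, host: str = "api.snapcraft.io") -> str:
    """Return hosts_text with a '127.0.0.1 <host>' entry, added only if absent.

    Hypothetical helper: it does not touch the filesystem itself.
    """
    for line in hosts_text.splitlines():
        # Ignore trailing comments, then split into address and host names.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "127.0.0.1" and host in fields[1:]:
            return hosts_text  # entry already present, nothing to do
    # Make sure the existing content ends with a newline before appending.
    separator = "" if hosts_text == "" or hosts_text.endswith("\n") else "\n"
    return f"{hosts_text}{separator}127.0.0.1 {host}\n"


# Possible usage as root (not executed here):
#   with open("/etc/hosts") as f:
#       updated = block_snap_store(f.read())
#   with open("/etc/hosts", "w") as f:
#       f.write(updated)
```

Because the helper adds the entry only when it is missing, running it on every CI job should not keep growing the file.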

Ximon Eighteen · Answer 5 · Mon Feb 21 2022 18:41:05 GMT+0800 (China Standard Time)

Thanks @yurtesen.

That's a very enlightening comment, thank you.

The workflow that failed also uses an LXD container, running on an Ubuntu host provided by GitHub Actions. I don't know if we can edit /etc/hosts on that machine, whether the change would take effect early enough even if we can, or whether we would even want to, as it feels like a very hacky solution.

As you indicate that the problem occurs when snap refresh is running I wonder if there is some reliable way to detect if a snap refresh is in progress and to wait until it completes. Our workflow already waits for example until cloud-init completes before proceeding.
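One possible way to detect a refresh and wait for it is to poll the output of snap changes. The sketch below is untested against the failing workflow and rests on assumptions: that snap changes prints one row per change with a numeric ID, a status column (where Do or Doing means "not finished") and a summary that mentions "refresh". The names refresh_in_progress, snap_changes and wait_for_refresh are hypothetical.

```python
import subprocess
import time


def refresh_in_progress(changes_output: str) -> bool:
    """Guess from `snap changes` output whether a refresh is still running.

    Assumes rows look like: ID  Status  Spawn  Ready  Summary.
    """
    for line in changes_output.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a numbered change.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        if fields[1] in ("Do", "Doing") and "refresh" in line.lower():
            return True
    return False


def snap_changes() -> str:
    """Run `snap changes` and return its stdout (empty if it prints nothing)."""
    result = subprocess.run(["snap", "changes"], capture_output=True, text=True)
    return result.stdout


def wait_for_refresh(get_changes=snap_changes, timeout=120.0, interval=2.0,
                     sleep=time.sleep) -> bool:
    """Poll until no refresh is in progress; return False if timeout expires."""
    deadline = time.monotonic() + timeout
    while refresh_in_progress(get_changes()):
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
    return True
```

Note that this only helps once the refresh has already started. If the first poll runs before snapd begins refreshing, the function returns immediately and a later lxc command could still hit the error, so it is not a complete fix on its own.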

Now that you've explained what is going on I at least have a better idea of what to try and investigate, thanks for that!

Ximon

yurtesen · Answer 6 · Tue Feb 22 2022 02:10:07 GMT+0800 (China Standard Time)

@ximon18 I think one other way to resolve the issue would be to execute snap refresh on the problem machine (or in the script) before executing any lxc commands. It blocks until the refresh is completed. Unfortunately I am not able to see the output from the run you linked in your first message.
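A minimal sketch of that suggestion, assuming (as stated above) that snap refresh blocks until the refresh has finished and that the script runs with enough privileges to call it. The helper name run_after_refresh is hypothetical, and the run parameter exists only so that the command runner can be swapped out in tests.

```python
import subprocess


def run_after_refresh(command, run=subprocess.run):
    """Run `snap refresh` to completion, then run the given command.

    Both calls use check=True, so a non-zero exit status raises an error
    instead of letting the script continue.
    """
    run(["snap", "refresh"], check=True)
    return run(command, check=True)


# Possible usage (not executed here): refresh first, then list containers.
#   run_after_refresh(["lxc", "list"])
```

Doing the refresh explicitly moves the waiting time to a known point in the script, rather than leaving it to chance whether an lxc command overlaps with the automatic refresh.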

Ximon Eighteen · Answer 7 · Tue Feb 22 2022 02:16:27 GMT+0800 (China Standard Time)

GitHub has deleted the log I linked to unfortunately.

I tried today to reproduce the problem but cannot. If we see it again I'll come back to your useful advice here.

Thanks!

Ximon