rancher / community-catalog

Catalog entries contributed by the community

gitlab-multi-runner config Sidekick "couldn't execute POST" connection refused

felixschul opened this issue

Hi all,

First of all many thanks for the awesome work with rancher and the catalog items!

I am experiencing the following problem with the gitlab-multi-runner community catalog item:

When a new instance is started (in my case AWS spot instances), sometimes the "config" sidekick that registers the new runner fails with the following error message:

9.7.2018 17:11:34  Running in system-mode.
9.7.2018 17:11:34
9.7.2018 17:11:39  ERROR: Registering runner... failed                 runner=FKPDjL73 status=couldn't execute POST against [URL]: dial tcp: lookup gitlab.ambient-innovation.com on 169.254.169.250:53: read udp 10.42.109.212:55972->169.254.169.250:53: read: connection refused
9.7.2018 17:11:39  PANIC: Failed to register this runner. Perhaps you are having network problems

This happens only sometimes on some of the newly started instances. When I start the config sidekick again, everything works fine.

My assumption is that the sidekick executes the POST request a little too early, before Rancher has fully built the network for the new instance. In other words, the scheduler starts the new container (and its sidekick) before the network is fully ready. This might be related to rancher/rancher#2621

Does anyone have an idea how to fix or work around this? We shut down spot instances and start new ones very frequently, so this is a real problem, and a manual solution (starting the failed sidekicks by hand) is not an option for me. Any help is greatly appreciated.

Further information:

Rancher Server v1.6.18
Gitlab multi runner v10.4.0
The servers are AWS t2.medium instances and run on RancherOS v1.2.0

Hi @felixschul ,

judging by the error message you attached, it seems you have an issue with the Rancher metadata infrastructure service. You are getting "connection refused" from the metadata internal DNS. Is your metadata service running correctly when you launch new gitlab-runner instances? Are you having this issue just for this service?

It could be a race condition where new gitlab-runner instances try to start on new spot instances before the Rancher metadata service is completely up and running on them. Could you please check that?

Hi @rawmind0,

Many thanks for the super-fast feedback!

I noticed the issue only for this service. But this service might well be the only one that immediately executes a request through the metadata service on start. Maybe other services simply need a few seconds longer to start (or to pull), which would explain why the error does not appear for them. The error also does not appear every time the gitlab runner sidekick is started on a new AWS instance; it appears about 30% of the time. I also assume that this is a race condition where the gitlab runner sidekick starts just a little before the metadata service is ready. I cannot see any errors in the metadata service logs, so the service works fine once it has started on the new instance.

From my perspective, the gitlab runner sidekick container should check whether the metadata service is ready and wait for it (maybe with some kind of retry loop that simply includes a "sleep").

It might also help if I could find a way to make Rancher wait a few seconds before starting this container, but I found no option for that.

My only idea is to contribute to the community item and write an entrypoint script that checks the metadata service and waits for it to be available.
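
To illustrate what I mean, here is a rough sketch of such an entrypoint (just an idea from my side: the GITLAB_HOST variable, the retry limits, and the final register command are placeholders that would have to be adapted to the actual image):

    #!/bin/sh
    # Sketch: block until the Rancher DNS (169.254.169.250, as seen in
    # the error above) can resolve the GitLab host, then register.
    # GITLAB_HOST is a placeholder; a real entrypoint would take it
    # from the catalog item's own configuration.
    GITLAB_HOST="${GITLAB_HOST:-gitlab.example.com}"
    tries=0
    until getent hosts "$GITLAB_HOST" >/dev/null 2>&1; do
      tries=$((tries + 1))
      if [ "$tries" -ge 30 ]; then
        echo "DNS for $GITLAB_HOST still failing after $tries tries" >&2
        exit 1
      fi
      echo "waiting for DNS ($tries/30)..."
      sleep 2
    done
    # Hand over to the original registration step (exact command is an
    # assumption, based on gitlab-runner v10.x).
    exec gitlab-runner register "$@"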

Any other ideas?

Thanks a lot!

Felix

How are you deploying new gitlab-runner instances? Are you using the rancher CLI in a pipeline to do it?

An option could be to use the rancher CLI to wait until network-services is healthy again, once you deploy new spot instances:

  • Get the network-services <STACK_ID>: rancher stack -s --format '{{.ID}} {{.Stack.Name}}'
  • Wait until <STACK_ID> is healthy before deploying: rancher --wait-state healthy wait <STACK_ID>
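
Put together, something like this in a script (just a sketch based on the two commands above; the awk filter on the stack name is an assumption you would need to adapt to your setup):

    # Look up the ID of the network-services system stack and wait for
    # it to become healthy. Assumes the rancher CLI is configured on
    # this host; the stack name filter is an assumption.
    STACK_ID=$(rancher stack -s --format '{{.ID}} {{.Stack.Name}}' \
      | awk '$2 == "network-services" {print $1}')
    rancher --wait-state healthy wait "$STACK_ID"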

Might that work for you? :)

Hi @rawmind0,

I am using the "gitlab-ci-multi-runner" community catalog item, which is set to "Always run one instance of this container on every host". So when a new host is connected to Rancher, the scheduler automatically starts the container and its sidekicks on the new host. So no, I am not using the rancher CLI for this (or maybe I got your question wrong). Just to be sure, I will outline the current process:

  • The AWS Auto Scaling Group starts a new instance/host.
  • A user script launches the registration with Rancher after the instance has started.
  • The host then appears in the Rancher hosts list.
  • The scheduler then starts the new containers (system services as well as my images).
  • Because of the setting "Always run one instance of this container on every host", the scheduler starts a gitlab runner container.
  • It starts along with its sidekicks.
  • The sidekick "config" launches the "register" method, which makes an API call to the GitLab server to register the new runner.
  • This call sometimes fails because it cannot connect to the GitLab host (I double-checked and the host is 100% available), probably because the network is not ready yet.

I do not think your suggestion above can make a difference: before I register the new spot instance with Rancher, everything is fine. As soon as I register the new instance with Rancher, the scheduler starts all services on this new host at the same time. I do not know how to tell the scheduler to change the order or to delay one of the services.

Best

Felix

Hi @felixschul ,

I didn't fully understand how you are doing it before, but I see your point now.

The best solution would be for the gitlab-ci-multi-runner sidekick to take care of DNS resolution itself. More than happy if you could contribute that.

Anyway, along the lines of that proposal: if you are using the "gitlab-ci-multi-runner" community catalog item, you could set a host label for running gitlab-ci-multi-runner instances, so that it always runs one instance of this service on every host with that label. You could then add a step to your user script deployment that checks that network-services is healthy on the host before putting the label, as sketched below.
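
For example (just a sketch; the gitlab-runner=true label, the environment variables, and the v2-beta API path are assumptions to adapt to your environment):

    # In the user script, after registering the host: wait for
    # network-services as above, then add the scheduling label so the
    # runner service gets started on this host.
    STACK_ID=$(rancher stack -s --format '{{.ID}} {{.Stack.Name}}' \
      | awk '$2 == "network-services" {print $1}')
    rancher --wait-state healthy wait "$STACK_ID"
    # Add the host label through the Rancher API (endpoint, project
    # and host IDs are placeholders).
    curl -su "$CATTLE_ACCESS_KEY:$CATTLE_SECRET_KEY" \
      -X PUT -H 'Content-Type: application/json' \
      -d '{"labels": {"gitlab-runner": "true"}}' \
      "$RANCHER_URL/v2-beta/projects/$PROJECT_ID/hosts/$HOST_ID"

On the service side, the runner would then keep io.rancher.scheduler.global: 'true' and additionally get an io.rancher.scheduler.affinity:host_label: gitlab-runner=true label, so it only runs on hosts carrying the label.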

Hi @rawmind0,

Sorry for my late reply. I think setting a label after checking network services is a good idea. However, I think it would be cleaner to make the gitlab runner sidekick wait for the services, and I will check whether I can contribute that. Thanks for your support. I suggest leaving this issue open, as I believe this really is a problem with the "gitlab-multi-runner" catalog item, at least under certain circumstances. But feel free to close it if you judge otherwise.