travis-ci / worker

Worker runs your Travis CI jobs

Cancellation mechanism can break if job is requeued to same worker

sarahhodne opened this issue

I've seen the "there's already a subscription for job …" error message pop up somewhat regularly, so I decided to take a look at what's going on.

It looks like a job can be queued onto the same worker after a requeue. This is more common during slow times, since RabbitMQ seems to hand out jobs in "order": if a worker has 20 consumers, it will get 20 jobs in a row before another worker gets any. Because the cancellation "unsubscription" happens at the very end of a job run, a few seconds can pass between a requeue and the unsubscription (e.g. if the instance is shut down in between). If the requeued job lands back on the same worker in that window, the old subscription is still registered and the new one is rejected.
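To make the timing problem concrete, here is a minimal sketch in Go of a registry that allows only one cancellation subscription per job ID. This is not the worker's actual code: `singleCanceller`, `Subscribe`, `Unsubscribe`, the job ID 42, and the exact error text are all hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// singleCanceller is a hypothetical registry that allows at most one
// cancellation subscription per job ID.
type singleCanceller struct {
	mu   sync.Mutex
	subs map[uint64]chan<- struct{}
}

func newSingleCanceller() *singleCanceller {
	return &singleCanceller{subs: make(map[uint64]chan<- struct{})}
}

// Subscribe registers ch for jobID and fails if one is already registered.
func (c *singleCanceller) Subscribe(jobID uint64, ch chan<- struct{}) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.subs[jobID]; ok {
		return fmt.Errorf("there's already a subscription for job %d", jobID)
	}
	c.subs[jobID] = ch
	return nil
}

// Unsubscribe removes the subscription for jobID, if any.
func (c *singleCanceller) Unsubscribe(jobID uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.subs, jobID)
}

func main() {
	c := newSingleCanceller()

	// First run of job 42 subscribes for cancellation.
	_ = c.Subscribe(42, make(chan struct{}))

	// The job is requeued and delivered to the same worker before the
	// first run's cleanup has called Unsubscribe.
	err := c.Subscribe(42, make(chan struct{}))
	fmt.Println(err)
}
```

In this sketch the second `Subscribe` call fails only because `Unsubscribe` has not run yet, which is the window described above.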

I think one solution would be to change the canceller to allow multiple cancellation subscriptions for the same job ID. Another option is to make sure the canceller is "unsubscribed" before a job is requeued, but that touches many more code paths that we would need to check, and it would be easy to miss one when editing the code later. That feels more fragile to me, which is why I think allowing multiple cancellation subscriptions would be better.
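One possible shape for the first option, sketched in Go under assumptions: each subscription gets its own token, so a stale subscription and a fresh one for the same job ID can coexist, and unsubscribing removes only the caller's own entry. The type `multiCanceller` and its methods are hypothetical and do not reflect the worker's real canceller API.

```go
package main

import (
	"fmt"
	"sync"
)

// multiCanceller is a hypothetical registry that allows any number of
// cancellation subscriptions per job ID.
type multiCanceller struct {
	mu     sync.Mutex
	nextID uint64
	subs   map[uint64]map[uint64]chan struct{} // job ID -> token -> channel
}

func newMultiCanceller() *multiCanceller {
	return &multiCanceller{subs: make(map[uint64]map[uint64]chan struct{})}
}

// Subscribe returns a channel that is closed when jobID is cancelled, and a
// function that removes only this subscription.
func (c *multiCanceller) Subscribe(jobID uint64) (<-chan struct{}, func()) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nextID++
	token := c.nextID
	ch := make(chan struct{})
	if c.subs[jobID] == nil {
		c.subs[jobID] = make(map[uint64]chan struct{})
	}
	c.subs[jobID][token] = ch
	return ch, func() {
		c.mu.Lock()
		defer c.mu.Unlock()
		delete(c.subs[jobID], token) // no-op if already removed
		if len(c.subs[jobID]) == 0 {
			delete(c.subs, jobID)
		}
	}
}

// Cancel closes every channel currently subscribed to jobID and returns
// how many subscriptions were notified.
func (c *multiCanceller) Cancel(jobID uint64) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	n := len(c.subs[jobID])
	for _, ch := range c.subs[jobID] {
		close(ch)
	}
	delete(c.subs, jobID)
	return n
}

func main() {
	c := newMultiCanceller()

	// The first run and the requeued run subscribe without conflict.
	_, unsubStale := c.Subscribe(42)
	fresh, unsubFresh := c.Subscribe(42)
	defer unsubFresh()

	// The first run's late cleanup removes only its own subscription.
	unsubStale()

	fmt.Println("notified:", c.Cancel(42))
	<-fresh // returns immediately because the channel was closed
}
```

With this design the late cleanup from the earlier run cannot remove the requeued run's subscription, so no code path has to unsubscribe before requeueing.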