documentcloud / cloud-crowd

Parallel Processing for the Rest of Us

Home Page: https://github.com/documentcloud/cloud-crowd/wiki

RESERVATION_LIMIT > 1 seems to cause deadlocks when nodes don't all have the same actions

jgeiger opened this issue

I ran into a deadlock with one node that has two actions (databaser, scheduler) and a separate node that has a single action (processor). The reservation system would lock all of the work units to the node that didn't have the 'processor' action, causing a lockup. I noticed there is code that should have pushed the unit to the end of the queue, but the reservation was never dropped, since the release sits in the ensure block.
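To make the lockup concrete, here is a minimal in-memory sketch of the situation described above. It is not CloudCrowd's code: `Unit`, `Node`, `reserve`, `runnable`, and `stalled_after_greedy_reservation?` are all hypothetical names, and the model simply assumes that whichever node reserves first keeps its units until something releases them.

```ruby
# Toy model of the reported lockup; names and behavior are assumptions,
# not CloudCrowd's actual implementation.
Unit = Struct.new(:id, :action, :reserved_by)
Node = Struct.new(:name, :actions)

# Hypothetical helper: reserve up to `limit` unreserved units for `node`.
def reserve(units, node, limit)
  units.select { |u| u.reserved_by.nil? }
       .first(limit)
       .each { |u| u.reserved_by = node.name }
end

# Units reserved for `node` that it is actually able to run.
def runnable(units, node)
  units.select { |u| u.reserved_by == node.name && node.actions.include?(u.action) }
end

# True when every pending unit is reserved but no node can run what it holds.
def stalled_after_greedy_reservation?(limit)
  units  = (1..3).map { |i| Unit.new(i, 'processor', nil) }
  front  = Node.new('front', %w[databaser scheduler])
  worker = Node.new('worker', %w[processor])

  reserve(units, front, limit)   # the node without 'processor' reserves first
  reserve(units, worker, limit)  # the capable node gets whatever is left

  runnable(units, front).empty? && runnable(units, worker).empty?
end
```

In this model, `stalled_after_greedy_reservation?(25)` is true because the first node holds all three units and can run none of them, while a limit of 1 leaves units for the second node.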

Here's a link to the change, but it might be worth explaining why you'd want to reserve more than a single work unit in the first place. There might also be a way in the code to release the reservation when the node doesn't have that action available, instead of just pushing the unit to the end of the array. This just seemed the easiest way to go.

http://github.com/jgeiger/cloud-crowd/commit/1b2f2fe55a3985b8b8da2412447d5163d6f529dd

I think this might be a better solution, since it allows more reservations.

Instead of pushing the work unit to the back of the queue when the node can't process it, just release the reservation so another node can try to pick it up. So far I haven't hit the deadlock again.
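As a rough illustration of the release approach (not the code from the linked commit), the sketch below reserves a batch for a node, hands over the units the node supports, and immediately clears the reservation on the rest. `WorkUnit`, `WorkerNode`, `distribute`, and `demo_release` are hypothetical names used only for this example.

```ruby
# Toy model of "release instead of requeue"; an assumption-laden sketch,
# not CloudCrowd's actual distribution loop.
WorkUnit   = Struct.new(:id, :action, :reservation)
WorkerNode = Struct.new(:name, :actions)

# One distribution pass for `node`: reserve up to `limit` free units,
# dispatch the ones the node supports, and release the others right away.
def distribute(units, node, limit)
  batch = units.select { |u| u.reservation.nil? }.first(limit)
  batch.each { |u| u.reservation = node.name }

  dispatched = []
  batch.each do |unit|
    if node.actions.include?(unit.action)
      dispatched << unit.id
      units.delete(unit)        # stands in for sending the unit to the node
    else
      unit.reservation = nil    # release so another node can pick it up
    end
  end
  dispatched
end

# Runs the two-node scenario from the report and returns what each node got.
def demo_release
  units  = (1..3).map { |i| WorkUnit.new(i, 'processor', nil) }
  front  = WorkerNode.new('front', %w[databaser scheduler])
  worker = WorkerNode.new('worker', %w[processor])
  [distribute(units, front, 25), distribute(units, worker, 25)]
end
```

Here `demo_release` returns `[[], [1, 2, 3]]`: the first node dispatches nothing and holds nothing afterwards, so the node with the 'processor' action can take all three units even with a limit of 25.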

http://github.com/jgeiger/cloud-crowd/commit/c3a741c1fd747d0104e7a2c588dbcde5dab6f9a4

Thanks for the patch. I've merged your changes and set the limit back to 25. This will go out as part of CloudCrowd 0.4.0.