documentcloud / cloud-crowd

Parallel Processing for the Rest of Us

Home Page: https://github.com/documentcloud/cloud-crowd/wiki


"Unexpected error while processing request: execution expired" and " Unexpected error while processing request: getaddrinfo: Name or service not known" during word_count_example.rb and other jobs

jbfink opened this issue

Hey folks,

I've got a small cluster of two OS X machines and one Linux box, with another Linux box as the controller, all running CloudCrowd 0.5.0. About half the time when I try to start jobs -- even simple ones like the Shakespeare word count -- the controller box crashes with errors like:

!! Unexpected error while processing request: execution expired
!! Unexpected error while processing request: getaddrinfo: Name or service not known

I stop the controller, rerun crowd load_schema*, and start the controller again -- sometimes this works, sometimes it doesn't. As far as I can tell there's no lingering thin or crowd process running on the controller, so I'm not sure where the problem is coming from.

*Note that I have a MySQL database instance, but I have been using the crowd load_schema command to effect a reset of sorts -- if this is wrong behaviour, please let me know.

I should also add that I occasionally (though not always, frustratingly enough) get the "/var/lib/gems/1.8/gems/rest-client-1.5.1/lib/restclient/request.rb:145:in `transmit': RestClient::ServerBrokeConnection" error too.

Sorry I didn't see this ticket until now ... That looks like a connectivity problem, no? Are you running the jobs over wifi or some sort of VPN?

Also, considering that you asked two weeks ago, did you ever get this sorted out?

Nope, not wifi and not VPN. And no, I didn't get it sorted out either. I did find a gist where someone had the same problem, and it might be a Rack issue?

Although, interestingly enough, we do have a crappy network topology that sometimes stalls on transfers of very large files. It never stays stalled, but it does make transferring things over rsync/scp very annoying. Perhaps there's something I can do about cloud-crowd's tolerance of flaky links? Increase a timeout or something?

I'm not sure -- we use the RestClient gem to do internal communication between the server and the nodes. Perhaps there's a patch that can be made there -- you can try setting the "open_timeout" option, and see if it helps your issue. I think that the first step would be to reliably reproduce the problem...
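
For what it's worth, here's roughly what that looks like against rest-client 1.5.x (the version in your backtrace). The URL, port, and numbers below are placeholders I made up, not anything cloud-crowd configures for you:

```ruby
require 'rest_client'

# Sketch only: every request made through this resource inherits the
# timeouts. The URL/port are hypothetical stand-ins for your controller.
resource = RestClient::Resource.new(
  'http://controller.example.com:9173/',
  :open_timeout => 15,   # seconds to wait for the TCP connection to open
  :timeout      => 120   # seconds to wait for a response once connected
)

puts resource['jobs'].get
```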

Is the open_timeout option in RestClient or somewhere in a cloud-crowd config?

It's in RestClient; check out the docs:

http://rdoc.info/rdoc/archiloque/rest-client/blob/6079fb070dc8b7a645dbd806e696c057afab1f5d/RestClient/Resource.html

You'd patch your install of CloudCrowd to set it.
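
Since cloud-crowd itself doesn't expose a timeout setting, the bluntest patch is to give every RestClient::Resource a default timeout. This is a sketch under the assumption that you're on rest-client 1.5.x, where Resource#initialize takes (url, options, ...); the numbers are arbitrary:

```ruby
require 'rest_client'

# Monkey-patch: merge default timeouts into every RestClient::Resource.
# Load this before CloudCrowd boots so internal server<->node calls pick
# it up. Sketch only -- verify against your installed rest-client version.
module RestClient
  class Resource
    alias_method :initialize_without_default_timeouts, :initialize

    def initialize(url, options = {}, backwards_compatibility = nil, &block)
      # Only merge when options is a hash; the old API allowed a username
      # string in this position instead.
      if options.is_a?(Hash)
        options = { :open_timeout => 15, :timeout => 120 }.merge(options)
      end
      initialize_without_default_timeouts(url, options, backwards_compatibility, &block)
    end
  end
end
```

If the "execution expired" errors just turn into longer waits, that at least confirms the flaky link is the layer to fix.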