documentcloud / cloud-crowd

Parallel Processing for the Rest of Us

Home Page: https://github.com/documentcloud/cloud-crowd/wiki


"Unexpected error while processing request: execution expired" and " Unexpected error while processing request: getaddrinfo: Name or service not known" during word_count_example.rb and other jobs

jbfink opened this issue

Hey folks,

I've got a small cluster of two OS X machines and one Linux box, with another Linux box as the controller, all running CloudCrowd 0.5.0. About half the time when I try to start jobs -- even simple ones like the Shakespeare word count -- the controller box crashes with errors like:

!! Unexpected error while processing request: execution expired
!! Unexpected error while processing request: getaddrinfo: Name or service not known

I stop the controller, rerun crowd load_schema*, and start the controller again -- sometimes this works, sometimes it doesn't. As far as I can tell there's no lingering thin or crowd process running on the controller, so I'm not sure where the problem is coming from.

*Note that I have a MySQL database instance, but I have been using the crowd load_schema command to effect a reset of sorts -- if this is wrong behaviour, please let me know.

I should also add that I occasionally (though not always, frustratingly enough) get the "/var/lib/gems/1.8/gems/rest-client-1.5.1/lib/restclient/request.rb:145:in `transmit': RestClient::ServerBrokeConnection" error too.

Sorry I didn't see this ticket until now ... That looks like a connectivity problem, no? Are you running the jobs over wifi or some sort of VPN?

Also, considering that you asked two weeks ago, did you ever get this sorted out?

Nope, not wifi and not VPN. And no, I didn't get it sorted out either. I did find a gist where someone had the same problem, and it might be a Rack issue?

Although, interestingly enough, we do have a crappy network topology that sometimes stalls on transfers of very large files. It never stays stalled, but it does make transferring things over rsync/scp very annoying. Perhaps there's something I can do about cloud-crowd's tolerance of flaky links? Increase a timeout or something?

I'm not sure -- we use the RestClient gem to do internal communication between the server and the nodes. Perhaps there's a patch that can be made there -- you can try setting the "open_timeout" option, and see if it helps your issue. I think that the first step would be to reliably reproduce the problem...
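
For what it's worth, here's roughly what that looks like against rest-client 1.5.x (the version in your backtrace). The URL, port, and numbers below are placeholders I made up, not anything cloud-crowd configures for you:

```ruby
require 'rest_client'

# Sketch only: every request made through this resource inherits the
# timeouts. The URL/port are hypothetical stand-ins for your controller.
resource = RestClient::Resource.new(
  'http://controller.example.com:9173/',
  :open_timeout => 15,   # seconds to wait for the TCP connection to open
  :timeout      => 120   # seconds to wait for a response once connected
)

puts resource['jobs'].get
```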

Is the open_timeout option in RestClient or somewhere in a cloud-crowd config?

It's in RestClient; check out the docs:

http://rdoc.info/rdoc/archiloque/rest-client/blob/6079fb070dc8b7a645dbd806e696c057afab1f5d/RestClient/Resource.html

You'd patch your install of CloudCrowd to set it.
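
Since cloud-crowd itself doesn't expose a timeout setting, the bluntest patch is to give every RestClient::Resource a default timeout. This is a sketch under the assumption that you're on rest-client 1.5.x, where Resource#initialize takes (url, options, ...); the numbers are arbitrary:

```ruby
require 'rest_client'

# Monkey-patch: merge default timeouts into every RestClient::Resource.
# Load this before CloudCrowd boots so internal server<->node calls pick
# it up. Sketch only -- verify against your installed rest-client version.
module RestClient
  class Resource
    alias_method :initialize_without_default_timeouts, :initialize

    def initialize(url, options = {}, backwards_compatibility = nil, &block)
      # Only merge when options is a hash; the old API allowed a username
      # string in this position instead.
      if options.is_a?(Hash)
        options = { :open_timeout => 15, :timeout => 120 }.merge(options)
      end
      initialize_without_default_timeouts(url, options, backwards_compatibility, &block)
    end
  end
end
```

If the "execution expired" errors just turn into longer waits, that at least confirms the flaky link is the layer to fix.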