documentcloud / cloud-crowd

Parallel Processing for the Rest of Us

Home Page: https://github.com/documentcloud/cloud-crowd/wiki

Node does not create worker

thevman opened this issue

Hi,

I've run into what I believe is a unique problem...

I have a cloud-crowd server running on one machine and a node running on another machine. When I start up the node, it is recognized but it does not spawn off any workers even when there is a job...

The node is recognized: it shows up or disappears on the server depending on whether I start or stop it on the second machine.

To clarify: if the node on the server machine itself is up, documents get processed without an issue, but when that node is down, the external node does no work.

Please help...we are very close to production and this problem has cropped up at the last minute...

Hey @thevman,

You can tail the server console, and if you see a notification of an unexpected error, that usually indicates that the node has connected to the server but misidentified its own address. As a consequence, the server can't actually reach the node to fire up any workers.

If you do hit a case like that, make sure that the node is reachable at the hostname it's providing to the server, and make sure to clear out your NodeRecords before trying to connect a new node.
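
For example, clearing out the NodeRecords from the crowd console is just a plain ActiveRecord call. A minimal sketch (nodes simply re-register the next time they check in, so nothing else needs to change):

# run this inside the crowd console, e.g. crowd -c <your config dir> -e <your env> console
NodeRecord.delete_all   # removes every registered node record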

How do I tail the server console?

Hi @knowtheory,

Here's what I think my problem is:

Console:

crowd -c config/cloud_crowd/production -e production console
irb(main):001:0> NodeRecord.all
=> [#<CloudCrowd::NodeRecord id: 25, host: "domU-12-31-39-07-80-27:9063", ip_address: "10.209.131.213", port: 9063, enabled_actions: "process_pdfs,word_count,graphics_magick", busy: false, tag: "", max_workers: 5, created_at: "2012-10-31 10:47:09", updated_at: "2012-10-31 10:47:09">]

Telnet to 10.209.131.213 on port 9063 works.

Telnet to domU-12-31-39-07-80-27 on port 9063 fails...

Telnet to domU-12-31-39-07-80-27.compute-1.internal on port 9063 is successful.

Also, running "hostname" on the external cloud crowd server gives me "domU-12-31-39-07-80-27".

So the external cloud-crowd node is registering as "domU-12-31-39-07-80-27:9063", and I think that's the URL the server is sending the 'work' to.

To summarize: the server should be sending the work unit to the node's FQDN, not its short hostname.
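
As a sanity check on that theory, plain Ruby's stdlib shows the same split between the short hostname and the canonical name. This is just an illustration (it needs Ruby 1.9.2+ for Addrinfo and says nothing about how cloud-crowd itself looks the name up):

require 'socket'

short = Socket.gethostname     # "domU-12-31-39-07-80-27" on this box
info  = Addrinfo.getaddrinfo(short, nil, nil, :STREAM, nil,
                             Socket::AI_CANONNAME).first
puts short            # the short name the node registers with
puts info.canonname   # "domU-12-31-39-07-80-27.compute-1.internal"
puts info.ip_address  # "10.209.131.213"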

Thoughts?

UPDATE: I tried changing the hostname with the following command

hostname $(hostname -f)

before I start up my node. It registers with the full FQDN now, but still no go.

Right after I upload a document, it shows up in S3 and the console shows the following:

irb(main):005:0> NodeRecord.last
=> #<CloudCrowd::NodeRecord id: 26, host: "domU-12-31-39-07-80-27.compute-1.internal:9063", ip_address: "10.209.131.213", port: 9063, enabled_actions: "graphics_magick,process_pdfs,word_count", busy: false, tag: "", max_workers: 5, created_at: "2012-10-31 12:03:23", updated_at: "2012-10-31 12:03:23">
irb(main):007:0> WorkUnit.all
=> [#<CloudCrowd::WorkUnit id: 180, status: 4, job_id: 58, input: "107", action: "document_import", attempts: 0, node_record_id: nil, worker_pid: nil, reservation: nil, time: nil, output: nil, created_at: "2012-10-31 12:03:55", updated_at: "2012-10-31 12:03:55">]

The node exists and a WorkUnit is created but is not being assigned to the only node that exists on the system...

Have you gotten this problem solved? I'm experiencing the same issue.

So the other thing you guys can try is to actually make an HTTP request from the server machine to your node machines.

The server and nodes all communicate over HTTP, and nodes respond to a heartbeat request. For example:

ubuntu@ip-100-166-235-15:~/documentcloud$ curl ip-100-215-23-1:9063/heartbeat; echo buh-bump

Work distribution happens when new nodes check in. The handshake goes like this: a Node comes up and makes a request to the Server, identifying itself and how to reach it. The Server creates a NodeRecord for the Node and then tries to send work to it. If the Node does not respond, the NodeRecord for that node is deleted after a minute or two.
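
If it's easier to script the check, the same heartbeat request can be made from Ruby. This is only a sketch using the stdlib, not the server's actual code; the 'buh-bump' body is simply what the curl output above shows:

require 'net/http'

# True if the node answers the heartbeat at the host/port it registered with.
def node_alive?(host, port)
  res = Net::HTTP.get_response(host, '/heartbeat', port)
  res.is_a?(Net::HTTPSuccess) && res.body == 'buh-bump'
rescue SocketError, Errno::ECONNREFUSED, Errno::ETIMEDOUT
  false   # unresolvable or unreachable host
end

puts node_alive?('domU-12-31-39-07-80-27', 9063)                    # false if the short name doesn't resolve
puts node_alive?('domU-12-31-39-07-80-27.compute-1.internal', 9063) # true when the node is up and reachable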

Additionally, zombie processes on a node can block it from starting up properly. There's a rake task, rake crowd:node:cull, that we wrote to help with that. On your node, run rake crowd:node:stop crowd:node:cull, and that should kill all the lingering crowd processes. From there you should be working with a clean slate to start up rake crowd:node:start (and you should include the appropriate environment when you start up your node, e.g. rake staging crowd:node:start).

Hope that helps.

No change in status. Please reopen.

I brought up the external cloud-crowd node and changed its hostname to the IP.

On the external cloud-crowd node:

ubuntu@10:~/documentcloud$ hostname -f
10.0.2.47

Then I ran:

cd ~/documentcloud
rake production crowd:node:stop
rake production crowd:node:cull

Got the output:

ubuntu@10:~/documentcloud$ rake production crowd:node:cull
kill: No such process
rake aborted!
SIGTERM

Tasks: TOP => crowd:node:cull
(See full trace by running task with --trace)

Then ran:

rake production crowd:node:start

On the DocumentCloud Server:

rake production crowd:console

Gives:

irb(main):002:0> NodeRecord.all
=> [#<CloudCrowd::NodeRecord id: 43, host: "ip-10-0-2-57:9063", ip_address: "10.0.2.57", port: 9063, enabled_actions: "large_document_import,reindex_document,vacuum_analy...", busy: false, tag: "", max_workers: 5, created_at: "2013-07-18 14:27:14", updated_at: "2013-07-18 14:28:25">, #<CloudCrowd::NodeRecord id: 44, host: "10.0.2.47:9063", ip_address: "10.0.2.47", port: 9063, enabled_actions: "document_import,word_count,redact_pages,document_re...", busy: false, tag: "", max_workers: 5, created_at: "2013-07-19 09:43:51", updated_at: "2013-07-19 09:43:51">]
irb(main):003:0> exit

10.0.2.47 is the new node.

curl 10.0.2.47:9063/heartbeat; echo buh-bump

gives an output of "buh-bumpbuh-bump"

When I upload a document with only the node on the documentcloud server active, I get the following output in production.log:


Correct

Processing ImportController#upload_document (for 206.177.43.77 at 2013-07-18 14:28:00) [POST]
Parameters: {"file"=>#File:/tmp/RackMultipart20130718-14158-11xff9y-0, "description"=>"", "title"=>"8014 Oitc Brochure Fre v3 Final", "email_me"=>"1", "access"=>"private", "action"=>"upload_document", "language"=>"eng", "authenticity_token"=>"some value", "make_public"=>"true", "multi_file_upload"=>"true", "controller"=>"import", "source"=>""}
New RightAws::S3Interface using shared connections mode
New RightAws::S3Interface using shared connections mode
Completed in 471ms (View: 14, DB: 46) | 200 OK [https://stage.docs.ontariogovernment.ca/import/upload_document]

Processing DocumentsController#status to json (for 206.177.43.77 at 2013-07-18 14:28:11) [GET]
Parameters: {"format"=>"json", "action"=>"status", "ids"=>["194"], "controller"=>"documents"}
New RightAws::S3Interface using shared connections mode
Completed in 304ms (View: 201, DB: 20) | 200 OK [https://stage.docs.ontariogovernment.ca/documents/status.json?ids%5B%5D=194]

Processing DocumentsController#status to json (for 206.177.43.77 at 2013-07-18 14:28:21) [GET]
Parameters: {"format"=>"json", "action"=>"status", "ids"=>["194"], "controller"=>"documents"}
New RightAws::S3Interface using shared connections mode
Completed in 200ms (View: 39, DB: 22) | 200 OK [https://stage.docs.ontariogovernment.ca/documents/status.json?ids%5B%5D=194]

Processing ImportController#cloud_crowd (for 127.0.0.1 at 2013-07-18 14:28:25) [POST]
Parameters: {"action"=>"cloud_crowd", "job"=>"{"percent_complete":100,"color":"029331","time_taken":24.890727,"status":"succeeded","outputs":[194],"id":82,"work_units":0}", "controller"=>"import"}
Completed in 22ms (View: 1, DB: 11) | 201 Created [http://stage.docs.ontariogovernment.ca/import/cloud_crowd]

However, when I switch on the second node or use only the external node, I get:


Error

Processing ImportController#upload_document (for 206.177.43.77 at 2013-07-18 14:25:41) [POST]
Parameters: {"file"=>#File:/tmp/RackMultipart20130718-14158-1wb64x8-0, "description"=>"", "title"=>"8014 Oitc Brochure Fre v3 Final", "email_me"=>"1", "access"=>"private", "action"=>"upload_document", "language"=>"eng", "authenticity_token"=>"some value", "make_public"=>"true", "multi_file_upload"=>"true", "controller"=>"import", "source"=>""}
New RightAws::S3Interface using shared connections mode
New RightAws::S3Interface using shared connections mode
Completed in 535ms (View: 11, DB: 45) | 200 OK [https://stage.docs.ontariogovernment.ca/import/upload_document]

Processing ImportController#cloud_crowd (for 127.0.0.1 at 2013-07-18 14:25:42) [POST]
Parameters: {"action"=>"cloud_crowd", "job"=>"{"percent_complete":100,"color":"a11367","time_taken":1.107833,"status":"failed","outputs":["Couldn't find Document with ID=193"],"id":81,"work_units":0}", "controller"=>"import"}
Document import failed: {"id"=>81, "color"=>"a11367", "work_units"=>0, "outputs"=>["Couldn't find Document with ID=193"], "percent_complete"=>100, "status"=>"failed", "time_taken"=>1.107833}
Completed in 36ms (View: 1, DB: 8) | 201 Created [http://stage.docs.ontariogovernment.ca/import/cloud_crowd]

Hope this helps.

Mine appears to be an issue with AWS and its public and private DNS names. I've opened up the ports in the security group, so it's not that, but here's the issue. When I run a heartbeat against the DNS name that shows up in the Operations Center:

curl domU-12-31-39-14-F6-AD:9063/heartbeat; echo buh-bump

It returns:

curl: (6) Couldn't resolve host 'domU-12-31-39-14-F6-AD'
buh-bump

The problem is that the private DNS name is really domU-12-31-39-14-F6-AD.compute-1.internal.

So when I run:

curl domU-12-31-39-14-F6-AD.compute-1.internal:9063/heartbeat; echo buh-bump

I get the appropriate:

buh-bumpbuh-bump
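
For what it's worth, plain Ruby's resolver (nothing cloud-crowd specific) confirms the mismatch:

require 'resolv'

['domU-12-31-39-14-F6-AD',
 'domU-12-31-39-14-F6-AD.compute-1.internal'].each do |name|
  begin
    puts "#{name} -> #{Resolv.getaddress(name)}"
  rescue Resolv::ResolvError
    puts "#{name} -> does not resolve"
  end
end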

Is there a specific way to make this work within AWS?

@knotio DocumentCloud is running on AWS, on Ubuntu images. Nodes identify themselves based on what hostname reports, so make sure that's an address at which the server can reach the node.

The node's hostname registers as domU-12-31-39-14-F6-AD:9063, which is what is listed in the Operations Center, but I can't get a heartbeat on it; the name comes back unresolved.

I have to check heartbeat on domU-12-31-39-14-F6-AD.compute-1.internal:9063 to get a buh-bumpbuh-bump back.

I'm also running Ubuntu on AWS.

I created a new node instance (same type, same AMI) that gave me an ip-xx-xxx-xx-xxx hostname instead of a domU-xx-xxx-xx-xxx hostname, and the node works perfectly.

Any info as to why Cloud-Crowd won't work with a domU- hostname, and how I can spin up an ip- one every time?

Now I'm getting the zombie processes that you discussed above. I ran rake crowd:node:cull and it gives me:

rake aborted! No Rakefile found (looking for: rakefile, Rakefile, rakefile.rb, Rakefile.rb)

Hi @knotio. I was getting the same problem, and this is what I ended up doing:

  1. Stopping all the cloud-crowd processes (nodes and servers). This can be achieved with either rake crowd:node:cull, or crowd node stop and crowd server stop.
  2. Deleting all the rows of the jobs, node_records and work_units tables in the cloud-crowd database (see the console sketch below).
  3. Starting the server and node again by running rake crowd:server:start and rake crowd:node:start, or crowd server start and crowd node start.

Not sure if this works for you too, but you would essentially have to redo the tasks/jobs that were still being processed when the problem occurred. Perhaps @knowtheory would know how to have the tasks pick up from where they left off.
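
For reference, step 2 from the crowd console is just a handful of ActiveRecord deletes. A minimal sketch (it assumes the usual Job, NodeRecord and WorkUnit models are loaded in the console, and it wipes every queued and in-flight job, so only run it if you're prepared to resubmit them):

# inside the crowd console (e.g. rake production crowd:console, as used above)
[Job, NodeRecord, WorkUnit].each(&:delete_all)   # empty the jobs, node_records and work_units tables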

Cheers,
David.