documentcloud / cloud-crowd

Parallel Processing for the Rest of Us

Home Page: https://github.com/documentcloud/cloud-crowd/wiki

Remote node never finishes processing

zstumgoren opened this issue

Problem description

A remote node is unable to complete processing a DocumentImport Action as part of a standard upload through the DCloud web interface. However, the same file is processed successfully when the crowd node runs on the same machine as the crowd server.

This bug is similar (or possibly identical) to #42.

Environmental context

We've only encountered this bug in a local virtual environment used as a staging platform for our production deployments.

  • Ubuntu 12.04 instances
  • Virtualbox 4.3.30
  • Vagrant 1.7.2
  • Host machine: OS X Yosemite (10.10.4)

Debug details

After uploading a document manually through the DCloud web interface, the crowd server appears to successfully allocate work units to a lone remote node. Using debug statements (see below), we've determined that the server gets back a successful response from the POST request to the node, namely the process ID of the forked worker on the remote node.
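For reference, a quick way to confirm whether the PID in that response corresponds to a live process is a zero-signal check from an IRB session on the node itself. This is only a sketch; the pid value below is a placeholder for whatever the server logged from the POST /work response.

# Run on the remote node. `pid` is a placeholder for the process id the
# crowd server logged from the POST /work response.
pid = 12345

begin
  Process.kill(0, pid)   # signal 0 checks existence without actually signaling
  puts "worker #{pid} is still running"
rescue Errno::ESRCH
  puts "worker #{pid} no longer exists"
rescue Errno::EPERM
  puts "worker #{pid} exists but belongs to another user"
end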

However, the job on the remote node never completes -- no file artifacts are written to disk and the OpCenter reports the job as SPLITTING. This status remains indefinitely until we kill the processes and clean up the db records manually.
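For anyone who lands in the same stuck state, this is roughly what our manual cleanup looks like. It's only a sketch: it assumes the cloud-crowd ActiveRecord models (Job, WorkUnit) are loaded against the same database configuration as the server, for example from a console session, and that the stuck jobs are the ones still marked as splitting.

# Sketch of the manual db cleanup, assuming CloudCrowd::Job and
# CloudCrowd::WorkUnit are loaded with the server's database config.
stuck = CloudCrowd::Job.where(:status => CloudCrowd::SPLITTING)
stuck.each do |job|
  job.work_units.destroy_all   # remove the work units that never completed
  job.destroy                  # then remove the job record itself
end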

As part of the debugging process, we've verified repeatedly that the server can reach the node and the node can reach the server, both with telnet and ping and by the fact that the remote node is able to check in initially and receives its work units.
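The node-facing half of that check is easy to script against the /heartbeat route shown further down. This is a minimal sketch; the host and port are placeholders for wherever the node is actually listening.

# Reachability probe from the crowd server to the node's /heartbeat route.
# Host and port below are placeholders for the node's actual address.
require 'net/http'

node_host = '192.168.33.11'
node_port = 9063

response = Net::HTTP.get_response(node_host, '/heartbeat', node_port)
puts "#{response.code}: #{response.body}"   # a healthy node answers 200 / "buh-bump"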

We've been able to get the crowd server to log extra details by sprinkling some print statements into NodeRecord.send_work_unit. We can also hit (and log from) the crowd node's /heartbeat endpoint (see below). However, we're unable to log any information from the forked Worker processes (again, see below).

Is it possible that the forked process is somehow dying immediately after being spawned??? It's the only thing we can think of that would explain this issue and, most significantly, the fact that no logging is performed by the forked worker even though its PID is returned to the server.

Having said all that, it's also quite possible that we're not correctly debugging the forked process (fwiw, we have been restarting the crowd node and server processes when we manually update the source files with debug statements).
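One technique we haven't fully exhausted, in case it's useful as a sanity check: temporarily wrapping the forked block in node.rb so the child reports its own crash to a file before the Worker ever gets going. This is only a sketch of the idea, and the /tmp log path is an arbitrary choice on our part.

# Temporary replacement for the fork in the node's POST /work handler, to see
# whether the child dies before Worker#run produces any output.
pid = fork do
  begin
    File.open('/tmp/crowd_worker_debug.log', 'a') { |f| f.puts "[#{Process.pid}] forked at #{Time.now}" }
    Worker.new(self, unit).run
  rescue Exception => e
    File.open('/tmp/crowd_worker_debug.log', 'a') do |f|
      f.puts "[#{Process.pid}] died: #{e.class}: #{e.message}"
      f.puts e.backtrace.join("\n")
    end
    raise
  end
end
Process.detach(pid)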

At this point, we're at a bit of a loss and could really use some guidance -- if nothing else, a sanity check on our debugging strategy and possible alternative techniques for isolating the issue.

Any advice/guidance is appreciated.

Thanks!

Code debug

We dropped logging statements into numerous sections of code in an effort to isolate the problem. The annotated snippets below show where logging worked and where it failed.

# LOGGING WORKS! SERVER REPORTS A PID; HOWEVER, THAT PID CAN'T BE FOUND ON THE REMOTE NODE, DESPITE THE FACT THAT THE JOB APPEARS TO BE HUNG...
#### cloud-crowd/lib/cloud_crowd/models/node_record.rb ####
def send_work_unit(unit)
  puts "INSIDE NodeRecord.send_work_unit!!!"
  puts "UNIT data: #{unit.attributes}"
  puts "NODE data: #{node}"
  result = node['/work'].post(:work_unit => unit.to_json)
  puts "POST request made!!!"
  puts "POST response: #{result}"
  unit.assign_to(self, JSON.parse(result.body)['pid'])
  touch && true
end


#### cloud-crowd/lib/cloud_crowd/node.rb ####
# LOGGING WORKS ON GET
get '/heartbeat' do
  puts "Inside GET /heartbeat" ## THIS WORKS AND GETS WRITTEN TO NODE.LOG
  "buh-bump"
end

## LOGGING FAILS ON POST: BELOW PRINT STATEMENT NEVER GOT LOGGED! BUT THE PID APPEARS TO BE RETURNED SUCCESSFULLY TO THE CROWD SERVER...
# Posts a WorkUnit to this Node. Forks a Worker and returns the process id.
# Returns a 503 if this Node is overloaded.
post '/work' do
  puts "Inside POST /work"
  throw :halt, [503, OVERLOADED_MESSAGE] if @overloaded
  unit = JSON.parse(params[:work_unit])
  pid = fork { Worker.new(self, unit).run }
  Process.detach(pid)
  json :pid => pid
end

#### cloud-crowd/lib/cloud_crowd/worker.rb ####
# ALL LOGGING ATTEMPTS DURING INIT AND IN RUN METHOD FAIL!!!
# A new Worker customizes itself to its WorkUnit at instantiation.
def initialize(node, unit)
  @start_time = Time.now
  @pid        = $$
  @node       = node
  @unit       = unit
  @status     = @unit['status']
  @retry_wait = RETRY_WAIT
  $0 = "#{unit['action']} (#{unit['id']}) [cloud-crowd-worker]"
  log "Worker initialized: pid: #{@pid}, node: #{@node}, unit: #{@unit}, status: #{@status}"
  puts "Worker initialized: pid: #{@pid}, node: #{@node}, unit: #{@unit}, status: #{@status}"
end

# <<< SNIP >>>
def run
  log "INSIDE Worker.run!!!!"
  `echo "INSIDE WORKER.RUN" > /tmp/apdocs_debug.txt`
  trap_signals
  log "starting #{display_work_unit}"
  if @unit['options']['benchmark']
    log("ran #{display_work_unit} in " + Benchmark.measure { run_work_unit }.to_s)
  else
    run_work_unit
  end
  Process.exit!
end

This turned out to be a bug in how our VirtualBox environment is configured, possibly involving the NAT configuration inside the guest machines. We seem to have resolved it for the time being. Sorry for the trouble!
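For anyone hitting something similar: VirtualBox guests behind the default NAT interface generally can't accept inbound connections from other guests without explicit port forwarding, so a common workaround is to give the machines a private (host-only) network for node-to-server traffic. Below is a rough Vagrantfile sketch of that kind of setup; the box name and IP addresses are placeholders, not our actual configuration.

# Vagrantfile sketch: put the crowd server and node guests on a private
# (host-only) network so they can reach each other directly instead of
# going through the NAT interface. Box name and IPs are placeholders.
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/precise64"

  config.vm.define "crowd_server" do |server|
    server.vm.network "private_network", ip: "192.168.33.10"
  end

  config.vm.define "crowd_node" do |node|
    node.vm.network "private_network", ip: "192.168.33.11"
  end
end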