documentcloud / cloud-crowd

Parallel Processing for the Rest of Us

Home Page: https://github.com/documentcloud/cloud-crowd/wiki

Big Set Blows Up inputs

forgotpw1 opened this issue

I have hit a problem with a big set of inputs (500+). I am building a file compression box: an action that zips a set of inputs.

When I try to merge the inputs, it appears that upstream, in the process phase, the inputs have "blown up." Process just saves each file to S3 and should return the new path into the inputs Array.

Specifically, in merge the input should be an Array, but instead it is coming through as a String.

The error message looks like this:

Worker #18890: {:pid=>18890, :id=>370, :time=>0.007631538, :status=>"failed", :output=>"{\"output\":\"undefined method `each' for #<String:0x00000002995078>\"}"}
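That failure is exactly what you get when each is called on a String: the data that should have been decoded into an Array is still (or has become) plain text. Here is a minimal sketch of the symptom and a possible guard, assuming the String is simply the un-decoded JSON serialization of the inputs (the URL below is made up):

require 'json'

# Hypothetical guard: if merge is handed the raw JSON string, decode it
# before iterating; an already-decoded Array passes straight through.
def normalize_inputs(input)
  input.is_a?(String) ? JSON.parse(input) : input
end

urls = normalize_inputs('["https://example-bucket.s3.amazonaws.com/page1.tif"]')
urls.each { |u| puts u }    # no more "undefined method `each' for String"

If JSON.parse raises here instead, the String isn't valid JSON at all, which would point at the data being cut off or mangled before merge rather than just left un-decoded.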

Anyone else ever hit this?

I thought this could be due to a text field in the database filling up with too many characters. I switched it to a longtext field, but that didn't do the trick.
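For reference, a column change along those lines would look roughly like this. This is a sketch only: it assumes MySQL and ActiveRecord, and the migration, table, and column names are illustrative guesses, not cloud-crowd's confirmed schema.

# Illustrative migration only; table/column names are assumptions.
class WidenWorkUnitInput < ActiveRecord::Migration
  def self.up
    # MySQL TEXT caps at 65,535 bytes; a limit of 4 GB maps to LONGTEXT.
    change_column :work_units, :input, :text, :limit => 4294967295
  end

  def self.down
    change_column :work_units, :input, :text
  end
end

One thing worth double-checking is which column actually holds the data merge reads; if the job's collected outputs live in a different text field than the one that was widened, the truncation would survive the change.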

Is there some other memory issue with filling an array?

Here's my action. It is erroring on the block with input.each:

require 'zip/zip'
require 'zip/zipfilesystem'
require 'fileutils'
require 'rest-client' 
require 'json'
class ScanZipper < CloudCrowd::Action

  # Process phase: save each downloaded input to the asset store (S3)
  # and return its new URL. Those return values should be collected into
  # the Array that merge receives.
  def process
    save(file_name)
  end

  # Merge phase: pull every saved file back down and archive them into
  # a single zip. This is where it fails, on the input.each block.
  def merge
    puts input.class
    puts input
    name    = options['last_name']
    date    = Time.now.strftime("%Y%m%d")
    url     = options['point']
    scan_id = options['scan_id']
    files_to_remove = []

    zip_file_name = "#{name}#{date}.zip"
    Zip::ZipFile.open(zip_file_name, Zip::ZipFile::CREATE) do |zip|
      input.each do |batch_url|               # <-- blows up: input is a String here
        batch_path = File.basename(batch_url)
        file = download(batch_url, batch_path)
        puts batch_path
        zip.add batch_path, file
        files_to_remove << file
      end
    end

    zip_path = save(zip_file_name)

    files_to_remove.each { |f| File.delete f }

    zip_path
  end

end
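Until the real cause is found, one workaround sketch is to make the top of merge tolerate either form of input (again assuming the String is just the un-decoded JSON of the URL list):

# Workaround sketch only: accept either the decoded Array or its raw JSON form.
def merge
  batch_urls = input.is_a?(String) ? JSON.parse(input) : input
  unless batch_urls.is_a?(Array)
    raise "merge expected an Array of URLs, got #{batch_urls.class}"
  end
  # ...then run the existing zip/download logic over batch_urls instead of input.
  batch_urls.each { |batch_url| puts batch_url }
end

This only papers over the symptom, of course; if the underlying String is truncated, the JSON.parse will still fail, and the real fix has to happen wherever the inputs are serialized or stored.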

I'm still experiencing this. My guess is that some limit is being hit. I was able to avoid it temporarily by not using S3 authentication, but with a big set (700+ inputs) it exploded again. I think all the characters in the keys and signatures are giving the input object more than it can handle.
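That theory is easy to put a number on: query-string-authenticated S3 URLs carry AWSAccessKeyId, Expires, and Signature parameters and are several times longer than plain ones, so the JSON-encoded array grows quickly. A rough, illustrative check (the URL and counts here are made up):

require 'json'

# Illustrative only: a signed S3 URL easily runs 150-250 bytes.
signed = "https://bucket.s3.amazonaws.com/scans/page-0001.tif" \
         "?AWSAccessKeyId=AKIAXXXXXXXXXXXXXXXX&Expires=1351000000" \
         "&Signature=aBcDeFgHiJkLmNoPqRsTuVwXyZ0%3D"
payload = JSON.generate(Array.new(700) { |i| signed.sub("0001", "%04d" % i) })
puts payload.bytesize    # compare against 65,535, the cap of a MySQL TEXT column

If that number lands anywhere near the column limit, truncated JSON in the database would explain both why small sets work and why dropping the auth query string pushed the failure point out to a larger set.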

This is pretty much a showstopper and makes it really hard to use this in production.

I'm optimistic though that someone out there knows what's going on here.

What is the overall lifecycle of input?

Is input really a JSON object?

Is there a memory limit in Thin or in Ruby for JSON objects? Could another JSON library solve this?
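Rather than guessing at Thin or Ruby limits, it might be more telling to log exactly what merge receives. A diagnostic sketch (json is already required by the action above; the puts calls are just for a one-off run):

# Diagnostic sketch: distinguish "never decoded" from "truncated in storage/transit".
def merge
  if input.is_a?(String)
    puts "merge got a String of #{input.bytesize} bytes"
    begin
      parsed = JSON.parse(input)
      puts "...which is valid JSON holding #{parsed.length} entries"
    rescue JSON::ParserError => e
      puts "...which is NOT valid JSON (likely truncated): #{e.message}"
    end
  else
    puts "merge got a #{input.class} with #{input.length} entries"
  end
end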

Stuff works great when the input size is small, but unfortunately large sets aren't working, and it's really hard to pin down what makes this happen. It's as though the server instance can't handle the size of the input array.

Any insight into this would be helpful.