jondot / hog

Supercharge your Ruby Pig UDFs with Hog.

Hog

Why Use Hog?

Use Hog when you want to process (map) tuples with Hadoop, and generally believe that:

  • It's too much overhead to write custom M/R code
  • You're too lazy to process with Pig data pipes and sprinkled custom UDFs, and they're not powerful enough anyway
  • Data processing with Hive queries is wrong
  • Cascading is overkill for your case
  • Streaming Hadoop (unix processes) is the last resort

And instead want to develop within your familiar Ruby ecosystem but also:

  • Control and tweak job parameters via Pig
  • Use Pig's job management
  • Use Pig's I/O (serdes) easily

And you also want to use gems, jars, and everything Ruby (JRuby) has to offer for a really quick turnaround and a pleasant development experience.

Quickstart

Let's say you want to perform geolocation, detect a device's form factor, and shove bits and fields from the nested structure into a Hive table to be able to query it efficiently.

This is your raw event:

{
  "x-forwarded-for":"8.8.8.8",
  "user-agent": "123",
  "pageview": {
    "userid":"ud-123",
    "timestamp":123123123,
    "headers":{
      "user-agent":"...",
      "cookies":"..."
    }
  }
}

To do this with Pig alone, you would need:

  • A pig UDF for geolocation
  • A pig UDF for form-factor detection
  • A way to extract JSON fields from a tuple in Pig
  • Combine everything in Pig code

And you'll give up:

  • Testability
  • Maintainability
  • Development speed
  • Eventually, sanity

Using Hog

With Hog, you'll mostly feel you're describing how you want the data to be shaped, and it'll do the heavy lifting.

You always have the option to 'drop' to code with the prepare block.

# gem install hog
require 'hog'

TupleProcessor = Hog.tuple "mytuple" do

  # Your own custom feature extraction. This can be any freestyle
  # Ruby or Java code.
  #
  prepare do |hsh|
    loc = ip2location(hsh["x-forwarded-for"])
    hsh.merge!(loc)

    form_factor = formfactor(hsh["user-agent"])
    hsh.merge!(form_factor)
  end


  # Describe your data columns (i.e. for Hive), and how you'd like
  # to pull them out of your nested data.
  #
  # You can also put a fixed value, or serialize a sub-nested
  # structure as-is for later use with Hive's get_json_object
  #
  chararray :id, :path => 'pageview.userid'
  chararray :created_at, :path => 'pageview.timestamp', :with_nil => ''
  float :sampling, :value => 1.0
  chararray :ev_type, :value => 'pageview'
  chararray :ev_info, :json => 'pageview.headers', :default => {}
end

As you'll notice, there's very little code. You specify only the bare essentials (types and fields), which you would have had to specify with Pig anyway, however you chose to use it.

Hog will generate a TupleProcessor that conforms to your description and logic.
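To make the column description more concrete, here is a minimal hand-written sketch of roughly what it expresses, in plain Ruby. This is not Hog's implementation: `fetch_path` and `shape_row` are hypothetical names, and the handling of `:with_nil`, `:json`, and `:default` reflects an assumed reading of those options (a fallback for missing values, and JSON serialization of a sub-structure).

```ruby
require "json"

# Hypothetical helper: walk a dotted path such as "pageview.userid"
# through nested hashes. Yields nil once a segment is missing.
def fetch_path(hsh, path)
  path.split(".").reduce(hsh) do |node, key|
    node.is_a?(Hash) ? node[key] : nil
  end
end

# Hypothetical hand-written equivalent of the column description.
# Option semantics are assumptions, not taken from Hog's source.
def shape_row(raw_json)
  event = JSON.parse(raw_json)

  created_at = fetch_path(event, "pageview.timestamp")
  headers    = fetch_path(event, "pageview.headers")

  {
    id:         fetch_path(event, "pageview.userid"),   # :path lookup
    created_at: created_at.nil? ? "" : created_at.to_s, # assumed :with_nil fallback
    sampling:   1.0,                                     # fixed :value
    ev_type:    "pageview",                              # fixed :value
    ev_info:    JSON.generate(headers || {})             # assumed :json with :default
  }
end

raw = '{"x-forwarded-for":"8.8.8.8","pageview":{"userid":"ud-123",' \
      '"timestamp":123123123,"headers":{"user-agent":"...","cookies":"..."}}}'

row = shape_row(raw)
puts row[:id]
puts row[:ev_info]
```

Writing this by hand for every event type is exactly the repetitive part that the declarative description is meant to replace.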

Testing

To test your mapper, just use TupleProcessor within your specs and give it a raw event. It's plain Ruby.
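As a rough illustration, a spec might look like the Minitest sketch below. So that it runs without Hog installed, it uses a hypothetical `StandInProcessor` in place of the real `TupleProcessor`; the row shape (a Hash keyed by column name) is an assumption, so adjust the assertions to whatever `TupleProcessor.process` actually returns in your project.

```ruby
require "json"
require "minitest/autorun"

# Hypothetical stand-in so the example is self-contained. In a real spec,
# require the file that defines TupleProcessor and call it instead.
module StandInProcessor
  # Takes one raw JSON line and returns an array of rows.
  def self.process(line)
    event = JSON.parse(line)
    pageview = event.fetch("pageview", {})
    [{ "id" => pageview["userid"], "ev_type" => "pageview" }]
  end
end

class TupleProcessorTest < Minitest::Test
  RAW_EVENT = '{"x-forwarded-for":"8.8.8.8",' \
              '"pageview":{"userid":"ud-123","timestamp":123123123}}'

  def test_extracts_user_id_from_raw_event
    rows = StandInProcessor.process(RAW_EVENT)
    assert_equal 1, rows.size
    assert_equal "ud-123", rows.first["id"]
  end

  def test_tolerates_event_without_pageview
    rows = StandInProcessor.process("{}")
    assert_nil rows.first["id"]
  end
end
```

No Hadoop or Pig runtime is involved here: the spec feeds a raw JSON string in and inspects plain Ruby values coming out.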

Hadoop

To hook up with Hadoop, rig TupleProcessor within a Pig JRuby UDF shell like so:

require 'tuple_processor'
require 'pigudf'

class TupleProcessorUdf < PigUdf
  # TupleProcessor automatically generates your UDF schema, no matter how complex or
  # convoluted.
  outputSchema TupleProcessor.schema

  # Use TupleProcessor as the processing logic.
  def process(line)
    # since 1 raw json can output several rows, 'process' returns
    # an array. we pass that to Pig as Pig's own 'DataBag'.
    # You can also do without and just return whatever TupleProcessor#process returns.
    databag = DataBag.new
    TupleProcessor.process(line).each do |res|
      databag.add(res)
    end

    databag
  end
end

Gems and Jars

You can't really use gems or jars with Pig Ruby UDFs unless every machine has them installed, and that quickly becomes ugly to manage.

The solution is neat: pack everything as you would any JRuby project, for example with Warbler. Your job code becomes a standalone jar that you can deploy, version, and possibly generate via CI.

Related Projects

Contributing

Fork, implement, add tests, pull request, get my everlasting thanks and a respectable place here :).

Copyright

Copyright (c) 2014 Dotan Nahum @jondot. See MIT-LICENSE for further details.
