Rika

A JRuby wrapper for Apache Tika to extract text and metadata from various file formats.

More information about Apache Tika can be found here: http://tika.apache.org/

Installation

Add this line to your application's Gemfile:

gem 'rika'

Note that this gem works only on JRuby.

Then execute:

$ bundle

Or install it yourself as:

$ gem install rika
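
Because the gem works only on JRuby, it can help to fail early with a clear message when code is started on another Ruby. The following is a minimal sketch, not part of Rika: `jruby?` and `require_rika!` are hypothetical helper names, and the check relies on the standard `RUBY_ENGINE` constant.

```ruby
# True when the given engine name is JRuby. RUBY_ENGINE is a standard
# Ruby constant ("jruby" on JRuby, "ruby" on MRI).
def jruby?(engine = RUBY_ENGINE)
  engine == 'jruby'
end

# Hypothetical helper (not part of Rika): load the gem only on JRuby,
# otherwise fail early with a readable message.
def require_rika!(engine = RUBY_ENGINE)
  raise LoadError, "rika requires JRuby, but this is #{engine}" unless jruby?(engine)

  require 'rika'
end
```

Calling `require_rika!` in place of a bare `require 'rika'` should then surface the platform problem at load time rather than deep inside a parsing call.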

Usage

For the simplest use cases, the following convenience methods return what you need in a single call:

require 'rika'

content           = Rika.parse_content('document.pdf')    # string containing all content text
metadata          = Rika.parse_metadata('document.pdf')   # hash containing the document metadata
content, metadata = Rika.parse_content_and_metadata('document.pdf')   # both of the above
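
As an illustration of working with the two values returned by `Rika.parse_content_and_metadata`, here is a small sketch. `summarize` is a hypothetical helper, not part of Rika, and it assumes the metadata behaves like a plain Ruby Hash with string keys such as "title".

```ruby
# Hypothetical helper (not part of Rika): build a one-line summary from
# a content String and a metadata Hash.
def summarize(content, metadata, preview_length = 40)
  # Fall back to a placeholder when the document has no "title" field.
  title = metadata.fetch('title', '(untitled)')
  # Collapse runs of whitespace so the preview stays on one line.
  preview = content.strip.gsub(/\s+/, ' ')[0, preview_length]
  "#{title}: #{preview}"
end

# Usage under JRuby with the gem installed:
#   content, metadata = Rika.parse_content_and_metadata('document.pdf')
#   puts summarize(content, metadata)
```

The helper takes plain values rather than calling Rika itself, so it can be exercised without JRuby.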

For other use cases and finer control, you can work directly with the Rika::Parser object:

require 'rika'

parser = Rika::Parser.new('document.pdf')

# Return the content of the document:
parser.content 

# Return the media type for the document:
parser.media_type
# => "application/pdf"

# Return the metadata field title if it exists:
parser.metadata["title"] if parser.metadata_exists?("title") 

# Return all the available metadata keys that can be read from the document:
parser.available_metadata

# Return only the first 10000 chars of the content:
parser = Rika::Parser.new('document.pdf', 10000)
parser.content # first 10000 chars returned

# Return content from a URL (here limited to the first 200 chars):
parser = Rika::Parser.new('http://riakhandbook.com/sample.pdf', 200)
parser.content

# Return the language for the content:
parser = Rika::Parser.new('german document.pdf')
parser.language
# => "de"

# Check whether the language identification is certain enough to be trusted:
parser.language_is_reasonably_certain?
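
The checks shown above (`metadata_exists?` before reading a field, `language_is_reasonably_certain?` before trusting `language`) can be wrapped in small helpers. This is a sketch under assumptions: `metadata_field` and `trusted_language` are hypothetical names, not part of Rika, and they rely only on the parser methods used in the examples above.

```ruby
# Hypothetical helper (not part of Rika): read a metadata field, or
# return a default when the document does not provide it.
def metadata_field(parser, key, default = nil)
  parser.metadata_exists?(key) ? parser.metadata[key] : default
end

# Hypothetical helper (not part of Rika): return the detected language
# only when the parser reports the identification as reasonably certain.
def trusted_language(parser, fallback = nil)
  parser.language_is_reasonably_certain? ? parser.language : fallback
end

# Usage under JRuby with the gem installed:
#   parser = Rika::Parser.new('document.pdf')
#   metadata_field(parser, 'title', 'Untitled')
#   trusted_language(parser, 'unknown')
```

Both helpers accept any object that responds to the same methods, which keeps them easy to test with a stand-in parser.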
	

Credits

The following people have contributed ideas, documentation, or code to Rika:

  • Keith Bennett
  • Richard Nyström

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

License: MIT License