rakesh-mohanta / textract

Node module for extracting text from various file types

textract

A text extraction node module.

Currently Extracts...

  • PDF
  • DOC
  • DOCX
  • XLS
  • XLSX
  • XLSB
  • XLSM
  • PPTX
  • DXF
  • PNG
  • JPG
  • GIF
  • RTF
  • application/javascript
  • All text/* mime-types.

Does textract not extract from files of the type you need? Add an issue or submit a pull request. It's super easy to add an extractor for a new mime type.

Install

npm install textract

Requirements

  • PDF extraction requires pdftotext to be installed
  • DOC extraction requires catdoc to be installed
  • RTF extraction requires catdoc to be installed
  • DOCX extraction requires unzip to be available
  • PPTX extraction requires unzip to be available
  • PNG, JPG and GIF extraction requires tesseract to be available. Images need to be clear, high DPI and made up almost entirely of text for tesseract to extract the text accurately.
  • DXF extraction requires drawingtotext to be available
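
The requirements above can be restated as a small lookup table, which may be handy for checking a machine before running extractions. This is only a sketch: `requiredBinary` and `binaryFor` are hypothetical names that are not part of textract, and the table contains nothing beyond what the list above says.

```javascript
// Hypothetical lookup table restating the requirements list; not part of textract.
const requiredBinary = {
  pdf: 'pdftotext',
  doc: 'catdoc',
  rtf: 'catdoc',
  docx: 'unzip',
  pptx: 'unzip',
  png: 'tesseract',
  jpg: 'tesseract',
  gif: 'tesseract',
  dxf: 'drawingtotext'
};

// Returns the external binary listed for a file extension,
// or null when the requirements list names none for that extension.
function binaryFor(ext) {
  const key = String(ext).toLowerCase().replace(/^\./, '');
  return Object.prototype.hasOwnProperty.call(requiredBinary, key)
    ? requiredBinary[key]
    : null;
}
```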

Usage

Command Line

If textract is installed globally, via npm install -g textract, then the following command will write the extracted text to the console.

$ textract pathToFile

In your node app

Import

var textract = require('textract');

Execution

If you do not know the mime type of the file

textract(filePath, function( error, text ) {})

If you know the mime type of the file

textract(type, filePath, function( error, text ) {})

If you wish to pass some config and know the mime type

textract(type, filePath, config, function( error, text ) {})

If you wish to pass some config, but do not know the mime type

textract(filePath, config, function( error, text ) {})
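
The four call forms differ only in which optional arguments are present. As a rough illustration, the sketch below builds the argument list for whichever form applies. `callTextract` and its `options` object are hypothetical and not part of textract; the function to call is passed in as `textractFn`, which is assumed to be the value returned by require('textract').

```javascript
// Hypothetical helper, not part of textract: picks one of the four call
// forms based on which optional values are supplied.
// `textractFn` is assumed to be the function returned by require('textract').
function callTextract(textractFn, options, callback) {
  const args = [];
  if (options.type) args.push(options.type);     // mime type, when known
  args.push(options.filePath);                   // always required
  if (options.config) args.push(options.config); // optional config object
  args.push(callback);                           // function (error, text) {}
  return textractFn.apply(null, args);
}

// Example, assuming textract is installed:
// callTextract(require('textract'), { filePath: 'notes.txt' },
//   function (error, text) { if (!error) console.log(text); });
```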

The error will contain informative text about why the extraction failed. If textract does not currently extract files of the type provided, a typeNotFound flag will be set on the error object.

If processing a .gif on OSX, an error will be thrown with a macProcessGif flag on it set to true. Tesseract has issues with .gifs on OSX.
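
A callback can branch on these two flags before falling back to the error text. The sketch below is one possible way to do that; `explainError` is a hypothetical helper, and reading the text from `error.message` is an assumption about the shape of the error object rather than something this README states.

```javascript
// Hypothetical helper, not part of textract: turns the error passed to the
// textract callback into a short description.
function explainError(error) {
  if (!error) return 'ok';
  // Flag described above for file types textract cannot extract.
  if (error.typeNotFound) return 'unsupported file type';
  // Flag described above for .gif files on OSX.
  if (error.macProcessGif) return 'gif processing is disabled on OSX';
  // Assumption: the informative text is available as error.message.
  return 'extraction failed: ' + (error.message || String(error));
}

// Example, assuming textract is installed:
// textract(filePath, function (error, text) { console.log(explainError(error)); });
```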

Configuration

Configuration can be passed into textract. The following configuration options are available:

  • preserveLineBreaks: By default textract does NOT preserve line breaks. Pass this in as true and textract will not strip any line breaks.
  • exec: Some extractors (xlsx, docx, dxf) use node's exec functionality. This setting allows for providing config to exec execution. One reason you might want to provide this config is if you are dealing with very large files. You might want to increase the exec maxBuffer setting.
  • [ext].exec: Each extractor can take specific exec config.
  • macProcessGif: By default on OSX textract will not run tesseract on .gif files. (See this Stack Overflow post) If you've figured out how to make it work, set this flag to true to turn gif processing back on.
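
A config object combining these options might look like the sketch below. `buildConfig` is a hypothetical helper. `maxBuffer` is a standard option of node's child_process.exec, but the nesting used for the per-extractor setting (`docx: { exec: ... }`) is only one reading of "[ext].exec" and should be checked against the textract source.

```javascript
// Hypothetical helper, not part of textract: builds a config object for
// large files by raising the exec maxBuffer.
function buildConfig(maxBufferBytes) {
  return {
    // Keep line breaks instead of stripping them.
    preserveLineBreaks: true,
    // Passed to node's exec by the extractors that use it.
    exec: { maxBuffer: maxBufferBytes },
    // Assumed shape for "[ext].exec": a larger buffer for docx only.
    docx: { exec: { maxBuffer: maxBufferBytes * 2 } }
  };
}

// Example, assuming textract is installed:
// textract(filePath, buildConfig(1024 * 1024), function (error, text) {});
```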

Release Notes

0.12.0

  • #21, #22, Now using j via its binaries rather than using it via node. This makes XLS/X extraction slower, but reduces memory consumption of textract significantly.

0.11.2

  • Updated pdf-text-extract to latest, fixes #20.

0.11.1

  • Addressed path escaping issues with tesseract, fixes #18.

0.11.0

  • Using j to handle xls and xlsx, this removes the requirement on the xls2csv binary.
  • j also supports xlsb and xlsm
