tomcon/node-word-extractor

word-extractor

Read data from a Word document using node.js

Why use this module?

There are a fair number of npm components which can extract text from Word .doc files, but they all appear to require some external helper program, and involve either spawning a process or communicating with a persistent one. That raises the installation and deployment burden as well as the runtime one.

This module is intended to provide a much faster way of reading the text from a Word file, without leaving the node.js environment.

How do I install this module?

yarn add word-extractor

# Or using npm... 
npm install word-extractor

How do I use this module?

var WordExtractor = require("word-extractor");
var extractor = new WordExtractor();
var extracted = extractor.extract("file.doc");
extracted.then(function(doc) {
  console.log(doc.getBody());
});

The object returned from the extract() method is a promise that resolves to a document object, which then provides several views onto different parts of the document contents.

Methods

WordExtractor#extract(file)

Main method to open a Word file and retrieve the data. Returns a promise which resolves to a Document.

Document#getBody()

Retrieves the content text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getFootnotes()

Retrieves the footnote text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getEndnotes()

Retrieves the endnote text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getHeaders()

Retrieves the header and footer text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getAnnotations()

Retrieves the comment bubble text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

License

Licensed under the MIT License.

tomcon / node-word-extractor