read-comfortably

Turns any web page into a clean view for reading.

This module is based on arc90's readability project.

Example

Install

$ npm install --save read-comfortably

Usage

read(html [, options], callback)

Where

html is a URL or a string of HTML source code.
options is an optional options object.
callback is the function to run when reading finishes: callback(error, article, meta).

Example

var read = require('read-comfortably');

read('http://howtonode.org/really-simple-file-uploads', function(err, article, meta) {
  // Main Article
  console.log(article.content);
  // Title
  console.log(article.title);

  // HTML Source Code
  console.log(article.html);
  // DOM
  console.log(article.document);

  // Description Article
  console.log(article.getDesc(300));

  // Response Object from fetchUrl Lib
  console.log(meta);

  // Article's Images
  article.getImages(function (err, images) {
    console.log(images);
  });

  // HTML Source Code by replace css files
  article.getHtmls([ { selector: 'link[rel="stylesheet"]', attr: 'href', tag: 'style' } ], function (err, html) {
    console.log(html);
  });
});

Options

read-comfortably will pass the options to fetchUrl directly. See fetchUrl lib to view all available options.

read-comfortably has eleven additional options:

urlprocess should be a function that checks or modifies the URL before it is passed to readability.

options.urlprocess = callback(url);

read(
  url,
  {
    urlprocess: function(url) {
      //...
    }
  },
  function(err, article, meta) {
    //...
  }
);
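
As an illustration, a urlprocess callback might strip tracking parameters before the page is fetched. This is only a minimal sketch: stripTrackingParams is a hypothetical helper, and it assumes that the value returned by urlprocess is the URL read-comfortably goes on to use, which this README does not state explicitly.

```javascript
// Hypothetical helper: remove utm_* tracking parameters from a URL.
// Assumption: read-comfortably uses the value returned by urlprocess.
function stripTrackingParams(url) {
  var parsed = new URL(url); // WHATWG URL, available as a global in Node.js 10+
  // Copy the keys first so that deleting does not disturb the iteration.
  Array.from(parsed.searchParams.keys()).forEach(function (key) {
    if (key.indexOf('utm_') === 0) {
      parsed.searchParams.delete(key);
    }
  });
  return parsed.toString();
}

// Options object to pass as the second argument of read().
var options = { urlprocess: stripTrackingParams };
```

If the callback is instead expected to validate the URL only, the same function can be reduced to a check that throws or logs.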

preprocess should be a function that checks or modifies the downloaded source before it is passed to readability.

options.preprocess = callback($, options);

read(
  url,
  {
    preprocess: function($, options) {
      //...
    }
  },
  function(err, article, meta) {
    //...
  }
);
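
For example, a preprocess callback could drop obvious non-article containers before scoring starts. The sketch below assumes that `$` is a Cheerio-style selector function for the downloaded page (the library is built on Cheerio); removeNoise and the selectors are hypothetical and would need adjusting per site.

```javascript
// Hypothetical selectors; adjust them to the site being read.
var NOISE_SELECTORS = ['.advertisement', '#comments', 'nav'];

// Assumption: `$` is the Cheerio-style selector function for the page,
// so $(selector).remove() deletes every matching element.
function removeNoise($, options) {
  NOISE_SELECTORS.forEach(function (selector) {
    $(selector).remove();
  });
}

// Options object to pass as the second argument of read().
var options = { preprocess: removeNoise };
```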

postprocess should be a function that checks or modifies the article content after it has been passed through readability.

options.postprocess = callback(node, $);

read(
  url,
  {
    postprocess: function(node, $) {
      //...
    }
  },
  function(err, article, meta) {
    //...
  }
);
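
A postprocess callback might, for instance, strip inline styles from the extracted article. This sketch assumes that `node` is the extracted article element and that `$` is Cheerio-style; stripInlineStyles is a hypothetical name. Wrapping with `$(node)` is used so the code works whether `node` arrives as a raw element or as an already wrapped selection.

```javascript
// Assumption: `node` is the extracted article and `$` is Cheerio-style.
function stripInlineStyles(node, $) {
  // Find every descendant carrying a style attribute and drop that attribute.
  $(node).find('[style]').removeAttr('style');
}

// Options object to pass as the second argument of read().
var options = { postprocess: stripInlineStyles };
```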

afterToRemove lets you set the array of tags to remove after the grabArticle function has run.

options.afterToRemove = array; (default ['script', 'noscript'])

read(
  url,
  {
    afterToRemove: [
      'iframe',
      'script',
      'noscript'
    ]
  },
  function(err, article, meta) {
    //...
  }
);

nodesToRemove lets you set the array of tags to remove.

options.nodesToRemove = array;

read(
  url,
  {
    nodesToRemove: [
      'meta',
      'aside',
      'style',
      'object',
      'iframe',
      'script',
      'noscript'
    ]
  },
  function(err, article, meta) {
    //...
  }
);

noChdToRemove lets you set the array of tags to remove when they have no children.

options.noChdToRemove = array; (default ['div'])

read(
  url,
  {
    noChdToRemove: [
      'div',
      'li'
    ]
  },
  function(err, article, meta) {
    //...
  }
);

considerDIVs, when true, turns every div that has no block-level child elements into a p.

options.considerDIVs = boolean; (default false)

read(
  url,
  {
    considerDIVs: true
  },
  function(err, article, meta) {
    //...
  }
);

nodesToScore lets you set the array of tags to score.

options.nodesToScore = array; (default ['p', 'article'])

read(
  url,
  {
    nodesToScore: ['p', 'pre']
  },
  function(err, article, meta) {
    //...
  }
);

nodesToAppend lets you set the array of tags to append.

options.nodesToAppend = array; (default ['p'])

read(
  url,
  {
    nodesToAppend: ['pre']
  },
  function(err, article, meta) {
    //...
  }
);

maybeImgsAttr lets you set the attributes that may hold an image URL.

options.maybeImgsAttr = array; (default ['src', 'href'])

read(
  url,
  {
    maybeImgsAttr: ['src', 'data-src']
  },
  function(err, article, meta) {
    //...
  }
);

hostnameParse lets you map one hostname to another.

options.hostnameParse = object;

read(
  url,
  {
    hostnameParse: { 'www.google.com': 'www.google.com.hk' }
  },
  function(err, article, meta) {
    //...
  }
);
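
To make the mapping concrete, the sketch below shows the kind of rewrite a hostnameParse object describes. mapHostname is a hypothetical helper written for illustration; it is not part of read-comfortably, and the library's internal handling may differ.

```javascript
// Hypothetical helper: rewrite a URL's hostname according to a lookup object.
function mapHostname(url, hostnameMap) {
  var parsed = new URL(url); // WHATWG URL, available as a global in Node.js 10+
  // Only rewrite when the hostname is an own key of the map.
  if (Object.prototype.hasOwnProperty.call(hostnameMap, parsed.hostname)) {
    parsed.hostname = hostnameMap[parsed.hostname];
  }
  return parsed.toString();
}

var hostnameParse = { 'www.google.com': 'www.google.com.hk' };
console.log(mapHostname('http://www.google.com/search?q=node', hostnameParse));
// prints "http://www.google.com.hk/search?q=node"
```

Hostnames that are not keys of the object are left unchanged.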

article object

If html points to an image, article is a buffer.

Otherwise, article has the following properties and methods.

content

The article content of the web page.

title

The article title of the web page. It may not be the same as the text in the <title> tag.

html

The original html of the web page.

dom

The document of the web page generated by jsdom. You can use it to access the DOM directly (for example, article.document.getElementById('main')).

getDesc(length)

The article description of the web page.

getImages(callback)

The article content's images of the web page.

getHtmls(files, callback)

The original HTML of the web page with the specified files replaced (for example, linked CSS files, as in the usage example above).
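
Because article can be either a buffer or an object with the members listed above, a callback may want to branch on its type first. The following is a minimal sketch; describeArticle and the shape of its return value are hypothetical.

```javascript
// Hypothetical helper: summarize whatever read() handed to the callback.
function describeArticle(article) {
  if (Buffer.isBuffer(article)) {
    // The source was an image, so there is no title or content to read.
    return { kind: 'image', bytes: article.length };
  }
  return {
    kind: 'page',
    title: article.title,
    summary: article.getDesc(300) // same length as in the usage example
  };
}
```

Inside the read callback, call describeArticle(article) after checking err.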

meta object

status

HTTP status code

responseHeaders

response headers

finalUrl

last url value, useful with redirects

redirectCount

how many redirects happened

cookieJar

CookieJar object for sharing/retrieving cookies
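
The fields above can be combined into a short log line, for example to spot failed requests or unexpected redirects. summarizeMeta is a hypothetical helper, and treating only 2xx status codes as success is an assumption of this sketch.

```javascript
// Hypothetical helper: turn the meta object into a one-line log message.
function summarizeMeta(meta) {
  // Assumption: only 2xx responses count as success.
  var ok = meta.status >= 200 && meta.status < 300;
  var line = (ok ? 'OK ' : 'FAILED ') + meta.status + ' ' + meta.finalUrl;
  if (meta.redirectCount > 0) {
    line += ' (after ' + meta.redirectCount + ' redirect(s))';
  }
  return line;
}
```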

Why not JSDOM

Before starting this project I used jsdom, but that project's dependencies, together with the slowness of JSDOM, made it very frustrating to work with. Compiling the contextify module (a dependency of JSDOM) failed nine times out of ten. And if you wanted to use it with node-webkit, you had to manually rebuild contextify with nw-gyp, which is not an ideal solution.

So I decided to write my own version of Arc90's Readability using the fast Cheerio engine with the least number of dependencies.

The usage of this module is similar to JSDOM, so it's easy to switch.

The library uses the Cheerio engine because it can convert the URL to UTF-8 automatically.

Contributors

https://gitlab.com/unrealce/read-comfortably

https://github.com/wzbg/read-comfortably

Related

cheerio - Tiny, fast, and elegant implementation of core jQuery designed specifically for the server.
fetch - Fetch url contents. Supports gzipped content for quicker download, redirects (with automatic cookie handling, so no eternal redirect loops), streaming and piping etc.
image-size - get dimensions of any image file.
is-image-url - Check if a url is an image.
is-url - Check whether a string is a URL.
log4js - Port of Log4js to work with node.
string - string contains methods that aren't included in the vanilla JavaScript string such as escaping html, decoding html entities, stripping tags, etc.
url - The core url packaged standalone for use with Browserify.

License

The MIT License (MIT)
