mohaps / xtractor

XTractor is an algorithmic extractor of article text from web pages, written in Java. It builds on the "commonly used web design practices" approach (from readability.js, goose and snacktory) to create a set of heuristics for fast article text extraction. It adds several features, such as paragraph preservation, better image detection heuristics, and sibling-score-based enhancements to article detection.
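To make the sibling-score idea concrete, here is a minimal sketch in plain Java. It is not XTractor's actual code or API: the `Block` type, the scoring formula, and the `ratio` threshold are all hypothetical, loosely modeled on the readability-style heuristics mentioned above. It scores each candidate block by length and comma count, discounts link-heavy text, and keeps every sibling whose score is close enough to the best one, so that paragraphs stay separate.

```java
import java.util.ArrayList;
import java.util.List;

class SiblingScoreSketch {
    /** Hypothetical stand-in for a DOM block element; not an XTractor type. */
    static final class Block {
        final String text;
        final int linkChars; // number of characters of text that sit inside links

        Block(String text, int linkChars) {
            this.text = text;
            this.linkChars = linkChars;
        }
    }

    /** Assumed base score: reward length and commas, discount by link density. */
    static double score(Block b) {
        int len = b.text.length();
        if (len == 0) {
            return 0.0;
        }
        long commas = b.text.chars().filter(c -> c == ',').count();
        double linkDensity = (double) b.linkChars / len;
        // One point per block, one per comma, up to three points for length.
        double base = 1.0 + commas + Math.min(len / 100, 3);
        return base * (1.0 - linkDensity);
    }

    /** Keep every sibling whose score is positive and at least ratio * best score. */
    static List<Block> selectArticle(List<Block> siblings, double ratio) {
        double best = 0.0;
        for (Block b : siblings) {
            best = Math.max(best, score(b));
        }
        List<Block> kept = new ArrayList<>();
        for (Block b : siblings) {
            double s = score(b);
            if (s > 0.0 && s >= best * ratio) {
                kept.add(b); // each kept block remains its own paragraph
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Block> siblings = new ArrayList<>();
        siblings.add(new Block("Home About Contact", 18)); // pure navigation links
        siblings.add(new Block("Article extraction, in this sketch, rewards long, comma-rich text.", 0));
        siblings.add(new Block("A shorter sibling paragraph, still kept.", 0));
        for (Block b : selectArticle(siblings, 0.2)) {
            System.out.println(b.text);
        }
    }
}
```

A block made up entirely of link text scores zero and is dropped, while shorter paragraphs next to a high-scoring one survive as long as they clear the threshold. A real extractor would run this over a parsed DOM rather than hand-built blocks.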

Home Page: http://xtractor.herokuapp.com/
