boilerpipe

Boilerplate Removal and Fulltext Extraction from HTML pages

NOTE: This is a work-in-progress transmit from Google Code.

The latest stable version of boilerpipe is available at https://code.google.com/p/boilerpipe.

Difference from original

You can switch tokenizer which implements ITokenizer interface. For example, Tokenizer using Kuromoji is available.

// for example
String originalDocument = "<html><body>こんにちは、世界</body></html>";
HTMLDocument h = new HTMLDocument(originalDocument.getBytes(), Charset.forName("UTF-8"));
BoilerpipeSAXInput in = new BoilerpipeSAXInput(h.toInputSource());
BoilerpipeHTMLContentHandler handler = new BoilerpipeHTMLContentHandler(new KuromojiTokenizer(Tokenizer.Mode.SEARCH));
BoilerpipeHTMLParser parser = new BoilerpipeHTMLParser(handler);
String boiledDocument = ArticleExtractor.INSTANCE.getText(in.getTextDocument(parser));

Kuromoji installation

To install Kuromoji without depending on http://www.atilika.org/nexus/content/repositories/atilika/, do:

make -f kuromoji.mk all install

which will download and install JAR from https://github.com/atilika/kuromoji/downloads.

About

Kuromoji enable boilerpipe

https://code.google.com/p/boilerpipe/

Other

Languages

Language:Java 99.0%Language:Shell 0.6%Language:Makefile 0.3%