sourcepirate / ripit

Python port of Boilerpipe library

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

ripit (Forked from BoilerPy3 a beautiful library.)

PyPI - Version Updates

Original Boilerpy3 was not maintianed. I forked it to add some features changes. No changes to license

About

BoilerPy3 is a native Python port of Christian KohlschĂĽtter's Boilerpipe library, released under the Apache 2.0 Licence.

This package is based on sammyer's BoilerPy, specifically mercuree's Python3-compatible fork. This fork updates the codebase to be more Pythonic (proper attribute access, docstrings, type-hinting, snake case, etc.) and make use Python 3.6 features (f-strings), in addition to switching testing frameworks from Unittest to PyTest.

Note: This package is based on Boilerpipe 1.2 (at or before this commit), as that's when the code was originally ported to Python. I experimented with updating the code to match Boilerpipe 1.3, however because it performed worse in my tests, I ultimately decided to leave it at 1.2-equivalent.

Installation

To install the latest version from PyPI, execute:

pip install ripit

Usage

The top-level interfaces are the Extractors. Use the get_content() methods to extract the filtered text.

from ripit import extractors

extractor = extractors.ArticleExtractor()

# From a URL
content = extractor.get_content_from_url('http://www.example.com/')

# From a file
content = extractor.get_content_from_file('tests/test.html')

# From raw HTML
content = extractor.get_content('<html><body><h1>Example</h1></body></html>')

Alternatively, use get_doc() to return a Boilerpipe document from which you can get more detailed information.

from ripit import extractors

extractor = extractors.ArticleExtractor()

doc = extractor.get_doc_from_url('http://www.example.com/')
content = doc.content
title = doc.title

Extractors

DefaultExtractor

Usually worse than ArticleExtractor, but simpler/no heuristics. A quite generic full-text extractor.

ArticleExtractor

A full-text extractor which is tuned towards news articles. In this scenario it achieves higher accuracy than DefaultExtractor. Works very well for most types of Article-like HTML.

ArticleSentencesExtractor

A full-text extractor which is tuned towards extracting sentences from news articles.

LargestContentExtractor

A full-text extractor which extracts the largest text component of a page. For news articles, it may perform better than the DefaultExtractor but usually worse than ArticleExtractor

CanolaExtractor

A full-text extractor trained on krdwrd Canola. Works well with SimpleEstimator, too.

KeepEverythingExtractor

Dummy extractor which marks everything as content. Should return the input text. Use this to double-check that your problem is within a particular Extractor or somewhere else.

NumWordsRulesExtractor

A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).

About

Python port of Boilerpipe library

License:Other


Languages

Language:Python 82.3%Language:HTML 17.7%