kamaux0r / Coreex

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

What?

The goal of this Python module is to extract the main content of a webpage (like Readability for instance).

Ideally it would work on news websites & blogs, and it should work for a vast majority of languages.

Reference

The main algorithm is from the paper CoreEx: Content Extraction from Online News Articles by Jyotika Prasad & Andreas Paepcke.

You should note that this is an experiment so I may try different heuristics to improve the results (and so do not fully respect the algorithms described in the paper).

Try it

Just drop some articles from your favorite news websites in a directory called tests, (your articles should have the .in.html extension) and run the script, you'll get the content of the articles in *.out.html files.

You can also simply call summary from a Python REPL with an URL if you wish...

About


Languages

Language:Python 100.0%