There are 4 repositories under the html-extraction topic.
Module for automatic summarization of text documents and HTML pages.
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
Domain-specific language for extracting structured data from HTML documents
Article extraction benchmark: dataset and evaluation scripts
Script for extracting units from http://vocab.nerc.ac.uk/collection/P06/current/ to easily add units to the database (this should only be temporary, to demonstrate how units can work)
Extract price amount and currency symbol from a raw text string
Fast Python port of arc90's readability tool, updated to match the latest readability.js!