
LearnHtml

HTML web content extraction library that relies mostly on DOM features, supplemented by some textual features. It achieves a tag-level F1 score of 0.96 on the Dragnet dataset.
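To make the metric concrete, here is a minimal sketch of how a tag-level F1 score can be computed. It assumes that "tag-level" means every HTML tag is treated as one binary classification instance (1 = content, 0 = boilerplate); the function name `tag_level_f1` and the sample labels are hypothetical and are not part of this library's API.

```python
def tag_level_f1(y_true, y_pred):
    """F1 over per-tag binary labels (assumed: 1 = content, 0 = boilerplate)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # content tags found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # boilerplate kept by mistake
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # content tags missed
    if tp == 0:
        return 0.0  # avoids division by zero when nothing was correctly extracted
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for the six tags of a single page.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = tag_level_f1(y_true, y_pred)
```

With these made-up labels, precision and recall are both 2/3, so the score is also 2/3. The figure reported above may be aggregated differently (for example, averaged per page), so treat this only as an illustration of the formula.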

Requirements

First, install the dependencies. For the binary dependencies:

sudo apt-get install recode libxml2-dev libxslt1-dev unzip

Python dependencies:

pip install -r requirements.txt

Build the project and install it locally:

pip install -e .

Running the scripts

./learnhtml/cli/prepare_data.sh <<WHERE_TO_DOWNLOAD_DATA>> <<NUMBER_OF_WORKERS>>

About

Web content extraction using machine learning

Apache License 2.0
