hubitor / learnhtml

Web content extraction using machine learning

LearnHtml

HTML web content extraction library that relies mostly on DOM features, supplemented by some textual features. It achieves a tag-level F1 score of 0.96 on the Dragnet dataset.
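
For readers unfamiliar with the metric, here is a minimal sketch of how a tag-level F1 score can be computed, assuming each HTML tag carries a binary label (1 for content, 0 for boilerplate). The function name `tag_level_f1` and the sample labels are hypothetical and are not taken from the LearnHtml code base; the project's own evaluation may differ in detail.

```python
def tag_level_f1(y_true, y_pred):
    """F1 score over binary per-tag labels (1 = content, 0 = boilerplate).

    Hypothetical helper for illustration only; not part of LearnHtml.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # content tags found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # boilerplate marked as content
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # content tags missed
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for six tags of one page.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = tag_level_f1(y_true, y_pred)
print(round(score, 3))
```

F1 is the harmonic mean of precision and recall, so a high score requires both few missed content tags and few boilerplate tags mislabeled as content.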

Requirements

First, install the dependencies. The binary dependencies can be installed with apt-get:

sudo apt-get install recode libxml2-dev libxslt1-dev unzip

Python dependencies:

pip install -r requirements.txt

Build the project and install it locally:

pip install -e .

Running the scripts

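To prepare the data, run the preparation script, passing the directory where the data should be downloaded and the number of workers to use:

./learnhtml/cli/prepare_data.sh <<WHERE_TO_DOWNLOAD_DATA>> <<NUMBER_OF_WORKERS>>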

Copyright (C) 2018 Nichita Uțiu

License: Apache License 2.0

