rundimeco / waddle

This repository documents ongoing work on the evaluation of main text extraction from web pages.

The contribution is twofold:

  • A meta-tool to run various text extraction systems at once
  • An evaluation procedure based on a reference cleaned version

To come: unsupervised evaluation

The evaluation tool can compare different outputs to a single reference. It can also be used to compare different versions of a text generated in other text extraction settings, such as OCR or ASR.
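As a rough illustration of comparing several outputs to a single reference, the sketch below scores each candidate text with a token-level similarity ratio from Python's standard `difflib` module. This is an assumption made for illustration only: the metric, the function names `similarity` and `rank_outputs`, and the sample tool names are hypothetical and are not taken from the repository's own evaluation code.

```python
from difflib import SequenceMatcher


def similarity(reference: str, candidate: str) -> float:
    """Return a ratio in [0, 1] comparing whitespace-separated tokens."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    # autojunk=False avoids difflib's heuristic that ignores frequent tokens.
    matcher = SequenceMatcher(None, ref_tokens, cand_tokens, autojunk=False)
    return matcher.ratio()


def rank_outputs(reference: str, outputs: dict) -> list:
    """Score every output against the reference, best score first."""
    scores = {name: similarity(reference, text) for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    outputs = {
        "tool_a": "the cat sat on the mat",          # hypothetical tool name
        "tool_b": "menu login the cat sat on the",   # hypothetical tool name
    }
    for name, score in rank_outputs(reference, outputs):
        print(f"{name}: {score:.3f}")
```

Because the function only needs two strings, the same comparison would apply to OCR or ASR outputs measured against a reference transcription.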

Installation

Most tools work on Python 3 only. You need to have pip installed (https://pip.pypa.io/en/stable/installing/).

Run the following command to install all the packages (this can take a while):

pip install -r requirements.txt

NB: If you are a Windows user, take a look at this page: https://projects.raspberrypi.org/en/projects/using-pip-on-windows/2

Evaluation

Once the packages are installed, run the evaluation with:

python test_all_tools.py

Directories:

  • Corpus/html: raw HTML files
  • Corpus/cleaned: cleaned files, one directory per tool
  • Corpus/reference: reference cleaned versions (needed for evaluation)
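The layout above suggests how cleaned files could be matched with their references. The sketch below walks `Corpus/cleaned`, treats each subdirectory as one tool, and pairs each cleaned file with a reference file. It assumes that a cleaned file and its reference share the same file name, which the README does not state; the function name `pair_with_references` is hypothetical.

```python
from pathlib import Path


def pair_with_references(corpus_root):
    """Return (tool, cleaned_path, reference_path) for every cleaned file
    that has a reference with the same file name (an assumed convention)."""
    root = Path(corpus_root)
    cleaned_dir = root / "cleaned"
    reference_dir = root / "reference"
    pairs = []
    # One subdirectory of Corpus/cleaned per tool.
    for tool_dir in sorted(p for p in cleaned_dir.iterdir() if p.is_dir()):
        for cleaned_file in sorted(p for p in tool_dir.iterdir() if p.is_file()):
            reference_file = reference_dir / cleaned_file.name
            # Files without a reference cannot be evaluated, so skip them.
            if reference_file.is_file():
                pairs.append((tool_dir.name, cleaned_file, reference_file))
    return pairs


if __name__ == "__main__":
    for tool, cleaned, reference in pair_with_references("Corpus"):
        print(tool, cleaned.name, reference.name)
```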

Tools

We defined three categories: (I) tools designed to extract all the textual content (recall-oriented tools), usually not focused on press articles; (II) tools focusing on the readability of web pages and (III) tools dedicated to text content extraction.

Cat.  Tool          Version    Github                    Reference
I     Html2text     2020.1.16  Alir3z4/html2text/        [https://core.ac.uk/download/pdf/127601559.pdf]
I     Inscriptis    1.0        weblyzard/inscriptis
II    Newspaper3k   0.2.8      codelucas/newspaper
II    News-please   1.4.25     fhamborg/news-please
II    Readability   0.7.1      buriy/python-readability
III   Boilerpy3     1.0.2      jmriebold/BoilerPy3       [https://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf]
III   Dragnet       2.0.4      dragnet-org/dragnet       [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.402.4694&rep=rep1&type=pdf]
III   Goose3        3.1.6      goose3/goose3
III   JusText       2.2.0      miso-belica/jusText       [https://is.muni.cz/th/45523/fi_d/phdthesis.pdf]
III   Trafilatura   0.4.1      adbar/trafilatura         [https://hal.archives-ouvertes.fr/hal-02447264/document]
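To give an idea of how a meta-tool might drive several extractors through one interface, here is a minimal sketch. It does not reproduce the repository's implementation: `run_all` and `strip_tags` are hypothetical names, and `strip_tags` is a standard-library placeholder standing in for a real extractor from the table. In practice each entry of the dictionary would wrap one of the listed libraries.

```python
import re
from html import unescape


def strip_tags(html: str) -> str:
    """Placeholder extractor: drop tags, decode entities, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(unescape(text).split())


def run_all(html: str, extractors: dict) -> dict:
    """Apply every extractor to the same HTML and collect the outputs.

    A failing tool yields an empty string so that one crash does not
    stop the whole run.
    """
    results = {}
    for name, extract in extractors.items():
        try:
            results[name] = extract(html) or ""
        except Exception as error:
            print(f"{name} failed: {error}")
            results[name] = ""
    return results


if __name__ == "__main__":
    page = "<html><body><p>Hello <b>world</b> &amp; co.</p></body></html>"
    print(run_all(page, {"strip_tags": strip_tags}))
```

Keeping every extractor behind the same "HTML string in, text string out" signature is one way to make the outputs directly comparable during evaluation.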

Information

Web-Assembled Data-Driven Language-oriented Evaluation. Just because.

Authors: Gaël Lejeune & Adrien Barbaresi.

TODO: add instructions and a Makefile for processing everything.

Node.js issues (readabilipy): see https://www.digitalocean.com/community/tutorials/how-to-install-node-js-on-ubuntu-18-04-fr

Encoding errors: UTF-8 should be the norm, but in practice it is not. Some issues and possible solutions (Non-ISO extended-ASCII text): https://superuser.com/questions/669700/non-iso-extended-ascii-text
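One common way to cope with files that are not valid UTF-8 is to try a short list of encodings in order. The sketch below is only an assumption about how this could be handled, not the approach used in the repository; the function name `decode_with_fallback` and the chosen fallback encodings are hypothetical.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "cp1252", "latin-1")) -> str:
    """Decode bytes with the first encoding that succeeds."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached with a custom encoding list, since latin-1 accepts any byte.
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(decode_with_fallback("café".encode("utf-8")))
    print(decode_with_fallback(b"caf\xe9"))  # not valid UTF-8
```

Note that a fallback of this kind can silently pick the wrong legacy encoding, so the decoded text is best checked by eye on a few samples.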

Current work: Windows issues ("DLL load failed while importing").

About

License: GNU General Public License v3.0

