machine-learning data-science bert named-entity-recognition natural-language-processing

AuthorExtractor

Source code for Extracting the author of news stories with Machine Learning and DOM-based segmentation

Introduction

This repository contains:

Source code for Web2Text, including additional features specific to the Author Extraction task
Feature representations for more than 70,000 news articles from All The News under public/train_and_test (23.39GB), in CSV format. The corresponding HTML files were not uploaded because they contain copyrighted material
Train Model Task
Weights from a pre-trained model with the above dataset (under public/trained_model_all_the_news)
Inference Task

Installation

Install Scala and SBT. The code was tested with SBT 1.3.6.
Install Python 3 with TensorFlow (tested with 2.1.0), Keras, NumPy, scikit-learn, HuggingFace Transformers and wget. Running an Anaconda instance is recommended.
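To confirm that the Python side of the installation is complete, a check along these lines may help. It is a minimal sketch, not part of the repository: the helper name installed_versions and the PyPI distribution names in REQUIRED are assumptions based on the dependency list above.

```python
from importlib import metadata

# Assumed PyPI distribution names for the dependencies listed above.
REQUIRED = ["tensorflow", "keras", "numpy", "scikit-learn", "transformers", "wget"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print one line per dependency so missing packages are easy to spot.
for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Only TensorFlow has a tested version stated above (2.1.0), so the sketch reports versions rather than enforcing them.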

Usage

Recipe: Extract Page Features of a single local HTML file

./extract_page_features.sh <html_file.html> (This will generate a CSV file)

Recipe: Extract Page Features of an entire Corpus

Extract feature representations of all HTML files located in public/html:

./extract_corpus_features.sh

Both single-page and corpus feature extraction generate CSV files and store them under public/train_and_test.
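After an extraction run, it can be useful to check which CSV files were written and how large they are. The following is a minimal sketch under stated assumptions: the helper summarize_feature_files is hypothetical, the files are assumed to be UTF-8, and no column layout is assumed, so it only counts rows (including a header row, if the files have one).

```python
import csv
from pathlib import Path


def summarize_feature_files(directory):
    """Return {file name: number of CSV rows} for every .csv file in `directory`."""
    summary = {}
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            summary[path.name] = sum(1 for _ in csv.reader(handle))
    return summary


# Directory taken from the text above; prints nothing if no CSV files exist yet.
for name, rows in summarize_feature_files("public/train_and_test").items():
    print(f"{name}: {rows} rows")
```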

Both the page and the corpus feature extraction also generate a file named /public/DOM/dom.html, which contains a visual DOM tree. This file is used for troubleshooting during implementation and at inference time.

Recipe: Train Model

Train the model with all the feature representations located in public/train_and_test:

./train_model.sh

The true labels are expected to be in public/authors.csv. Each line of this file has the form URL Hash; Author name.
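The label file format can be illustrated with a short parser. This is a sketch, not the repository's own loader: it assumes one record per line, a semicolon as the separator, and no header row. The function name parse_authors and the sample hashes are made up for illustration, and how the URL hash is computed is not covered here.

```python
import csv
import io


def parse_authors(text, delimiter=";"):
    """Map URL hash -> author name from lines of the form 'hash; name'."""
    labels = {}
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        url_hash = row[0].strip()
        # Re-join the remainder in case the author field itself contains the delimiter.
        author = delimiter.join(row[1:]).strip()
        if url_hash:
            labels[url_hash] = author
    return labels


# Made-up sample records in the assumed format.
sample = "abc123; Jane Doe\ndef456; John Smith\n"
labels = parse_authors(sample)
print(labels)
```

To read the real file, the same function could be applied to the contents of public/authors.csv.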

Training writes the model files to public/trained_model_all_the_news.

Recipe: Inference from a local HTML file

./inference_from_html.sh <html_file.html>

Recipe: Inference from a local CSV file

./inference_from_csv.sh <csv_file.csv>

Recipe: Inference from a URL

./inference_from_url.sh <URL>

All inference scripts are slow to run (about 3 minutes) because several steps are involved (loading the weights, loading the TensorFlow and BERT libraries) and because execution switches back and forth between Scala and Python in a suboptimal way.
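To measure how long an inference run takes on a given machine, the script can be wrapped in a small timing helper. This is a sketch only: run_timed is a hypothetical helper, and the demonstration call below times a trivial Python command rather than an inference script so that it runs anywhere.

```python
import subprocess
import sys
import time


def run_timed(command):
    """Run `command` (a list of arguments); return (exit code, stdout, elapsed seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(command, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    return completed.returncode, completed.stdout, elapsed


# Stand-in command for demonstration; swap in an inference script to time it.
code, output, seconds = run_timed([sys.executable, "-c", "print('ok')"])
print(f"exit code {code}, {seconds:.2f}s, output: {output.strip()}")
```

For a real measurement, the command list would be something like ["./inference_from_html.sh", "page.html"], where page.html is a placeholder for a local HTML file.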

About

Source code for the Medium article "Extracting the author of news stories with DOM-based segmentation and BERT"

https://medium.com/p/69225ea0e5c2/
