article-extractor corpus corpus-builder corpus-tools dataset datasets html-to-markdown html2text news news-aggregator news-crawler readability scraping scraping-websites text-cleaning text-extraction text-mining text-preprocessing web-scraping

Article extraction dataset from News Websites

Introduction

We release a curated dataset of news articles from more than 200 websites. The dataset contains 424 articles cleaned from any kind of boileplate code, text and ads. Most articles are in English, and were published throughout 2023, most at the end of the year. The dataset is intended for research purposes, and can be used for evaluating text extraction algorithms and boilerplate removal.

The article extraction task involves identifying and isolating the main content of a webpage, typically articles or significant textual content, while removing all irrelevant material such as advertisements, navigation menus, footers, headers, and any other non-essential elements. This process is important in applications, including web search, data mining, content analysis, and the creation of clean datasets for machine learning and natural language processing tasks. Additionally, the extraction process distills the web page down to its core informative content, making it more accessible for analysis or reading purposes.

Article extraction, or boilerplate removal has to overcome several challenges, such as dealing with diverse webpage layouts, identifying the structural and semantic markup of HTML to discern content from boilerplate, and adapting to the dynamic nature of web content and design practices. Techniques for article extraction often involve a combination of heuristic rules, machine learning models, and natural language processing tools to analyze the DOM structure of web pages, evaluate the density and distribution of textual content, and apply models trained to recognize patterns typical of valuable content.

Dataset Description

The dataset consists of 424 articles from 212 different websites. Article sources were selected from the list of popular news websites, blogs, and online magazines. Html files were downloaded with python's requests library, or, in case of cloudflare protection, with playwright. The articles were then manually cleaned from any kind of boilerplate code, text and ads. Additionally, the dataset contains the following metadata for each article:

link: the original URL of the article
site: the website where the article was published
title: the title of the article
date: the publication date of the article
authors: the author(s) of the article
text: the main, cleaned, content of the article
html_file: the name of the html file containing the article (files are stored in the html folder)

Dataset Statistics


Article Count	424
Website Count	212
Avg. Article Length	5731 characters
Avg. Article Length (words)	1116 words

About

Article title, authors, date and body extraction dataset.

article-extractor corpus corpus-builder corpus-tools dataset datasets html-to-markdown html2text news news-aggregator news-crawler readability scraping scraping-websites text-cleaning text-extraction text-mining text-preprocessing web-scraping

Apache License 2.0

Languages

Language:HTML 100.0%Language:Jupyter Notebook 0.0%