The latest data, without Reddit texts, is available under amalgum/ and amalgum_balanced/. (The _balanced variant contains nearly 500,000 tokens per genre, while the unbalanced variant contains slightly more data.) You may download the data without Reddit texts as a zip. The complete corpus, including Reddit data, is available upon request: please email lg876@georgetown.edu.
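The per-genre totals can be checked against the CoNLL-U files. The sketch below counts syntactic words per genre; the directory layout, file extension, and genre-first filename pattern are assumptions, not guaranteed by this README.

```python
import glob
import os
from collections import Counter

def count_tokens(conllu_text):
    """Count syntactic words in a CoNLL-U document: one per numbered
    token line. Skips comments, blank lines, multiword token ranges
    (IDs like '1-2'), and empty nodes (IDs like '1.1')."""
    n = 0
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        tok_id = line.split("\t", 1)[0]
        if "-" in tok_id or "." in tok_id:
            continue
        n += 1
    return n

# Hypothetical usage -- assumes .conllu files under amalgum_balanced/
# with the genre slug leading each filename:
totals = Counter()
for path in glob.glob("amalgum_balanced/**/*.conllu", recursive=True):
    genre = os.path.basename(path).split("_", 1)[0]
    with open(path, encoding="utf-8") as f:
        totals[genre] += count_tokens(f.read())
```

For the balanced variant, each genre's total should come out close to 500,000.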
AMALGUM (A Machine-Annotated Lookalike of GUM) is an English web corpus spanning 8 genres with 4,000,000 tokens and several annotation layers.
Source data was scraped from eight different sources containing stylistically distinct text. Each text's source is indicated with a slug in its filename:
- academic: MDPI
- bio: Wikipedia
- fiction: Project Gutenberg
- interview: Wikinews, Interview category
- news: Wikinews
- reddit: Reddit
- whow: wikiHow
- voyage: wikiVoyage
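Since the genre slug leads each filename, it can be recovered programmatically. A minimal sketch, assuming document names of the form "bio_doc124.xml" (the exact naming pattern is an assumption):

```python
# The eight genre slugs listed above.
GENRES = {"academic", "bio", "fiction", "interview",
          "news", "reddit", "whow", "voyage"}

def genre_of(filename):
    """Return the genre slug from a document name like 'bio_doc124.xml'.

    Assumes the slug is the first underscore-separated field; raises
    if the leading field is not a known genre."""
    slug = filename.split("_", 1)[0]
    if slug not in GENRES:
        raise ValueError(f"unrecognized genre slug: {slug!r}")
    return slug
```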
AMALGUM contains annotations for the following information:
- Tokenization
- UD and Extended PTB part of speech tags
- Lemmas
- UD dependency parses
- (Non-)named entities
- Coreference
- Rhetorical Structure Theory discourse trees
These annotations are distributed across four file formats: GUM-style XML, CoNLL-U, WebAnno TSV, and RS3.
Samples of the data for bio_doc124 are available in each format: xml, conllu, tsv, rs3.
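Several of the layers above (tokenization, lemmas, UD and PTB POS tags, dependency parses) live in fixed tab-separated columns of the CoNLL-U files. A minimal sketch of pulling them out; the embedded sentence is illustrative, not actual AMALGUM data:

```python
# An illustrative CoNLL-U sentence (columns: ID, FORM, LEMMA, UPOS,
# XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). Tabs are built with join()
# for readability.
SAMPLE = "\n".join([
    "# sent_id = example-1",
    "# text = Byron was a poet.",
    "\t".join("1 Byron Byron PROPN NNP _ 4 nsubj _ _".split()),
    "\t".join("2 was be AUX VBD _ 4 cop _ _".split()),
    "\t".join("3 a a DET DT _ 4 det _ _".split()),
    "\t".join("4 poet poet NOUN NN _ 0 root _ _".split()),
    "\t".join("5 . . PUNCT . _ 4 punct _ _".split()),
])

def read_sentence(conllu_text):
    """Parse one CoNLL-U sentence into (form, lemma, upos, xpos,
    head, deprel) tuples, skipping comment and blank lines."""
    rows = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        rows.append((cols[1], cols[2], cols[3], cols[4],
                     int(cols[6]), cols[7]))
    return rows
```

The same column indices apply to any of the corpus's .conllu files; the entity and coreference layers live in the XML and TSV formats instead.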
For more details, please see our paper:
@inproceedings{gessler-etal-2020-amalgum,
title = "{AMALGUM} {--} A Free, Balanced, Multilayer {E}nglish Web Corpus",
author = "Gessler, Luke and
Peng, Siyao and
Liu, Yang and
Zhu, Yilun and
Behzad, Shabnam and
Zeldes, Amir",
booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://www.aclweb.org/anthology/2020.lrec-1.648",
pages = "5267--5275",
abstract = "We present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory. By tapping open online data sources the corpus is meant to offer a more sizable alternative to smaller manually created annotated data sets, while avoiding pitfalls such as imbalanced or unknown composition, licensing problems, and low-quality natural language processing. We harness knowledge from multiple annotation layers in order to achieve a {``}better than NLP{''} benchmark and evaluate the accuracy of the resulting resource.",
language = "English",
ISBN = "979-10-95546-34-4",
}
All annotations under the folders amalgum/ and amalgum_balanced/ are available under a Creative Commons Attribution (CC-BY) license, version 4.0. Note that their texts are sourced from the following websites under their own licenses:
- academic: MDPI, CC BY 4.0
- bio: Wikipedia, CC BY-SA 3.0
- fiction: Project Gutenberg, The Project Gutenberg License
- interview: Wikinews, CC BY 2.5
- news: Wikinews, CC BY 2.5
- whow: wikiHow, CC BY-NC-SA 3.0
- voyage: wikiVoyage, CC BY-SA 3.0
See DEVELOPMENT.md.