The latest data, without Reddit texts, is available under amalgum/ and amalgum_balanced/. (The _balanced variant contains nearly 500,000 tokens per genre, while the unbalanced variant contains slightly more data.) You may download the data without Reddit texts as a zip. The complete corpus, including Reddit data, is available upon request: please email lg876@georgetown.edu.
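The per-genre totals can be checked against the CoNLL-U files. The sketch below counts syntactic words per genre; the directory layout, file extension, and genre-first filename pattern are assumptions, not guaranteed by this README.

```python
import glob
import os
from collections import Counter

def count_tokens(conllu_text):
    """Count syntactic words in a CoNLL-U document: one per numbered
    token line. Skips comments, blank lines, multiword token ranges
    (IDs like '1-2'), and empty nodes (IDs like '1.1')."""
    n = 0
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        tok_id = line.split("\t", 1)[0]
        if "-" in tok_id or "." in tok_id:
            continue
        n += 1
    return n

# Hypothetical usage -- assumes .conllu files under amalgum_balanced/
# with the genre slug leading each filename:
totals = Counter()
for path in glob.glob("amalgum_balanced/**/*.conllu", recursive=True):
    genre = os.path.basename(path).split("_", 1)[0]
    with open(path, encoding="utf-8") as f:
        totals[genre] += count_tokens(f.read())
```

For the balanced variant, each genre's total should come out close to 500,000.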
AMALGUM (A Machine-Annotated Lookalike of GUM) is an English web corpus spanning 8 genres with 4,000,000 tokens and several annotation layers.
Source data was scraped from eight different sources containing stylistically distinct text. Each text's source is indicated with a slug in its filename:
- academic: MDPI
- bio: Wikipedia
- fiction: Project Gutenberg
- interview: Wikinews, Interview category
- news: Wikinews
- reddit: Reddit
- whow: wikiHow
- voyage: wikiVoyage
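Since the genre slug leads each filename, it can be recovered programmatically. A minimal sketch, assuming document names of the form "bio_doc124.xml" (the exact naming pattern is an assumption):

```python
# The eight genre slugs listed above.
GENRES = {"academic", "bio", "fiction", "interview",
          "news", "reddit", "whow", "voyage"}

def genre_of(filename):
    """Return the genre slug from a document name like 'bio_doc124.xml'.

    Assumes the slug is the first underscore-separated field; raises
    if the leading field is not a known genre."""
    slug = filename.split("_", 1)[0]
    if slug not in GENRES:
        raise ValueError(f"unrecognized genre slug: {slug!r}")
    return slug
```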
AMALGUM contains annotations for the following information:
- Tokenization
- UD and Extended PTB part of speech tags
- Lemmas
- UD dependency parses
- (Non-)named entities
- Coreference
- Rhetorical Structure Theory discourse trees
These annotations are distributed across four file formats: GUM-style XML, CoNLL-U, WebAnno TSV, and RS3.
Samples of the data for bio_doc124 are available in each format: xml, conllu, tsv, rs3.
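Several of the layers above (tokenization, lemmas, UD and PTB POS tags, dependency parses) live in fixed tab-separated columns of the CoNLL-U files. A minimal sketch of pulling them out; the embedded sentence is illustrative, not actual AMALGUM data:

```python
# An illustrative CoNLL-U sentence (columns: ID, FORM, LEMMA, UPOS,
# XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). Tabs are built with join()
# for readability.
SAMPLE = "\n".join([
    "# sent_id = example-1",
    "# text = Byron was a poet.",
    "\t".join("1 Byron Byron PROPN NNP _ 4 nsubj _ _".split()),
    "\t".join("2 was be AUX VBD _ 4 cop _ _".split()),
    "\t".join("3 a a DET DT _ 4 det _ _".split()),
    "\t".join("4 poet poet NOUN NN _ 0 root _ _".split()),
    "\t".join("5 . . PUNCT . _ 4 punct _ _".split()),
])

def read_sentence(conllu_text):
    """Parse one CoNLL-U sentence into (form, lemma, upos, xpos,
    head, deprel) tuples, skipping comment and blank lines."""
    rows = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        rows.append((cols[1], cols[2], cols[3], cols[4],
                     int(cols[6]), cols[7]))
    return rows
```

The same column indices apply to any of the corpus's .conllu files; the entity and coreference layers live in the XML and TSV formats instead.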
For more details, please see our paper:
@inproceedings{gessler-etal-2020-amalgum,
title = "{AMALGUM} {--} A Free, Balanced, Multilayer {E}nglish Web Corpus",
author = "Gessler, Luke and
Peng, Siyao and
Liu, Yang and
Zhu, Yilun and
Behzad, Shabnam and
Zeldes, Amir",
booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://www.aclweb.org/anthology/2020.lrec-1.648",
pages = "5267--5275",
abstract = "We present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory. By tapping open online data sources the corpus is meant to offer a more sizable alternative to smaller manually created annotated data sets, while avoiding pitfalls such as imbalanced or unknown composition, licensing problems, and low-quality natural language processing. We harness knowledge from multiple annotation layers in order to achieve a {``}better than NLP{''} benchmark and evaluate the accuracy of the resulting resource.",
language = "English",
ISBN = "979-10-95546-34-4",
}
All annotations under the folders amalgum/ and amalgum_balanced/ are available under a Creative Commons Attribution (CC-BY) license, version 4.0. Note that their texts are sourced from the following websites under their own licenses:
- academic: MDPI, CC BY 4.0
- bio: Wikipedia, CC BY-SA 3.0
- fiction: Project Gutenberg, The Project Gutenberg License
- interview: Wikinews, CC BY 2.5
- news: Wikinews, CC BY 2.5
- whow: wikiHow, CC BY-NC-SA 3.0
- voyage: wikiVoyage, CC BY-SA 3.0
See DEVELOPMENT.md.