Joycy-xh297 / amalgum

English web corpus with 4M tokens and several annotation types

AMALGUM v0.1

Download

The latest data without Reddit texts is available under amalgum/ and amalgum_balanced/. The _balanced variant contains nearly 500,000 tokens per genre, while the unbalanced variant contains slightly more data.

You may download the data without Reddit texts as a zip. The complete corpus, with Reddit data, is available upon request: please email lg876@georgetown.edu.

Description

AMALGUM (A Machine-Annotated Lookalike of GUM) is an English web corpus spanning 8 genres with 4,000,000 tokens and several annotation layers.

Genres

Source data was scraped from eight different sources containing stylistically distinct text. Each text's source is indicated with a slug in its filename (for example, the bio prefix in bio_doc124).
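As a rough illustration of how the slug can be recovered from a filename, here is a minimal Python sketch. The pattern `<slug>_doc<number>` is an assumption inferred from the sample document name bio_doc124, and the helper name `genre_slug` is hypothetical; neither is taken from the corpus tooling.

```python
import re
from pathlib import Path

# Assumed filename pattern: <slug>_doc<number>.<extension>, e.g. "bio_doc124.conllu".
# This is inferred from the sample name "bio_doc124", not from the corpus tooling.
_DOC_NAME = re.compile(r"^(?P<slug>[A-Za-z]+)_doc(?P<number>\d+)$")


def genre_slug(filename):
    """Return the slug encoded in a document filename, or None if it does not match."""
    stem = Path(filename).stem  # drops the directory and the final extension
    match = _DOC_NAME.match(stem)
    return match.group("slug") if match else None


print(genre_slug("bio_doc124.conllu"))
```

Returning None for non-matching names lets a caller skip unrelated files (such as documentation) when walking the data folders.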

Annotations

AMALGUM contains annotations for the following information:

  • Tokenization
  • UD and Extended PTB part-of-speech tags
  • Lemmas
  • UD dependency parses
  • (Non-)named entities
  • Coreference
  • Rhetorical Structure Theory (RST) discourse trees

These annotations are distributed across four file formats: GUM-style XML, CoNLL-U, WebAnno TSV, and RS3.

You can see samples of the data for bio_doc124: xml, conllu, tsv, rs3
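For readers who want to inspect the CoNLL-U files without extra dependencies, the following is a minimal sketch of a reader using only the Python standard library. It relies on the general CoNLL-U conventions (ten tab-separated columns, comment lines starting with `#`, blank lines between sentences). The function name `read_conllu` and the sample sentence are illustrative assumptions and are not drawn from the corpus.

```python
# The ten standard CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(text):
    """Yield each sentence as a list of dicts keyed by the CoNLL-U column names."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comment/metadata line; skipped in this sketch.
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}")
        sentence.append(dict(zip(FIELDS, columns)))
    if sentence:
        yield sentence


# Illustrative sample only; this sentence is made up, not taken from AMALGUM.
SAMPLE = (
    "# text = She writes.\n"
    "1\tShe\tshe\tPRON\tPRP\t_\t2\tnsubj\t_\t_\n"
    "2\twrites\twrite\tVERB\tVBZ\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE))
for token in sentences[0]:
    print(token["form"], token["upos"], token["xpos"], token["lemma"])
```

In practice the text would come from a file, for example `Path(...).read_text(encoding="utf-8")`. Multiword-token and empty-node lines also have ten columns, so this sketch keeps them as ordinary rows rather than treating them specially.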

Further Information

Please see our paper.

Citation

@inproceedings{gessler-etal-2020-amalgum,
    title = "{AMALGUM} {--} A Free, Balanced, Multilayer {E}nglish Web Corpus",
    author = "Gessler, Luke  and
      Peng, Siyao  and
      Liu, Yang  and
      Zhu, Yilun  and
      Behzad, Shabnam  and
      Zeldes, Amir",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.648",
    pages = "5267--5275",
    abstract = "We present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory. By tapping open online data sources the corpus is meant to offer a more sizable alternative to smaller manually created annotated data sets, while avoiding pitfalls such as imbalanced or unknown composition, licensing problems, and low-quality natural language processing. We harness knowledge from multiple annotation layers in order to achieve a {``}better than NLP{''} benchmark and evaluate the accuracy of the resulting resource.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

License

All annotations under the folders amalgum/ and amalgum_balanced/ are available under a Creative Commons Attribution (CC-BY) license, version 4.0. Note that the underlying texts are sourced from websites that apply their own licenses.

Development

See DEVELOPMENT.md.

Languages

Python 99.5%, Shell 0.5%