KelleyYin / domain-adaptation-data

These data come from OPUS (http://opus.nlpl.eu/).

The data cover five domains:

Law (JRC-Acquis), Medical (EMEA), IT (GNOME, KDE, PHP, Ubuntu, and OpenOffice), Koran (Tanzil), and Subtitles (OpenSubtitles).

Please cite OPUS if you use any of the data, and please link to the individual data source as well:

OPUS:

@InProceedings{TIEDEMANN12.463,
  author = {J\"org Tiedemann},
  title = {Parallel Data, Tools and Interfaces in OPUS},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
 }
@InCollection{Tiedemann:RANLP5,
  author =    {J\"org Tiedemann},
  title =   {News from {OPUS} - {A} Collection of Multilingual
                  Parallel Corpora with Tools and Interfaces},
  booktitle =   {Recent Advances in Natural Language Processing},
  publisher =   {John Benjamins, Amsterdam/Philadelphia},
  year =          2009,
  editor =        {N. Nicolov and K. Bontcheva and G. Angelova and
                  R. Mitkov},
  volume =    {V},
  address =   {Borovets, Bulgaria},
  isbn =          {978 90 272 4825 1},
  pdf =           {http://stp.lingfil.uu.se/~joerg/published/ranlp-V.pdf},
  topic  =        {Parallel corpora}
}

Law (JRC-Acquis)

Source: https://ec.europa.eu/jrc/en/language-technologies/jrc-acquis Downloaded from: http://opus.nlpl.eu/JRC-Acquis.php

Medical (EMEA)

Source: http://www.emea.europa.eu/ Downloaded from: http://opus.nlpl.eu/EMEA.php

IT (GNOME, KDE, PHP, Ubuntu, and OpenOffice)

GNOME

Source: https://l10n.gnome.org Downloaded from: http://opus.nlpl.eu/GNOME.php

KDE

Downloaded from: http://opus.nlpl.eu/KDE4.php

PHP

Source: http://se.php.net/download-docs.php Downloaded from: http://opus.nlpl.eu/PHP.php

Ubuntu

Source: https://translations.launchpad.net Downloaded from: http://opus.nlpl.eu/Ubuntu.php

OpenOffice

Source: http://www.openoffice.org/ Downloaded from: http://opus.nlpl.eu/OpenOffice.php

Koran (Tanzil)

Source: http://tanzil.net/ Downloaded from: http://opus.nlpl.eu/Tanzil.php

Subtitles (OpenSubtitles)

Source: http://www.opensubtitles.org/ Downloaded from: http://opus.nlpl.eu/OpenSubtitles2016.php
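For scripting against these sources, the listing above can be written as a small lookup table. This is only a sketch: the URLs are copied from this README, while the names `OPUS_PAGES`, `corpora_for`, and `domain_of` are hypothetical and not part of this repository.

```python
# Domain -> corpus -> OPUS download page, copied from the listing above.
# The variable and function names are illustrative assumptions.
OPUS_PAGES = {
    "Law": {"JRC-Acquis": "http://opus.nlpl.eu/JRC-Acquis.php"},
    "Medical": {"EMEA": "http://opus.nlpl.eu/EMEA.php"},
    "IT": {
        "GNOME": "http://opus.nlpl.eu/GNOME.php",
        "KDE": "http://opus.nlpl.eu/KDE4.php",
        "PHP": "http://opus.nlpl.eu/PHP.php",
        "Ubuntu": "http://opus.nlpl.eu/Ubuntu.php",
        "OpenOffice": "http://opus.nlpl.eu/OpenOffice.php",
    },
    "Koran": {"Tanzil": "http://opus.nlpl.eu/Tanzil.php"},
    "Subtitles": {"OpenSubtitles": "http://opus.nlpl.eu/OpenSubtitles2016.php"},
}


def corpora_for(domain):
    """Return the corpus names listed for a domain, in alphabetical order."""
    return sorted(OPUS_PAGES[domain])


def domain_of(corpus):
    """Return the domain a corpus belongs to, or None if it is not listed."""
    for domain, corpora in OPUS_PAGES.items():
        if corpus in corpora:
            return domain
    return None


if __name__ == "__main__":
    # Print every corpus with the page it was downloaded from.
    for domain, corpora in OPUS_PAGES.items():
        for name, url in corpora.items():
            print(f"{domain}\t{name}\t{url}")
```

A table like this makes it easy to record which OPUS page each corpus came from when linking to the individual data sources.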

These domain splits were first used in:

Six Challenges for Neural Machine Translation

@InProceedings{koehn-knowles:2017:NMT,
  author    = {Koehn, Philipp  and  Knowles, Rebecca},
  title     = {Six Challenges for Neural Machine Translation},
  booktitle = {Proceedings of the First Workshop on Neural Machine Translation},
  month     = {August},
  year      = {2017},
  address   = {Vancouver},
  publisher = {Association for Computational Linguistics},
  pages     = {28--39},
  url       = {http://www.aclweb.org/anthology/W17-3204}
}

Medical (EMEA), IT (GNOME, KDE, PHP, Ubuntu, and OpenOffice), Koran (Tanzil), and Subtitles (OpenSubtitles) were used in:

Neural Lattice Search for Domain Adaptation in Machine Translation

@InProceedings{I17-2004,
  author = 	{Khayrallah, Huda
		and Kumar, Gaurav
		and Duh, Kevin
		and Post, Matt
		and Koehn, Philipp},
  title = 	{Neural Lattice Search for Domain Adaptation in Machine Translation},
  booktitle = 	{Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},
  year = 	{2017},
  publisher = 	{Asian Federation of Natural Language Processing},
  pages = 	{20--25},
  location = 	{Taipei, Taiwan},
  url = 	{http://aclweb.org/anthology/I17-2004}
}
