Large-scale, machine-readable bilingual dictionaries provided by the Chair for Applied Computational Linguistics (ACoLi) at the University of Augsburg, Germany. As a technical basis, we employ OntoLex-Lemon (https://www.w3.org/2016/05/ontolex/) for data modelling, OLiA (http://purl.org/olia) for representing grammatical information, lexvo (http://lexvo.org) for ISO 639 language identifiers and GlottoLog (http://glottolog.org) for identifiers of non-ISO-639 language varieties.
At the moment, we provide OntoLex-lemon and TIAD-TSV editions of open source dictionaries for more than 400 language varieties and more than 2500 language pairs (stable and experimental), with more than 3000 lexical data sets in total, see statistics below. Note that we exclude most smaller data sets (with less than 10,000 translation pairs) in these counts. Additional data has been converted, but is still awaiting copyright clearance.
The data is currently maintained by the Chair for Applied Computational Linguistics (ACoLi) at the University of Augsburg, Germany. The initial development of this data took place 2015-2022 at the Applied Computational Linguistics lab of Goethe University Frankfurt, Germany, in the context of the BMBF-funded research group Linked Open Dictionaries (LiODi, 2015-2022) and in collaboration with the Institute for Empirical Linguistics at Goethe University Frankfurt. The LiODi project aimed at creating Linked Open Data representations of dictionaries and the development of an infrastructure and methodologies for their practical application in language contact studies, mostly in Eurasia and the Caucasus area.
languages | lexical data sets | license | OntoLex/RDF data | TIAD/TSV data | comments | |
---|---|---|---|---|---|---|
Apertium | 46 | 55 | GPL | apertium/apertium-rdf-2019-02-03 (*.rdf.zip) | apertium/apertium-rdf-2019-02-03 (trans*tsv.gz) | modeling based on http://linguistic.linkeddata.es/apertium/, designed for machine translation |
FreeDict | 45 | 145 | GPL | freedict/freedict-rdf-2019-02-05 (*/*.ttl.gz) | freedict/freedict-rdf-2019-02-05 (*/*.tsv.gz) | plain word lists, user-generated content |
DBnary | 119* | 275* | CC-BY-SA 3.0 | external | dbnary/dbnary-tiad-2019-02-16 | * counted only language pairs with 10,000+ entries, user-generated content |
PanLex | 194* | 1651** | CC0 | panlex/panlex-20191001-csv-rdf | panlex/biling-tsv | * only language pairs with 10.000 entries; ** TIAD-TSV files |
MUSE | 45 | 107 | CC-BY-NC 4.0 | muse/muse-rdf-2020-06-12 | muse-tsv-2020-06-12 | machine-generated, high-precision wordlist |
Wikidata | * | * | CC0 | https://www.wikidata.org (external) | wikidata/wikidata-tsv-2020-06-24 | * >400k translation pairs, > 90k language pairs, but very sparse |
OMW | 34 | 40* | open source | external | omw/tsv | * conservative estimate, restricted to combinations of OMW files with identical licenses |
IDS | 234* | 792*,** | CC-BY 4.0 | ids/ontolex | ids/tsv | * counted only language pairs with >10k translations, ** TIAD TSV files |
XDXF | 51 | 107 | GPL | experimental/xdxf/xdxf-rdf-2019-02-22 (*/*.ttl.gz) | experimental/xdxf/xdxf-rdf-2019-02-22 (*/*.tsv.gz) | experimental |
free-dict.de | 2 | 1 | "free" | experimental/free-dict.de/free-dict-de-2020-01-02 (*.ttl.gz) | experimental/free-dict.de/free-dict-de-2020-01-02 (*.tsv.gz) | experimental (partial) |
StarDict | 32 | 130 | "open"/"free" | experimental/stardict/stardict-2020-01-04 (*/*.ttl.gz) | experimental/stardict/stardict-2020-01-04 (*/*.tsv.gz) | experimental (partial) |
total | 430 | 3143 |
- stable/ data releases
- experimental/ work in progress, contains converters/build scripts and resulting data by several individual contributors, including student projects
The ACoLi Dictionary Graph is maintained and continues to be developed at the Chair of Applied Computational Linguistics at the University of Augsburg, Germany. Prior to 2023, the ACoLi Dictionary Graph has been created at the Applied Computational Linguistics Lab at Goethe Universität Frankfurt, Germany since 2014 in the context of numerous research projects, including
- the Independent Research Group "Linked Open Dictionaries" (LiODi), funded in the eHumanities programme of the German Federal Ministry of Education and Science (BMBF, 2015-2020)
- Horion 2020 Research and Innovation Action Pret-a-LLOD. Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors, funded by the European Research Council (ERC, 2019-2021)
To refer to the dataset as a whole in scientific publications, please refer to Chiarcos et al. (2020):
@inproceedings{chiarcos2020acoli,
title={The ACoLi Dictionary Graph},
author={Chiarcos, Christian and F{\"a}th, Christian and Ionov, Maxim},
booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
pages={3281--3290},
year={2020}
}
All datasets are published under open or non-commercial licenses. We put our RDF and TIAD-TSV editions are put under the same license as the underlying source data. For detailed acknowledgements and licensing of individual datasets see the respective subdirectories.