chrismattmann / bigtranslate

A system based on Apache OODT, Apache Tika, and Apache Solr that automatically translates large TSV datasets from one language to another. Built and inspired by the DARPA XDATA Employment dataset.

BigTranslate

BigTranslate is a distributed, parallelized (MapReduce) wrapper around Apache™ Tika and its Translation API, which it accesses through Tika-Python. It uses Apache™ OODT to split and distribute the machine translation of many millions of rows of data. The system has been tested on up to 190 million rows of TSV data, involving millions of translations on 16-core nodes, and finishes in a reasonable amount of time. BigTranslate uses ETLLib as a clean facade for JSON and TSV data processing and to prepare data for translation with Tika. Once translated, the data is ingested into Apache™ Solr for querying, retrieval, and large-scale analytics.
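The split-and-distribute idea described above can be illustrated with a small sketch. This is not BigTranslate's actual implementation: it uses a local thread pool in place of Apache OODT, and the names `chunk_rows`, `translate_chunk`, and `translate_tsv` are hypothetical. The `translate` argument is a placeholder for whatever translation call is used in practice (for example, a function wrapping Tika-Python); here it is any callable that maps a string to a string.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, List

Row = List[str]


def chunk_rows(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield successive lists of at most `size` rows (the "split" step)."""
    chunk: List[Row] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def translate_chunk(chunk: List[Row], column: int,
                    translate: Callable[[str], str]) -> List[Row]:
    """Translate one column of every row in a chunk (the "map" step)."""
    return [
        row[:column] + [translate(row[column])] + row[column + 1:]
        for row in chunk
    ]


def translate_tsv(tsv_text: str, column: int,
                  translate: Callable[[str], str],
                  chunk_size: int = 1000, workers: int = 4) -> str:
    """Split TSV text into chunks, translate them in parallel, and rejoin."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so row order is preserved.
        results = pool.map(
            lambda chunk: translate_chunk(chunk, column, translate),
            chunk_rows(rows, chunk_size),
        )
        for translated in results:
            writer.writerows(translated)
    return out.getvalue()


def fake_translate(text: str) -> str:
    """Stand-in translator for demonstration; NOT a real translation API."""
    return {"hola": "hello", "adiós": "goodbye"}.get(text, text)


if __name__ == "__main__":
    sample = "1\thola\n2\tadiós\n3\tgracias\n"
    print(translate_tsv(sample, 1, fake_translate, chunk_size=2, workers=2))
```

Passing the translator in as a callable keeps the chunking and fan-out logic independent of any particular translation backend, so the same sketch can be exercised offline with a stub.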

Apache™ Tika provides a facade to, and has been tested with, several Machine Translation APIs.

See the wiki for more information on installing and running BigTranslate.

You can clone the wiki by running:
git clone https://github.com/chrismattmann/bigtranslate.wiki.git

About

License: Apache License 2.0

Languages

Shell 48.0%, XSLT 24.5%, JavaScript 21.0%, CSS 3.6%, HTML 2.9%