AmericasNLP / americasnlp2022

AmericasNLP 2022

Important Dates

  • Submission deadline for ASR and Speech-to-text translation tasks: October 14, 2022
  • Submission deadline for machine translation task: October 25, 2022
  • Results announcement: October 29, 2022

Submission

The official submission leaderboards can be found at the following links:

Languages

| Code | Language | Translation Pair |
|------|----------|------------------|
| bzd | Bribri | Spanish |
| gn | Guaraní | Spanish |
| gvc | Kotiria | Portuguese |
| tav | Wa'ikhana | Portuguese |
| quy | Quechua | Spanish |

Data

Test files for the ASR task are available here.

Downloading

The data for the competition can be found here. Alternatively, you can use the provided download script to download the data for all languages automatically. The script takes a single argument, the folder into which the data should be downloaded:

./download_data.sh destination_folder

Data format

Each language folder contains two subfolders, each corresponding to a different training split. Each subfolder contains multiple audio files and a single tsv file with all transcriptions and translations. Each audio file contains a single sentence or utterance. The tsv file is structured as follows:

| Header | Content |
|--------|---------|
| wav | The corresponding audio filename. |
| source_processed | A processed version of the audio transcription. |
| source_raw | The original raw transcript. Please use this column for training and evaluation, and ignore the source_processed column. |
| target_raw | The translation of the transcription into either Spanish or Portuguese. |
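
As an illustration of the format described above, the following is a minimal sketch of how one split's tsv file might be read in Python. It is not part of the official tooling: the function name `load_split` is hypothetical, and the sketch assumes that the tsv file is UTF-8 encoded, that its first row holds the column headers listed above, and that the audio files sit in the same folder as the tsv file.

```python
import csv
from pathlib import Path


def load_split(tsv_path):
    """Read one split's tsv file and return a list of example dicts.

    Hypothetical helper, not part of the repository.
    """
    tsv_path = Path(tsv_path)
    examples = []
    with tsv_path.open(encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside transcripts as literal text
        # instead of treating them as field delimiters.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            examples.append({
                # Assumption: the audio file is stored next to the tsv file.
                "audio": tsv_path.parent / row["wav"],
                # source_raw is used; source_processed is ignored as requested.
                "transcript": row["source_raw"],
                "translation": row["target_raw"],
            })
    return examples
```

Calling `load_split("path/to/split/file.tsv")` (a placeholder path) would then return one dictionary per utterance, pairing the audio path with its raw transcript and its translation.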

Baselines

ASR Baseline

The baseline model for the ASR task has been implemented in espnet. The scripts to run the model can be found in the following directory of the espnet repository.
