machine-translation speech-translation corpus parallel-corpus parallel-corpora end-to-end-machine-learning forced-alignment speech-processing mfcc-features bitext sentence-embeddings sentence-similarity statistical-machine-translation speech-recognition text-processing text-preprocessinig web-scraping named-entity-recognition audio-data sql-database

IESTAC

IESTAC (Italian-English Speech and Text Audiobook Corpus) is a corpus designed to train English-to-Italian End-to-End Speech-to-Text Machine Translation models. The corpus consists of 60561 triplets, each made up of an English audio segment, its English source text, and its Italian textual translation. These segments were extracted from 373 chapters read by 98 speakers, for a total of 131.23 hours of English audio aligned with both the English source text and the Italian textual translation.

Methodology
SQL Database
Links to Corpus Download
Corpus Statistics
Reference
Licence

Methodology

For details on the methodological approach, the experiments, and the evaluation of this corpus, please refer to IESTAC: English-Italian Parallel Corpus for End-to-End Speech-to-Text Machine Translation. The following table gives a quick overview of the tasks tackled and the technologies used.

Task	Technology
English-to-Italian Book Title Translation	Named Entity Recognition, WikiData, SPARQL
Data Collection	Web Scraping, BeautifulSoup
Chapter Extraction from Text Files	Python, RegEx
Sentence Segmentation	spaCy
Bilingual Dictionary Generation	Moses SMT, GIZA++
Gale-Church-Based Bilingual Text Alignment	Hunalign
Sentence-Embedding-Based Bilingual Text Alignment	Facebook LASER, Vecalign
Forced Alignment	Aeneas
Audio Processing and Feature Extraction	SpeechPy, NumPy

SQL Database

database.sql.gz contains a compressed SQL database that allows users to query the corpus according to their needs. The database is composed of 6 tables.

The metadata table provides information about the original author, the Italian translator, the book titles, etc. Where we could not identify the Italian translator, a reference to the publication is given instead.

The public_domain_status table provides information about authors and their date of death. You have to respect the copyright laws of your country. This corpus is intended for countries where literary works enter the public domain 50 years after the death of the author. If you are in a country where literary works enter the public domain 70 years after the death of the author, you can still use this corpus as long as you filter out all the segments belonging to the few authors who are not yet in the public domain there. Please remember that translators have to be considered authors as well, as translations are creative works. Check the laws of the country where you are located before using this corpus, and filter out the works that you cannot use under those laws.
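The filtering step described above can be sketched as a join between metadata and public_domain_status. This is only a minimal sketch under several assumptions that the corpus does not confirm: that date_of_death stores a year, that author_it holds the translator's name, that names in metadata match names in public_domain_status exactly, and that a work becomes usable once the year of death plus the copyright term is strictly earlier than the current year. The in-memory SQLite database and its rows are invented stand-ins for the real dump, and usable_book_ids is a hypothetical helper name. It is not legal advice.

```python
import sqlite3


def usable_book_ids(conn, term_years, current_year):
    """Return book_ids whose author AND translator both died more than
    term_years before current_year (hypothetical helper).

    Books whose author or translator has no matching row, or a NULL
    date_of_death, are dropped: an unknown status is treated as "not cleared".
    """
    query = """
        SELECT DISTINCT m.book_id
        FROM metadata AS m
        JOIN public_domain_status AS a ON a.author = m.author
        JOIN public_domain_status AS t ON t.author = m.author_it
        WHERE a.date_of_death + :term < :year
          AND t.date_of_death + :term < :year
        ORDER BY m.book_id
    """
    rows = conn.execute(query, {"term": term_years, "year": current_year})
    return [row[0] for row in rows]


# Toy stand-in for the real database: only the columns used above,
# filled with invented names and years.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metadata (book_id INTEGER, author TEXT, author_it TEXT);
    CREATE TABLE public_domain_status (author TEXT, date_of_death INTEGER);
    INSERT INTO metadata VALUES
        (1, 'Author A', 'Translator A'),
        (2, 'Author B', 'Translator B'),
        (3, 'Author C', 'Unknown translator (publication reference)');
    INSERT INTO public_domain_status VALUES
        ('Author A', 1900), ('Translator A', 1940),
        ('Author B', 1910), ('Translator B', 1960),
        ('Author C', 1890);
""")

print(usable_book_ids(conn, term_years=50, current_year=2024))  # [1, 2]
print(usable_book_ids(conn, term_years=70, current_year=2024))  # [1]
```

Book 3 is excluded in both cases because its translator has no entry in public_domain_status. Excluding such rows is a deliberately conservative choice; they would need to be checked by hand.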

The tables in the database are described below. Please note that the SQL database does not contain the audio files, only their names. To download the audio files, see Links to Corpus Download.

alignments
+---------------+-------+------+-----+---------+----------------+
| Field         | Type  | Null | Key | Default | Extra          |
+---------------+-------+------+-----+---------+----------------+
| id            | int   | NO   | PRI | NULL    | auto_increment |
| audiofilename | text  | NO   |     | NULL    |                |
| eng_text      | text  | NO   |     | NULL    |                |
| it_text       | text  | NO   |     | NULL    |                |
| book_id       | int   | NO   |     | NULL    |                |
| chapter_id    | int   | NO   |     | NULL    |                |
+---------------+-------+------+-----+---------+----------------+

audio_chapters
+------------+------+------+-----+---------+-------+
| Field      | Type | Null | Key | Default | Extra |
+------------+------+------+-----+---------+-------+
| chapter_id | int  | NO   |     | NULL    |       |
| book_id    | int  | NO   |     | NULL    |       |
| speaker_id | text | NO   |     | NULL    |       |
+------------+------+------+-----+---------+-------+

audio_segments
+---------------+-------+------+-----+---------+-------+
| Field         | Type  | Null | Key | Default | Extra |
+---------------+-------+------+-----+---------+-------+
| audiofilename | text  | NO   |     | NULL    |       |
| duration      | float | NO   |     | NULL    |       |
+---------------+-------+------+-----+---------+-------+

metadata
+----------------+------+------+-----+---------+-------+
| Field          | Type | Null | Key | Default | Extra |
+----------------+------+------+-----+---------+-------+
| book_id        | int  | NO   |     | NULL    |       |
| gutenberg_link | text | NO   |     | NULL    |       |
| librivox_link  | text | NO   |     | NULL    |       |
| author         | text | NO   |     | NULL    |       |
| author_it      | text | NO   |     | NULL    |       |
| title_en       | text | NO   |     | NULL    |       |
| title_it       | text | NO   |     | NULL    |       |
+----------------+------+------+-----+---------+-------+

public_domain_status
+---------------+------+------+-----+---------+-------+
| Field         | Type | Null | Key | Default | Extra |
+---------------+------+------+-----+---------+-------+
| author        | text | YES  |     | NULL    |       |
| date_of_death | int  | YES  |     | NULL    |       |
+---------------+------+------+-----+---------+-------+

speakers
+--------------+------+------+-----+---------+-------+
| Field        | Type | Null | Key | Default | Extra |
+--------------+------+------+-----+---------+-------+
| speaker_id   | int  | NO   |     | NULL    |       |
| speaker_link | text | NO   |     | NULL    |       |
| speaker_name | text | NO   |     | NULL    |       |
+--------------+------+------+-----+---------+-------+
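As an illustration of how the tables relate, the sketch below joins alignments, audio_segments, audio_chapters, and speakers to retrieve the audio/English/Italian triplets together with their duration and reader. It is a hedged example, not the authors' own query: it uses an in-memory SQLite database with invented rows in place of the real dump, it assumes that duration is expressed in seconds, that a chapter is identified by the pair (book_id, chapter_id), and that the text speaker_id in audio_chapters can be cast to the integer speaker_id in speakers. The function name fetch_triplets is hypothetical.

```python
import sqlite3

# Schema reduced to the four tables used here, plus invented toy rows.
SCHEMA_AND_TOY_ROWS = """
CREATE TABLE alignments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    audiofilename TEXT NOT NULL,
    eng_text TEXT NOT NULL,
    it_text TEXT NOT NULL,
    book_id INTEGER NOT NULL,
    chapter_id INTEGER NOT NULL
);
CREATE TABLE audio_chapters (
    chapter_id INTEGER NOT NULL,
    book_id INTEGER NOT NULL,
    speaker_id TEXT NOT NULL
);
CREATE TABLE audio_segments (
    audiofilename TEXT NOT NULL,
    duration REAL NOT NULL
);
CREATE TABLE speakers (
    speaker_id INTEGER NOT NULL,
    speaker_link TEXT NOT NULL,
    speaker_name TEXT NOT NULL
);
INSERT INTO alignments (audiofilename, eng_text, it_text, book_id, chapter_id) VALUES
    ('seg_0001.wav', 'Hello.', 'Ciao.', 1, 1),
    ('seg_0002.wav', 'Good morning.', 'Buongiorno.', 1, 1),
    ('seg_0003.wav', 'Thank you.', 'Grazie.', 2, 1);
INSERT INTO audio_chapters VALUES (1, 1, '10'), (1, 2, '11');
INSERT INTO audio_segments VALUES
    ('seg_0001.wav', 2.5), ('seg_0002.wav', 12.0), ('seg_0003.wav', 4.0);
INSERT INTO speakers VALUES
    (10, 'https://example.org/reader-a', 'Reader A'),
    (11, 'https://example.org/reader-b', 'Reader B');
"""


def fetch_triplets(conn, max_duration):
    """Return (audiofilename, eng_text, it_text, duration, speaker_name)
    for every aligned segment no longer than max_duration."""
    query = """
        SELECT a.audiofilename, a.eng_text, a.it_text, s.duration, sp.speaker_name
        FROM alignments AS a
        JOIN audio_segments AS s ON s.audiofilename = a.audiofilename
        JOIN audio_chapters AS c
          ON c.chapter_id = a.chapter_id AND c.book_id = a.book_id
        JOIN speakers AS sp ON sp.speaker_id = CAST(c.speaker_id AS INTEGER)
        WHERE s.duration <= ?
        ORDER BY a.id
    """
    return conn.execute(query, (max_duration,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA_AND_TOY_ROWS)

for row in fetch_triplets(conn, max_duration=10.0):
    print(row)
```

With the toy rows above, the 12-second segment is filtered out and the two remaining segments are returned in id order. Against the real database, the same SELECT statement should carry over to a MySQL client with little or no change, although that has not been verified here.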

Links to Corpus Download

A binary file containing the 60561 triplets of parallel English audio, English text, and Italian textual translation is available as a Kaggle dataset. The Kaggle dataset also contains two parallel texts for textual Machine Translation.

A download link to NumPy arrays containing 40 MFCC features extracted from the audio segments and aligned with their English transcriptions will be added.

Corpus Statistics

Datum	Value
Aligned Segments	60561
Average Segment Duration	7.80s
Total Hours of English Speech	131.23
Number of Speakers	98
Number of Aligned Chapters	373
Average Number of Chapters Read per Speaker	3.80
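
The two averages in the table can be derived from the other four figures. The short check below recomputes them; the variable names are chosen for illustration only, and the comparison uses a small tolerance because it is not stated whether the published values were rounded or truncated.

```python
# Figures copied from the Corpus Statistics table.
aligned_segments = 60561
total_hours = 131.23
speakers = 98
aligned_chapters = 373

# Average segment duration in seconds: total audio time / number of segments.
avg_segment_seconds = total_hours * 3600 / aligned_segments

# Average number of chapters read per speaker.
avg_chapters_per_speaker = aligned_chapters / speakers

print(f"Average segment duration: {avg_segment_seconds:.3f} s")
print(f"Average chapters per speaker: {avg_chapters_per_speaker:.3f}")

# Both recomputed values agree with the table to within 0.01.
assert abs(avg_segment_seconds - 7.80) < 0.01
assert abs(avg_chapters_per_speaker - 3.80) < 0.01
```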

Reference

Course: Master's Thesis in Language Technology, Uppsala University

Supervisor: Sara Stymne

To use this work please cite:

@inproceedings{della-corte-stymne-2020-multi,
    title = "IESTAC: English-Italian Parallel Corpus for End-to-End Speech-to-Text Machine Translation",
    author = "Della Corte, Giuseppe  and
      Stymne, Sara",
    booktitle = "Proceedings of the First International Workshop on Natural Language Processing Beyond Text",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlpbt-1.5",
    pages = "41--50",}

Licence

This corpus is intended for countries where literary works enter the public domain 50 years after the death of the author. Please remember that translators have to be considered authors as well, as translations are creative works. Check the laws of the country where you are located before using this corpus.

You agree to indemnify and hold the authors of the corpus and any contributor harmless from all liability, costs and expenses, including legal fees, that arise directly or indirectly from any of the following which you do or cause to occur: (a) distribution of this corpus, (b) alteration, modification, or additions or deletions to this corpus, and (c) any defect you cause.

This work is licensed under a Creative Commons Attribution 4.0 International License.

CREATIVE COMMONS MAKES NO WARRANTIES REGARDING THE INFORMATION PROVIDED, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM ITS USE. THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS CREATIVE COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE").
