IESTAC (Italian-English speech and text audiobook corpus) is a corpus designed to train English-to-Italian End-to-End Speech-to-Text Machine Translation models. The corpus consists of 60561 triplets of English audio, English source texts, and Italian textual translations. These segments were extracted from 373 chapters read by 98 speakers for a total amount of 131.23 hours of English audio aligned with both its English source text and its Italian textual translation.
For details on the methodology, the experiments, and the evaluation of this corpus, please refer to IESTAC: English-Italian Parallel Corpus for End-to-End Speech-to-Text Machine Translation. The following table gives a quick overview of the tasks tackled and the technologies used.
Task | Technology |
---|---|
English-to-Italian Book Title Translation | Named Entity Recognition, WikiData, SPARQL |
Data Collection | Web Scraping, BeautifulSoup |
Chapter Extraction from Text Files | Python, RegEx |
Sentence Segmentation | spaCy |
Bilingual Dictionary Generation | Moses SMT, GIZA++ |
Gale-Church-Based Bilingual Text Alignment | Hunalign |
Sentence-Embedding-Based Bilingual Text Alignment | Facebook LASER, Vecalign |
Forced Alignment | Aeneas |
Audio Processing and Feature Extraction | SpeechPy, NumPy |
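The chapter-extraction step listed in the table can be illustrated with a small regular-expression sketch. The heading pattern (a line starting with `CHAPTER` followed by a Roman or Arabic numeral) and the function name `split_chapters` are assumptions made for illustration; the rules actually used to build the corpus are not described in this README.

```python
import re

# Hypothetical heading pattern: a line such as "CHAPTER I" or "Chapter 12".
CHAPTER_HEADING = re.compile(
    r"^[ \t]*chapter[ \t]+([ivxlcdm]+|\d+)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_chapters(text):
    """Return a list of (heading, body) pairs found in a plain-text book."""
    matches = list(CHAPTER_HEADING.finditer(text))
    chapters = []
    for i, match in enumerate(matches):
        # A chapter body runs from the end of its heading to the next heading.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters.append((match.group(0).strip(), text[match.end():end].strip()))
    return chapters


# Invented sample text, not taken from the corpus.
sample = """Preface text.

CHAPTER I

It was a dark night.

CHAPTER II

The morning came.
"""

for heading, body in split_chapters(sample):
    print(heading, "->", body)
```

Text before the first heading (front matter, prefaces) is dropped by this sketch, which may or may not be desirable for a given book.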
database.sql.gz contains a compressed SQL database that allows users to query the corpus according to their needs. The database is composed of 6 tables.
The metadata table provides information about the original author, the Italian translator, the book titles, and so on. Where the Italian translator could not be identified, a reference to the publication is given instead.
The public_domain_status table lists authors and their dates of death. You must respect the copyright laws of your country. This corpus is intended for countries where literary works enter the public domain 50 years after the death of the author. If you are in a country where literary works enter the public domain 70 years after the death of the author, you can still use this corpus as long as you filter out all the segments belonging to the few authors who do not meet that term. Please remember that translators must be considered authors as well, since translations are creative works. Check the laws of the country where you are located before using this corpus, and filter out any works that you cannot use there.
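The filtering rule described above can be sketched as follows. The function name `usable_authors`, the toy rows, and the reading of `date_of_death` as a calendar year are assumptions; authors with an unknown death date are conservatively excluded. This is an illustration, not the corpus authors' own procedure.

```python
def usable_authors(rows, term_years, current_year):
    """Return the authors whose works are assumed to be in the public domain.

    rows: iterable of (author, year_of_death) pairs, as in public_domain_status.
    term_years: protection term after death in your country (e.g. 50 or 70).
    """
    usable = set()
    for author, year_of_death in rows:
        if year_of_death is None:
            # Unknown death date: be conservative and exclude the author.
            continue
        # Assumption: protection lasts until the end of the calendar year
        # that falls term_years after the year of death.
        if year_of_death + term_years < current_year:
            usable.add(author)
    return usable


# Toy rows, not taken from the corpus.
rows = [("Author A", 1900), ("Author B", 1960), ("Author C", None)]
print(sorted(usable_authors(rows, term_years=50, current_year=2020)))
print(sorted(usable_authors(rows, term_years=70, current_year=2020)))
```

Because translators count as authors, a book would be kept only if both its original author and its Italian translator are in the returned set.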
A description of the tables in the database follows. Please note that the SQL database does not contain the audio files, only their names. To download the audio files, see Links to Corpus Download.
alignments
+---------------+-------+------+-----+---------+----------------+
| Field         | Type  | Null | Key | Default | Extra          |
+---------------+-------+------+-----+---------+----------------+
| id            | int   | NO   | PRI | NULL    | auto_increment |
| audiofilename | text  | NO   |     | NULL    |                |
| eng_text      | text  | NO   |     | NULL    |                |
| it_text       | text  | NO   |     | NULL    |                |
| book_id       | int   | NO   |     | NULL    |                |
| chapter_id    | int   | NO   |     | NULL    |                |
+---------------+-------+------+-----+---------+----------------+
audio_chapters
+------------+------+------+-----+---------+-------+
| Field      | Type | Null | Key | Default | Extra |
+------------+------+------+-----+---------+-------+
| chapter_id | int  | NO   |     | NULL    |       |
| book_id    | int  | NO   |     | NULL    |       |
| speaker_id | text | NO   |     | NULL    |       |
+------------+------+------+-----+---------+-------+
audio_segments
+---------------+-------+------+-----+---------+-------+
| Field         | Type  | Null | Key | Default | Extra |
+---------------+-------+------+-----+---------+-------+
| audiofilename | text  | NO   |     | NULL    |       |
| duration      | float | NO   |     | NULL    |       |
+---------------+-------+------+-----+---------+-------+
metadata
+----------------+------+------+-----+---------+-------+
| Field          | Type | Null | Key | Default | Extra |
+----------------+------+------+-----+---------+-------+
| book_id        | int  | NO   |     | NULL    |       |
| gutenberg_link | text | NO   |     | NULL    |       |
| librivox_link  | text | NO   |     | NULL    |       |
| author         | text | NO   |     | NULL    |       |
| author_it      | text | NO   |     | NULL    |       |
| title_en       | text | NO   |     | NULL    |       |
| title_it       | text | NO   |     | NULL    |       |
+----------------+------+------+-----+---------+-------+
public_domain_status
+---------------+------+------+-----+---------+-------+
| Field         | Type | Null | Key | Default | Extra |
+---------------+------+------+-----+---------+-------+
| author        | text | YES  |     | NULL    |       |
| date_of_death | int  | YES  |     | NULL    |       |
+---------------+------+------+-----+---------+-------+
speakers
+--------------+------+------+-----+---------+-------+
| Field        | Type | Null | Key | Default | Extra |
+--------------+------+------+-----+---------+-------+
| speaker_id   | int  | NO   |     | NULL    |       |
| speaker_link | text | NO   |     | NULL    |       |
| speaker_name | text | NO   |     | NULL    |       |
+--------------+------+------+-----+---------+-------+
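As a sketch of how the tables relate, the following query joins `alignments`, `audio_segments`, and `metadata`. To stay self-contained it runs against an in-memory SQLite database with two invented rows and only the columns the query needs. The distributed dump appears to target MySQL (the `auto_increment` column suggests it) and would first have to be imported into a compatible server. The join keys (`audiofilename`, `book_id`) are inferred from the column names, and the helper `segments_shorter_than` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE alignments (id INTEGER PRIMARY KEY, audiofilename TEXT, eng_text TEXT,
                         it_text TEXT, book_id INT, chapter_id INT);
CREATE TABLE audio_segments (audiofilename TEXT, duration FLOAT);
CREATE TABLE metadata (book_id INT, author TEXT, title_en TEXT, title_it TEXT);
-- Invented rows for illustration only.
INSERT INTO alignments VALUES (1, 'seg_0001.wav', 'Good morning.', 'Buongiorno.', 10, 1);
INSERT INTO alignments VALUES (2, 'seg_0002.wav', 'Thank you.', 'Grazie.', 10, 1);
INSERT INTO audio_segments VALUES ('seg_0001.wav', 1.5);
INSERT INTO audio_segments VALUES ('seg_0002.wav', 12.0);
INSERT INTO metadata VALUES (10, 'Some Author', 'Some Title', 'Un Titolo');
""")


def segments_shorter_than(conn, max_seconds):
    """Return (audiofilename, eng_text, it_text, title_en, duration) rows."""
    query = """
        SELECT a.audiofilename, a.eng_text, a.it_text, m.title_en, s.duration
        FROM alignments AS a
        JOIN audio_segments AS s ON s.audiofilename = a.audiofilename
        JOIN metadata AS m ON m.book_id = a.book_id
        WHERE s.duration < ?
        ORDER BY a.id
    """
    return conn.execute(query, (max_seconds,)).fetchall()


rows = segments_shorter_than(conn, 10.0)
print(rows)
```

The same `SELECT` should carry over to a MySQL client, although the parameter placeholder syntax depends on the driver used.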
A binary file containing the 60561 triplets of parallel English audio, English text, and Italian textual translation is available as a Kaggle dataset. The Kaggle dataset also contains two parallel texts for textual Machine Translation.
A download link for NumPy arrays containing 40 MFCC features extracted from the audio segments and aligned with their English transcriptions will be added.
Datum | Value |
---|---|
Aligned Segments | 60561 |
Average Segment Duration | 7.80s |
Total Hours of English Speech | 131.23 |
Number of Speakers | 98 |
Number of Aligned Chapters | 373 |
Average Number of Chapters Read by Speaker | 3.80 |
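The average segment duration in the table can be checked from the total hours and the number of aligned segments, as in this small sketch (the variable names are illustrative).

```python
total_hours = 131.23      # total hours of English speech
aligned_segments = 60561  # number of aligned segments

# Average duration in seconds = total seconds / number of segments.
average_seconds = total_hours * 3600 / aligned_segments
print(f"{average_seconds:.2f}s")
```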
Course: Master's Thesis in Language Technology, Uppsala University
Supervisor: Sara Stymne
To use this work please cite:
@inproceedings{della-corte-stymne-2020-multi,
title = "IESTAC: English-Italian Parallel Corpus for End-to-End Speech-to-Text Machine Translation",
author = "Della Corte, Giuseppe and
Stymne, Sara",
booktitle = "Proceedings of the First International Workshop on Natural Language Processing Beyond Text",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.nlpbt-1.5",
pages = "41--50",}
This corpus is intended for countries where literary works enter the public domain 50 years after the death of the author. Please remember that translators must be considered authors as well, since translations are creative works. Check the laws of the country where you are located before using this corpus.
You agree to indemnify and hold the authors of the corpus and any contributor harmless from all liability, costs and expenses, including legal fees, that arise directly or indirectly from any of the following which you do or cause to occur: (a) distribution of this corpus, (b) alteration, modification, or additions or deletions to this corpus, and (c) any defect you cause.
This work is licensed under a Creative Commons Attribution 4.0 International License.
CREATIVE COMMONS MAKES NO WARRANTIES REGARDING THE INFORMATION PROVIDED, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM ITS USE. THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS CREATIVE COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE").