Speech-to-Text (S2T) toolkit
Overview
This repository extends the Fairseq toolkit for speech-to-text (S2T) generation tasks, providing comprehensive support for Automatic Speech Recognition (ASR), Machine Translation (MT), and Speech Translation (ST).
Features
- Complete recipes: Kaldi-style recipe support for ASR, MT, and ST tasks, ensuring a smooth workflow.
- Various configurations: An extensive collection of YAML configuration files to customize models for different tasks and scenarios.
- Easy reproduction: Comprehensive support for the methods from our papers, including SATE, PDS, CTC-NAST, BiL-CTC, and more.
- Multiple inference strategies: Greedy decoding, beam search, CTC decoding, CTC rescoring, and more.
- More features can be found in the run.sh file.
Installation
- Clone the repository:
  git clone https://github.com/xuchennlp/S2T.git
- Navigate to the project directory and install the required dependencies:
  cd S2T
  pip install -e .
Tested with Python 3.8 and PyTorch 1.11.0.
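To verify the installation, a quick import check can help; this is a minimal sketch, assuming the editable install exposes the bundled fairseq package:

# Both commands should print version numbers (1.11.0 for PyTorch in our setup)
python -c "import torch; print(torch.__version__)"
python -c "import fairseq; print(fairseq.__version__)"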
Quick Start
- Download your dataset and process it into the format of the MuST-C dataset (a sample layout is sketched after this list).
- Run the shell script run.sh in the corresponding directory as follows:
  # Set the ST_DIR environment variable to the parent directory of the S2T directory
  export ST_DIR=/path/to/S2T/..
  cd egs/mustc/st/
  ./run.sh --stage 0 --stop_stage 2
- Stage 0 performs data processing, including audio feature extraction (not required for MT), vocabulary generation, and generation of the training and testing files.
- Stage 1 performs model training, where multiple configurations are supported.
- Stage 2 performs model inference, where multiple decoding strategies are supported.
- All details are available in run.sh.
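For reference, a MuST-C-style corpus typically unpacks to a layout like the one below. This is a sketch based on the public MuST-C releases; names such as mustc_root and the en-de language pair are illustrative, not fixed by the toolkit:

mustc_root/
  en-de/
    data/
      train/
        wav/   # audio files
        txt/   # train.yaml (segment boundaries), train.en, train.de
      dev/
      tst-COMMON/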
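Because --stage and --stop_stage bound the pipeline, individual stages can also be run in isolation, e.g. to redo only inference after training. A small sketch, reusing only the flags shown above:

# Data processing only (features, vocabulary, training/testing files)
./run.sh --stage 0 --stop_stage 0
# Training only (assumes stage 0 outputs already exist)
./run.sh --stage 1 --stop_stage 1
# Inference only (assumes a trained model is available)
./run.sh --stage 2 --stop_stage 2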
Reproduction of our methods
SATE: Stacked Acoustic and Textual Encoding (ACL 2021)
Highlights: a simple and effective method that leverages pre-trained ASR and MT models to improve the end-to-end ST model; introduces an adapter to bridge the pre-trained encoders
Here is an example on the MuST-C ST dataset.
cd egs/mustc/st/
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_sate
PDS: Progressive Down-Sampling (ACL 2023 findings)
Paper: Bridging the Granularity Gap for Acoustic Modeling
Highlights: an effective method that eases the convergence of S2T models by increasing the modeling granularity of acoustic representations
Here is an example on the MuST-C ST dataset. This method also supports the ASR task; a sketch follows the ST example.
cd egs/mustc/st/
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_pds
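The ASR variant should follow the same pattern; this is a hedged sketch, assuming an analogous ASR recipe directory exists (egs/mustc/asr/ is an assumed path, not confirmed by this README):

# Assumed ASR recipe directory; adjust the path to your checkout
cd egs/mustc/asr/
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_pds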
NAST: Non-Autoregressive Speech Translation (ACL 2023)
Paper: CTC-based Non-autoregressive Speech Translation
Highlights: a non-autoregressive modeling method that relies only on CTC inference and achieves results comparable to autoregressive methods
Here is an example on the MuST-C ST dataset.
cd egs/mustc/st/
# Non-autoregressive modeling
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_nast
# Autoregressive modeling
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_ctc_aug
BiL-CTC: Bilingual CTC (Submitted to ICASSP 2024)
Highlights: introduces both cross-modal and cross-lingual CTC for S2T tasks and develops a novel implementation strategy, Synchronous BiL-CTC, which outperforms the traditional progressive strategy (the implementation used in NAST)
Here is an example on the MuST-C ST dataset.
cd egs/mustc/st/
# Progressive BiL-CTC
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_bil_ctc_progressive
# Synchronous BiL-CTC
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_bil_ctc_synchronous
Acknowledgments
- Fairseq community for the base toolkit
- ESPnet community for the base toolkit
- NiuTrans Team for their contributions and research
Finally, thank you to everyone who has helped me during my research career. I sincerely hope that everyone can enjoy the pleasure of research.
Feedback
If you have any questions, feel free to contact xuchennlp[at]outlook.com.