xuchennlp / S2T

A project for speech translation

Speech-to-Text (S2T) toolkit

Overview

This repository is an extension of the Fairseq toolkit, specialized for speech-to-text (S2T) generation tasks. It provides comprehensive support for automatic speech recognition (ASR), machine translation (MT), and speech translation (ST).

Features

  • Complete recipes: Kaldi-style recipe support for ASR, MT, and ST tasks, ensuring a smooth workflow.
  • Various configurations: An extensive collection of YAML configuration files for customizing models to different tasks and scenarios (a hypothetical sketch follows this list).
  • Easy reproduction: Full support for the methods in our papers, including SATE, PDS, CTC-NAST, BiL-CTC, and more.
  • Multiple inference strategies: Greedy decoding, beam search, CTC decoding, CTC rescoring, and more.
  • More features can be found in the run.sh file.
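
A training configuration is a plain YAML file of training options, one per line. The snippet below is only a hypothetical sketch: the field names mirror common fairseq-train options and are assumptions here; the real files live in the recipe directories under egs/.

    # Hypothetical training config; field names are assumptions,
    # see the YAML files shipped with each recipe for the real ones.
    arch: s2t_transformer_s
    optimizer: adam
    lr: 2e-3
    lr-scheduler: inverse_sqrt
    warmup-updates: 10000
    criterion: label_smoothed_cross_entropy
    label-smoothing: 0.1
    max-tokens: 40000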

Installation

  1. Clone the repository:

    git clone https://github.com/xuchennlp/S2T.git
  2. Navigate to the project directory and install the required dependencies:

    cd S2T
    pip install -e .

    Our environment: Python 3.8, PyTorch 1.11.0.
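
If you want to match those versions before running pip install -e ., a minimal setup might look like this (using conda is an assumption; any Python 3.8 environment works):

    conda create -n s2t python=3.8
    conda activate s2t
    pip install torch==1.11.0
    # sanity check the environment before installing S2T
    python -c "import torch; print(torch.__version__)"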

Quick Start

  1. Download your dataset and process it into the format of the MUST-C dataset (see the layout sketch below).

  2. Run the shell script run.sh in the corresponding directory as follows:

    # Set the ST_DIR environment variable to the parent directory of the S2T directory
    export ST_DIR=/path/to/S2T/..
    cd egs/mustc/st/
    ./run.sh --stage 0 --stop_stage 2
  • Stage 0 performs data processing, including audio feature extraction (not required for MT), vocabulary generation, and generation of the training and testing files.
  • Stage 1 performs model training, with multiple training choices supported.
  • Stage 2 performs model inference, with multiple decoding strategies supported.
  • All details are available in run.sh.
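
As a reference for step 1, the public MUST-C release organizes each language pair roughly as follows; this layout is a sketch of the MUST-C distribution itself, not something this repository enforces verbatim:

    en-de/
      data/
        train/
          wav/             # audio files
          txt/
            train.yaml     # segment offsets and durations per audio file
            train.en       # English transcripts
            train.de       # German translations
        dev/
        tst-COMMON/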
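
Because the stages are selected with --stage and --stop_stage, each stage can also be run on its own:

    # data processing only
    ./run.sh --stage 0 --stop_stage 0
    # inference only, assuming training has already finished
    ./run.sh --stage 2 --stop_stage 2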

Reproduction of our methods

SATE: Stacked Acoustic and Textual Encoding (ACL 2021)

Paper: Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders

Highlights: a simple and effective method that utilizes pre-trained ASR and MT models to improve the end-to-end ST model; an adapter is introduced to bridge the pre-trained encoders.

Here is an example on the MUST-C ST dataset.

cd egs/mustc/st/
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_sate

PDS: Progressive Down-Sampling (ACL 2023 findings)

Paper: Bridging the Granularity Gap for Acoustic Modeling

Highlights: an effective method that facilitates convergence on S2T tasks by progressively increasing the modeling granularity of acoustic representations.

Here is an example on the MUST-C ST dataset. This method also supports the ASR task.

cd egs/mustc/st/
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_pds

NAST: Non-Autoregressive Speech Translation (ACL 2023)

Paper: CTC-based Non-autoregressive Speech Translation

Highlights: a non-autoregressive modeling method that relies only on CTC inference and achieves results comparable with autoregressive methods.

Here is an example on the MUST-C ST dataset.

cd egs/mustc/st/
# Non-autoregressive modeling
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_nast
# Autoregressive modeling
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_ctc_aug

BiL-CTC: Bilingual CTC (Submitted to ICASSP 2024)

Paper: Bridging the Gaps of Both Modality and Language: Synchronous Bilingual CTC for Speech Translation and Speech Recognition

Highlights: introduces both cross-modal and cross-lingual CTC for S2T tasks and develops a novel training strategy, Synchronous BiL-CTC, which outperforms the traditional progressive strategy (the implementation used in NAST).

Here is an example on the MUST-C ST dataset.

cd egs/mustc/st/
# Progressive BiL-CTC
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_bil_ctc_progressive
# Synchronous BiL-CTC
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_bil_ctc_synchronous

Acknowledgments

  • The Fairseq and ESPnet communities for the base toolkits
  • NiuTrans Team for their contributions and research

Finally, thank you to everyone who has helped me during my research career. I sincerely hope that everyone can enjoy the pleasure of research.

Feedback

If you have any questions, feel free to contact xuchennlp[at]outlook.com.

License

MIT License