audio-recognition machine-translation speech-to-text

Joey S2T

JoeyS2T is a JoeyNMT extension for Speech-to-Text tasks such as Automatic Speech Recognition (ASR) and end-to-end Speech Translation (ST). It inherits the core philosophy of JoeyNMT, a minimalist novice-friendly toolkit built on PyTorch, seeking simplicity and accessibility.

What's new

Upgraded to JoeyNMT v2.3.
Our paper has been accepted at EMNLP 2022 System Demo Track!

Features

JoeyS2T implements the following features:

Transformer Encoder-Decoder
1d-Conv Subsampling
Cross-entropy and CTC joint objective
Mel filterbank spectrogram extraction
CMVN, SpecAugment
WER evaluation

Furthermore, all the functionalities in JoeyNMT v2 are also available from JoeyS2T:

BLEU and ChrF evaluation
BPE tokenization (with BPE dropout option)
Beam search and greedy decoding (with repetition penalty, ngram blocker)
Customizable initialization
Attention visualization
Learning curve plotting
Scoring hypotheses and references
Multilingual translation with language tags

Installation

JoeyS2T is built on PyTorch. Please make sure you have a compatible environment. We tested JoeyS2T v2.3 with

python 3.11
torch 2.1.2
torchaudio 2.1.2
cuda 12.1

Clone this repository and install via pip:

$ git clone https://github.com/may-/joeys2t.git
$ cd joeys2t
$ python -m pip install -e .
$ python -m unittest

📝 Note You may need to install extra dependencies (torchaudio backends): ffmpeg, sox, soundfile, etc. See torchaudio installation instructions.

Documentation & Tutorials

Please check the JoeyNMT's documentation first, if you are not familiar with JoeyNMT yet.

For details, follow the tutorials in notebooks dir.

Benchmarks & Pretrained models

We provide benchmarks and pretraind models for Speech Recognition (ASR) and Speech Translation (ST) with JoeyS2T.

The models are also available via Torch Hub!

import torch

model = torch.hub.load('may-/joeys2t', 'mustc_v2_ende_st')
translations = model.generate(['test.wav'])
print(translations[0])
# 'Hallo, Welt!'

⚠️ Warning The 1d-conv layer may raise an error for too short audio inputs. (We cannot convolve the frames shorter than the kernel size!)

Reference

If you use JoeyS2T in a publication or thesis, please cite the following paper:

@inproceedings{ohta-etal-2022-joeys2t,
    title = "{JoeyS2T}: Minimalistic Speech-to-Text Modeling with {JoeyNMT}",
    author = "Ohta, Mayumi  and
      Kreutzer, Julia  and
      Riezler, Stefan",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-demos.6",
    pages = "50--59",
}

Contact

Please leave an issue if you have found a bug in the code.

For general questions, email me at ohta <at> cl.uni-heidelberg.de.

About

Minimalist Speech-to-Text toolkit for educational purposes

http://joeys2t.readthedocs.io/

audio-recognition machine-translation speech-to-text

Apache License 2.0

Languages

Language:Python 100.0%