wer_are_we

WER are we? An attempt at tracking states of the art(s) and recent results on speech recognition. Feel free to correct! (Inspired by Are we there yet?)

To be updated with Interspeech 2015...

WER

LibriSpeech

(Possibly trained on more data than LibriSpeech.)

| WER test-clean | WER test-other | Paper | Notes |
| :------------- | :------------- | :---- | :---- |
| 5.83% | 12.69% | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | Humans |
| 4.28% | | Purely sequence-trained neural networks for ASR based on lattice-free MMI | HMM-TDNN trained with MMI + data augmentation (speed) + iVectors + 3 regularizations |
| 4.83% | | A time delay neural network architecture for efficient modeling of long temporal contexts | HMM-TDNN + iVectors |
| 5.33% | 13.25% | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | 9-layer model w/ 2 layers of 2D-invariant convolution & 7 recurrent layers, w/ 68M parameters, trained on 11,940h |
| 5.51% | 13.97% | LibriSpeech: an ASR Corpus Based on Public Domain Audio Books | HMM-DNN + pNorm* |
| 8.01% | 22.49% | same, Kaldi | HMM-(SAT)GMM |
| | 12.51% | Audio Augmentation for Speech Recognition | TDNN + pNorm + speed up/down speech |

WSJ

(Possibly trained on more data than WSJ.)

| WER eval'92 | WER eval'93 | Paper | Notes |
| :---------- | :---------- | :---- | :---- |
| 3.47% | | Deep Recurrent Neural Networks for Acoustic Modelling | TC-DNN-BLSTM-DNN |
| 5.03% | 8.08% | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | Humans |
| 3.63% | 5.66% | LibriSpeech: an ASR Corpus Based on Public Domain Audio Books | test set with open vocabulary (i.e. harder); model = HMM-DNN + pNorm* |
| 3.60% | 4.98% | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | 9-layer model w/ 2 layers of 2D-invariant convolution & 7 recurrent layers, w/ 68M parameters |
| 5.6% | | Convolutional Neural Networks-based Continuous Speech Recognition using Raw Speech Signal | CNN over raw speech (wav) |

Switchboard Hub5'00

(Possibly trained on more data than SWB, but the test set is the full Hub5'00, which comprises Switchboard (SWB) and CallHome (CH) subsets.)

| WER (SWB) | WER (full = SWB+CH) | Paper | Notes |
| :-------- | :------------------ | :---- | :---- |
| 6.3% | 11.9% | The Microsoft 2016 Conversational Speech Recognition System | VGG/ResNet/LACE/BLSTM acoustic model trained on SWB+Fisher+CH; N-gram + RNNLM language model trained on Switchboard+Fisher+Gigaword+Broadcast |
| 6.6% | 12.2% | The IBM 2016 English Conversational Telephone Speech Recognition System | RNN + VGG + LSTM acoustic model trained on SWB+Fisher+CH; N-gram + "model M" + NNLM language model |
| 8.5% | 13% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | HMM-BLSTM trained with MMI + data augmentation (speed) + iVectors + 3 regularizations + Fisher |
| 9.2% | 13.3% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | HMM-TDNN trained with MMI + data augmentation (speed) + iVectors + 3 regularizations + Fisher (10% / 15.1% respectively when trained on SWBD only) |
| 12.6% | 16% | Deep Speech: Scaling up end-to-end speech recognition | CNN + Bi-RNN + CTC (speech to letters); 25.9% WER if trained only on SWB |
| 11% | 17.1% | A time delay neural network architecture for efficient modeling of long temporal contexts | HMM-TDNN + iVectors |
| 12.6% | 18.4% | Sequence-discriminative training of deep neural networks | HMM-DNN + sMBR |
| 12.9% | 19.3% | Audio Augmentation for Speech Recognition | HMM-TDNN + pNorm + speed up/down speech |
| 15% | 19.1% | Building DNN Acoustic Models for Large Vocabulary Speech Recognition | DNN + Dropout |
| 10.4% | | Joint Training of Convolutional and Non-Convolutional Neural Networks | CNN on MFSC/fbanks + 1 non-conv layer for FMLLR/i-vectors, concatenated in a DNN |
| 11.5% | | Deep Convolutional Neural Networks for LVCSR | CNN |
| 12.2% | | Very Deep Multilingual Convolutional Neural Networks for LVCSR | Deep CNN (10 conv + 4 FC layers), multi-scale feature maps |

Fisher

(Test set: the Fisher (FSH) portion of RT03S.)

| WER | Paper | Notes |
| :-- | :---- | :---- |
| 9.6% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | HMM-BLSTM trained with MMI + data augmentation (speed) + iVectors + 3 regularizations + SWBD |
| 9.8% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | HMM-TDNN trained with MMI + data augmentation (speed) + iVectors + 3 regularizations + SWBD |

CHiME (noisy speech)

| WER (clean) | WER (real) | WER (sim) | Paper | Notes |
| :---------- | :--------- | :-------- | :---- | :---- |
| 3.34% | 21.79% | 45.05% | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | 9-layer model w/ 2 layers of 2D-invariant convolution & 7 recurrent layers, w/ 68M parameters |
| 6.30% | 67.94% | 80.27% | Deep Speech: Scaling up end-to-end speech recognition | CNN + Bi-RNN + CTC (speech to letters) |

TODO

PER

TIMIT

(So far, all results trained on TIMIT and tested on the standard test set.)

| PER | Paper | Notes |
| :-- | :---- | :---- |
| 16.5% | Phone recognition with hierarchical convolutional deep maxout networks | Hierarchical maxout CNN + Dropout |
| 16.7% | Combining Time- and Frequency-Domain Convolution in Convolutional Neural Network-Based Phone Recognition | CNN in time and frequency + dropout; 17.6% w/o dropout |
| 17.3% | Segmental Recurrent Neural Networks for End-to-end Speech Recognition | RNN-CRF on 24(x3) MFSC |
| 17.6% | Attention-Based Models for Speech Recognition | Bi-RNN + Attention |
| 17.7% | Speech Recognition with Deep Recurrent Neural Networks | Bi-LSTM + skip connections w/ CTC |
| 23% | Deep Belief Networks for Phone Recognition | (first, modern) HMM-DBN |

LM

TODO

Noise-robust ASR

TODO

BigCorp™®-specific dataset

TODO?

Lexicon

  • WER: word error rate (see the sketch after this list)
  • PER: phone error rate (the same computation over phone sequences)
  • LM: language model
  • HMM: hidden Markov model
  • GMM: Gaussian mixture model
  • DNN: deep neural network
  • CNN: convolutional neural network
  • DBN: deep belief network (RBM-based DNN)
  • RNN: recurrent neural network
  • LSTM: long short-term memory
  • CTC: connectionist temporal classification
  • MMI: maximum mutual information
  • MPE: minimum phone error
  • sMBR: state-level minimum Bayes risk
  • SAT: speaker adaptive training
  • MLLR: maximum likelihood linear regression
  • LDA: (in this context) linear discriminant analysis
  • MFCC: Mel frequency cepstral coefficients
  • FB/FBANKS/MFSC: Mel frequency spectral coefficients
  • VGG: very deep convolutional neural networks from the Visual Geometry Group; the VGG architecture stacks blocks of two 3x3 convolutions followed by one pooling layer, repeated
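
For concreteness, WER is the word-level Levenshtein (edit) distance between the recognizer's hypothesis and the reference transcript, divided by the number of reference words; PER is the same computation over phone sequences. A minimal Python sketch (our own illustration, not taken from any of the papers above):

```python
# WER = (substitutions + deletions + insertions) / number of reference words.
# PER is identical, applied to phone sequences instead of word sequences.

def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (single-row DP)."""
    d = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev_diag, d[j] = d[j], min(
                d[j] + 1,              # deletion (reference word dropped)
                d[j - 1] + 1,          # insertion (extra hypothesis word)
                prev_diag + (r != h),  # substitution, free if tokens match
            )
    return d[len(hyp)]

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

# One deleted word out of six reference words: WER = 1/6 ≈ 16.7%
print(wer("the cat sat on the mat", "the cat sat on mat"))
```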
