wer_are_we

WER are we? An attempt at tracking states of the art(s) and recent results on speech recognition. Feel free to correct! (Inspired by Are we there yet?)

To be updated with Interspeech 2015...

WER

LibriSpeech

(Possibly trained on more data than LibriSpeech.)

WER test-clean	WER test-other	Paper	Notes
5.51%	13.97%	LibriSpeech: an ASR Corpus Based on Public Domain Audio Books	HMM-DNN + pNorm*
8.01%	22.49%	same, Kaldi	HMM-(SAT)GMM
	12.51%	Audio Augmentation for Speech Recognition	TDNN + pNorm + speed up/down speech

WSJ

(Possibly trained on more data than WSJ.)

WER eval'92	WER eval'93	Paper	Notes
3.63%	5.66%	LibriSpeech: an ASR Corpus Based on Public Domain Audio Books	test-set on open vocabulary (i.e. harder), model = HMM-DNN + pNorm*
5.6%		Convolutional Neural Networks-based Continuous Speech Recognition using Raw Speech Signal	CNN over RAW speech (wav)

Switchboard Hub5'00

(Possibly trained on more data than SWB, but test set = full Hub5'00.)

WER (SWB)	WER (full=SWB+CH)	Paper	Notes
12.6%	16%	Deep Speech: Scaling up end-to-end speech recognition	CNN + Bi-RNN + CTC (speech to letters), 25.9% WER if trained only on SWB
12.6%	18.4%	Sequence-discriminative training of deep neural networks	HMM-DNN + sMBR
12.9%	19.3%	Audio Augmentation for Speech Recognition	TDNN + pNorm + speed up/down speech
15%	19.1%	Building DNN Acoustic Models for Large Vocabulary Speech Recognition	DNN + Dropout
10.4%		Joint Training of Convolutional and Non-Convolutional Neural Networks	CNN on MFSC/fbanks + 1 non-conv layer for FMLLR/I-Vectors concatenated in a DNN
11.5%		Deep Convolutional Neural Networks for LVCSR	CNN

PER

TIMIT

(So far, all results trained on TIMIT and tested on the standard test set.)

PER	Paper	Notes
16.7%	Combining Time- and Frequency-Domain Convolution in Convolutional Neural Network-Based Phone Recognition	CNN in time and frequency + dropout, 17.6% w/o dropout
17.6%	Attention-Based Models for Speech Recognition	Bi-RNN + Attention
17.7%	Speech Recognition with Deep Recurrent Neural Networks	Bi-LSTM + skip connections w/ CTC
23%	Deep Belief Networks for Phone Recognition	(first, modern) HMM-DBN

LM

TODO

Noise-robust ASR

TODO

BigCorp™®-specific dataset

TODO?

Lexicon

WER: word error rate
PER: phone error rate
LM: language model
HMM: hidden Markov model
GMM: Gaussian mixture model
DNN: deep neural network
CNN: convolutional neural network
DBN: deep belief network (RBM-based DNN)
RNN: recurrent neural network
LSTM: long short-term memory
CTC: connectionist temporal classification
MMI: maximum mutual information
MPE: minimum phone error
sMBR: state-level minimum Bayes risk
SAT: speaker adaptive training
MLLR: maximum likelihood linear regression
LDA: (in this context) linear discriminant analysis
MFCC: Mel frequency cepstral coefficients
FB/FBANKS/MFSC: Mel filter bank features, a.k.a. Mel frequency spectral coefficients
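
For readers unsure how the WER and PER figures above are usually obtained, here is a minimal sketch, assuming the common definition: the edit distance (substitutions + deletions + insertions) between reference and hypothesis, divided by the number of reference tokens. The function names (`edit_distance`, `error_rate`, `wer`, `corpus_wer`) are illustrative and not taken from any of the listed papers or toolkits; real scoring tools also apply text normalization that is not shown here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (ref token missing in hyp)
                curr[j - 1] + 1,     # insertion (extra token in hyp)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Errors divided by reference length; works for words (WER) or phones (PER)."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref, hyp):
    """WER for one utterance, assuming whitespace-tokenized text."""
    return error_rate(ref.split(), hyp.split())


def corpus_wer(pairs):
    """Corpus-level WER: total errors over total reference words.

    `pairs` is an iterable of (reference, hypothesis) strings.
    """
    errors = 0
    words = 0
    for ref, hyp in pairs:
        ref_tokens = ref.split()
        errors += edit_distance(ref_tokens, hyp.split())
        words += len(ref_tokens)
    if words == 0:
        raise ValueError("corpus must contain at least one reference word")
    return errors / words
```

For example, `wer("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words. For PER, pass phone sequences to `error_rate` directly. Note that the rate can exceed 100% when the hypothesis contains many insertions.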
