NeuralSP: Neural network based Speech Processing
Connectionist Temporal Classification (CTC)
- Beam search
- Shallow fusion [link]
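CTC decoding collapses consecutive repeated labels and removes blanks; beam search keeps multiple label prefixes instead of the single best path, and shallow fusion adds a weighted LM log-probability to each hypothesis score. A minimal best-path (greedy) sketch of the collapse rule, assuming `blank=0` as the blank index:

```python
def ctc_greedy_decode(frame_logits, blank=0):
    """Best-path CTC decoding: take the argmax label per frame,
    collapse consecutive repeats, then drop blank labels."""
    best_path = [max(range(len(frame)), key=frame.__getitem__)
                 for frame in frame_logits]
    decoded, prev = [], blank
    for label in best_path:
        if label != blank and label != prev:
            decoded.append(label)
        prev = label
    return decoded

# Example: per-frame scores over {0: blank, 1, 2};
# frame argmaxes [1, 1, 0, 2] collapse to [1, 2].
ctc_greedy_decode([[0.1, 0.8, 0.1], [0.1, 0.8, 0.1],
                   [0.8, 0.1, 0.1], [0.1, 0.1, 0.8]])  # -> [1, 2]
```

A prefix beam search additionally tracks blank- and non-blank-ending probabilities per prefix so that different alignments of the same label sequence are merged.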
Attention-based sequence-to-sequence
- CNN encoder
- (bidirectional/unidirectional) LSTM encoder
- CNN+(bidirectional/unidirectional) LSTM encoder
- self-attention (Transformer) encoder [link]
- Time-Depth Separable (TDS) convolutional encoder [link] (NEW!)
- RNN decoder
  - Beam search
  - Shallow fusion [link]
  - Cold fusion [link]
  - Deep fusion [link]
  - Forward-backward attention decoding [link]
- Transformer decoder
Attention mechanism
- RNN decoder
  - location [link]
  - additive [link]
  - dot-product
  - Luong's dot/general/concat [link]
  - Multi-headed dot-product [link]
- Transformer decoder
  - Multi-headed dot-product [link]
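The dot-product variant is the core of the multi-headed attention listed above: each decoder query scores the encoder states by inner product, a softmax turns the scores into weights, and the weighted sum of values gives the context vector. A minimal pure-Python sketch for a single query (scaling by sqrt(d_k) as in the Transformer):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot_product_attention(query, keys, values, scale=True):
    """One query vector attends over lists of key/value vectors.
    Returns the context vector and the attention weights."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    if scale:  # scaled dot-product, as in the Transformer
        scores = [s / math.sqrt(d_k) for s in scores]
    weights = softmax(scores)
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return context, weights
```

Additive attention replaces the inner product with a small feed-forward score, and location-aware attention additionally conditions on the previous attention weights.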
Language model (LM)
- RNNLM (recurrent neural network language model)
- Gated convolutional LM [link]
Output units
- phoneme (TIMIT, Switchboard)
- grapheme
- wordpiece (BPE)
- word
- word-char mix
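Wordpiece/BPE units are learned by repeatedly merging the corpus's most frequent adjacent symbol pair into a new symbol. A minimal sketch of one merge step (the space-separated-symbols corpus format is an assumption for illustration, not the toolkit's data format):

```python
from collections import Counter

def most_frequent_pair(corpus):
    """corpus maps space-separated symbol sequences to frequencies."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)

def apply_merge(corpus, pair):
    """Rewrite every occurrence of the pair as one merged symbol."""
    new_corpus = {}
    for word, freq in corpus.items():
        symbols, merged, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_corpus[" ".join(merged)] = freq
    return new_corpus

# ('l', 'o') occurs 7 times, so it is merged first:
corpus = {"l o w": 5, "l o n g": 2}
apply_merge(corpus, most_frequent_pair(corpus))
# -> {"lo w": 5, "lo n g": 2}
```

Running this loop for N iterations yields a vocabulary of roughly N merge symbols plus the base characters; a "BPE1k" model uses about 1k such units.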
Multi-task learning (MTL)
MTL with different units is supported to alleviate data sparseness.
- Hybrid CTC/attention [link]
- Hierarchical Attention (e.g., word attention + character CTC) [link]
- Hierarchical CTC (e.g., word CTC + character CTC) [link]
- Hierarchical CTC+Attention (e.g., word attention + character CTC) [link]
- Forward-backward attention [link]
- RNNLM objective [link]
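Hybrid CTC/attention trains with a linear interpolation of the two objectives, L = lambda * L_ctc + (1 - lambda) * L_att, so the monotonic CTC alignment regularizes the attention decoder. A sketch (the 0.3 default weight is an assumed typical value, not the toolkit's setting):

```python
def hybrid_loss(loss_ctc, loss_att, ctc_weight=0.3):
    """Interpolated objective for hybrid CTC/attention training.
    ctc_weight (lambda) trades the monotonic-alignment CTC loss
    against the attention cross-entropy loss."""
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att
```

The same interpolation can be applied to hypothesis scores at decoding time, and the hierarchical variants above apply it across different output units (e.g., word-level attention plus character-level CTC).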
Performance (word error rate)
WSJ
| model | test_dev93 | test_eval92 |
| --- | --- | --- |
| Char attn | N/A | N/A |
| BPE1k attn | N/A | N/A |
CSJ
| model | eval1 | eval2 | eval3 |
| --- | --- | --- | --- |
| Char attn | N/A | N/A | N/A |
| + RNNLM | N/A | N/A | N/A |
| BPE30k attn | 8.8 | 6.3 | 6.9 |
| + RNNLM | 8.2 | 6.0 | 6.7 |
Switchboard
| model | SWB | CH |
| --- | --- | --- |
| Char attn | N/A | N/A |
| BPE10k attn | N/A | N/A |
| Word10k attn | N/A | N/A |
LibriSpeech
| model | dev-clean | dev-other | test-clean | test-other |
| --- | --- | --- | --- | --- |
| Char attn | N/A | N/A | N/A | N/A |
| BPE30k attn | N/A | N/A | N/A | N/A |
| Word30k attn | N/A | N/A | N/A | N/A |