Speech Recognition Learning Resources

This repo collects learning resources for speech recognition, including courses, books, papers, tutorials and toolkits. The list is updated continuously.

Table of contents

Courses
Books
Papers
Tutorials
Toolkits

Courses

(Recommended) Automatic Speech Recognition (ASR) 2018-2019 Lectures, School of Informatics, University of Edinburgh [Website]
Speech recognition, EECS E6870 - Spring 2016, Columbia University [Website]
CS224N: Natural Language Processing with Deep Learning, Stanford [Website] [Video(Winter 2021)] [Video(Winter 2017)]
CS224S: Spoken Language Processing (Winter 2021), Stanford [Website]
DLHLP: Deep Learning for Human Language Processing, Spring 2020, Hung-yi Lee [Website] [Video(Spring 2020)]
Microsoft DEV287x: Speech Recognition Systems, 2019 [Website]
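Speech Recognition: From Beginner to Expert (语音识别从入门到精通), 2019, Lei Xie (NOT FREE) [Website]
Introduction to Digital Speech Processing (數位語音處理概論), National ** University, Lin-shan Lee [Website]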

Books

Fundamentals of Speech Recognition, Lawrence Rabiner, Biing-Hwang Juang, 1993 [Book]
Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, 2001 [Book]
Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Daniel Jurafsky & James H. Martin [Website] [Book 3rd Ed]
Automatic Speech Recognition: A Deep Learning Approach, Dong Yu and Li Deng, Springer, 2014 [Book]
Foundations of Statistical Natural Language Processing, Chris Manning and Hinrich Schütze, 1999 [Website] [Book]
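Analyzing Deep Learning: Speech Recognition in Practice (《解析深度学习：语音识别实践》), Dong Yu and Li Deng, Publishing House of Electronics Industry
Kaldi Speech Recognition in Action (《Kaldi 语音识别实战》), Guoguo Chen, Publishing House of Electronics Industry
Speech Recognition: Principles and Applications (《语音识别：原理与应用》), Qingyang Hong, Publishing House of Electronics Industry
The Basic Law of Speech Recognition (《语音识别基本法》), Zhiyuan Tang, Publishing House of Electronics Industry
Statistical Learning Methods (《统计学习方法》), Hang Li, Tsinghua University Press
Speech Signal Processing (《语音信号处理》), Jiqing Han, Tsinghua University Press
Speech Signal Processing (《语音信号处理》), Li Zhao, China Machine Press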

Papers

HMM: Rabiner L R. A tutorial on hidden Markov models and selected applications in speech recognition[J]. Proceedings of the IEEE, 1989, 77(2): 257-286. [Paper]
EM: Bilmes J A. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models[J]. International Computer Science Institute, 1998, 4(510): 126. [Paper]
CTC: Graves A, Fernández S, Gomez F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks[C]//Proceedings of the 23rd international conference on Machine learning. 2006: 369-376. [Paper]

Tutorials

WFST
- An Introduction to Weighted Automata in Machine Learning, Awni Hannun, 2021. [PDF]
k2
- Speech Recognition with Next-Generation Kaldi (K2, Lhotse, Icefall), Interspeech 2021. [Video]
- Progress in ASR with Next-Gen Kaldi, BAAI 2022. [Video] [Slides]
- Speech Recognition with Icefall + Lhotse, Interspeech 2023. [Slides]

Toolkits

Listed in no particular order.

kaldi [Github] [Doc]
next-gen Kaldi [Github]
- k2: FSA/FST algorithms, differentiable, with PyTorch compatibility. [Github] [Doc]
- icefall: Speech recognition recipes using k2. [Github] [Doc]
- sherpa: Streaming and non-streaming ASR server for next-gen Kaldi. [Github] [Doc]
- sherpa-onnx: Real-time speech recognition using next-gen Kaldi with onnxruntime, without an Internet connection. Supports embedded systems, Android, iOS, Raspberry Pi, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go. [Github] [Doc]
- sherpa-ncnn: Real-time speech recognition using next-gen Kaldi with ncnn, without an Internet connection. Supports iOS, Android, Raspberry Pi, VisionFive2, etc. [Github] [Doc]
- lhotse: Tools for handling speech data in machine learning projects. [Github] [Doc]
- ~~snowfall(deprecated)~~ [Github]
FunASR [Github] [Doc]
- A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models.
- Gao, Z., Li, Z., Wang, J., Luo, H., Shi, X., Chen, M., ... & Zhang, S. (2023). FunASR: A Fundamental End-to-End Speech Recognition Toolkit. arXiv preprint arXiv:2305.11013.
espnet/espnet2 [Github]
- Watanabe S, Hori T, Karita S, et al. Espnet: End-to-end speech processing toolkit[J]. arXiv preprint arXiv:1804.00015, 2018.
wenet [Github]
- Yao Z, Wu D, Wang X, et al. Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit[J]. arXiv preprint arXiv:2102.01547, 2021.
- Zhang B, Wu D, Yao Z, et al. Unified streaming and non-streaming two-pass end-to-end model for speech recognition[J]. arXiv preprint arXiv:2012.05481, 2020.
- Wu D, Zhang B, Yang C, et al. U2++: Unified two-pass bidirectional end-to-end model for speech recognition[J]. arXiv preprint arXiv:2106.05642, 2021.
NeMo [Github] [Doc]
- NVIDIA NeMo Framework is a generative AI framework built for researchers and PyTorch developers working on large language models (LLMs), multimodal models (MM), automatic speech recognition (ASR), and text-to-speech synthesis (TTS).
Fairseq [Github] [Doc]
- Fairseq is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks.
speechbrain [Github] [Doc]
- SpeechBrain is an open-source and all-in-one conversational AI toolkit based on PyTorch.
paddlespeech [Github] [Doc]
- PaddleSpeech is an open-source toolkit on the PaddlePaddle platform for a variety of critical tasks in speech and audio, with state-of-the-art and influential models.
eesen R.I.P. [Github]
- Miao Y, Gowayyed M, Metze F. EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding[C]//2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015: 167-174.
warp_ctc [Github]
- A fast parallel implementation of CTC, on both CPU and GPU.
htk
sphinx
