dbstj1231 / 2023_AI_Academy_ASR

Notebooks for "AI 응용-음성" (AI Applications: Speech)



Course Description

0. Tutorial

Tutorials for Python and the core data-science packages; a short interop sketch follows the list.

  • Python review
  • numpy
  • matplotlib
  • PyTorch Tensor
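
As a taste of how these packages fit together, here is a minimal sketch (not from the notebooks) of the NumPy/PyTorch tensor interop the tutorials build toward:

    import numpy as np
    import torch

    # A tensor created via from_numpy shares memory with the source array,
    # so an in-place edit on one is visible through the other.
    x = np.linspace(0.0, 1.0, 5)
    t = torch.from_numpy(x)
    x[0] = 42.0
    print(t[0])  # tensor(42., dtype=torch.float64)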

1. Audio file handling

Audio file handling using torchaudio; a minimal sketch follows the list.

  • Loading audio files (torchaudio.load)
  • Feature extraction (Mel spectrogram, MFCC)
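
A minimal sketch of what the notebook covers; the filename and transform parameters are placeholders, not the notebook's exact values:

    import torchaudio

    # Load the waveform as a (channels, samples) tensor plus its sample rate
    waveform, sample_rate = torchaudio.load("speech.wav")

    # Feature extraction: an 80-bin Mel spectrogram and 13 MFCCs
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=80)(waveform)
    mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=13)(waveform)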

3. Audio Classification using MLP

Audio MNIST classification using an MLP (torch.nn.Linear)
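
A hedged sketch of the model shape (the feature dimensions are assumptions, not the notebook's): Audio MNIST has ten spoken-digit classes, so the MLP maps flattened audio features to ten logits.

    import torch.nn as nn

    # Flattened features -> hidden layer -> 10 digit classes.
    # 40 x 81 stands in for an assumed (n_mfcc, frames) feature shape.
    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(40 * 81, 256),
        nn.ReLU(),
        nn.Linear(256, 10),
    )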

4. CTC

A simple exercise on Connectionist Temporal Classification (training a model with CTC loss)
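
CTC scores all alignments between an unsegmented input of length T and a shorter label sequence, using an extra blank symbol. A minimal sketch of torch.nn.CTCLoss on random data (all shapes are illustrative):

    import torch
    import torch.nn as nn

    T, N, C, S = 50, 4, 20, 10   # input frames, batch, classes (incl. blank), target length
    ctc_loss = nn.CTCLoss(blank=0)

    log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)  # (T, N, C)
    targets = torch.randint(1, C, (N, S), dtype=torch.long)  # labels, blank (0) excluded
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), S, dtype=torch.long)

    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    loss.backward()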

5. Whisper

Exercise using OpenAI Whisper and Gradio
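
A minimal sketch of the kind of demo this exercise builds, assuming the openai-whisper and gradio packages are installed (checkpoint choice is an assumption):

    import gradio as gr
    import whisper

    model = whisper.load_model("base")  # small pretrained checkpoint

    def transcribe(audio_path):
        # Gradio passes a path to the recorded or uploaded audio file
        return model.transcribe(audio_path)["text"]

    gr.Interface(fn=transcribe,
                 inputs=gr.Audio(type="filepath"),
                 outputs="text").launch()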

6. E2E ASR model fine-tuning with NeMo

Fine-tuning a QuartzNet model with NeMo (English to Korean)
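
A hedged sketch of the fine-tuning setup only (the checkpoint name is from NVIDIA's public catalog; the Korean label set and manifest path are placeholders, and a real run also needs a PyTorch Lightning trainer):

    import nemo.collections.asr as nemo_asr
    from omegaconf import DictConfig

    # Start from a pretrained English QuartzNet CTC model
    model = nemo_asr.models.EncDecCTCModel.from_pretrained("QuartzNet15x5Base-En")

    # Retarget the output layer to a Korean label set (placeholder labels)
    korean_labels = [" ", "가", "나", "다"]
    model.change_vocabulary(new_vocabulary=korean_labels)

    # Point the model at a Korean training manifest (values are assumptions)
    model.setup_training_data(DictConfig({
        "manifest_filepath": "train_manifest.json",
        "sample_rate": 16000,
        "labels": korean_labels,
        "batch_size": 16,
    }))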

7. WFST

Exercises on WFSTs using k2; a toy sketch follows the list.

  • C, L, G transducers
  • composition, determinization
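
A toy sketch of the two operations, not the course's actual C, L, G construction: build a small nondeterministic acceptor in k2, intersect it with itself (composition specialized to acceptors), and determinize the result.

    import k2

    # Arcs are "src dst label score"; final arcs use label -1,
    # and the last line gives the final state. Two label-1 arcs
    # leave state 0, so the acceptor is nondeterministic.
    s = '''
    0 1 1 0.1
    0 2 1 0.3
    1 3 -1 0.0
    2 3 -1 0.0
    3
    '''
    a = k2.arc_sort(k2.Fsa.from_str(s))
    b = k2.arc_sort(k2.Fsa.from_str(s))

    c = k2.intersect(a, b)              # composition of acceptors = intersection
    d = k2.determinize(k2.connect(c))   # merge paths with equal label sequences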

8. E2E ASR model fine-tuning with Hugging Face

Fine-tuning a Wav2Vec 2.0 model with Hugging Face (English to Korean)

Fine-tuning a Whisper model with Hugging Face (English to Korean)
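
A hedged sketch for the Whisper variant using the Transformers API (the checkpoint name is one public option, not necessarily the course's): load a pretrained multilingual checkpoint and force the decoder toward Korean transcription before fine-tuning.

    from transformers import WhisperProcessor, WhisperForConditionalGeneration

    processor = WhisperProcessor.from_pretrained("openai/whisper-small")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

    # Force Korean transcription tokens during generation
    model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(
        language="korean", task="transcribe")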


Course Materials

Chapter 1

  1. Huang, X. D., Acero, A., & Hon, H. W. (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development - Chapter 9. Pearson Education. (https://dl.acm.org/doi/book/10.5555/560905#cited-by-sec)
  2. Kamath, U., Liu, J., & Whitaker, J. (2019). Deep Learning for NLP and Speech Recognition. Springer. (https://www.amazon.com/Deep-Learning-NLP-Speech-Recognition/dp/3030145980)

Chapter 4

  1. Introduction to Deep Learning: MIT 6.S191. (2023). Lecture 2 – Deep Sequence Modeling. (http://introtodeeplearning.com/slides/6S191_MIT_DeepLearning_L2.pdf)

Chapter 5: Sequence-to-Sequence with Attention

  1. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. arXiv preprint arXiv:1409.3215. (https://arxiv.org/abs/1409.3215)
  2. Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv preprint arXiv:1406.1078. (https://arxiv.org/abs/1406.1078)
  3. Stanford University. (2022). CS224N: Natural Language Processing with Deep Learning: Lecture 7 – Machine Translation, Sequence-to-Sequence and Attention. (https://web.stanford.edu/class/cs224n/slides/cs224n-2022-lecture07-nmt.pdf)
  4. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473. (https://arxiv.org/abs/1409.0473)

Chapter 6

  1. Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML'06: Proceedings of the 23rd International Conference on Machine Learning. (https://www.cs.toronto.edu/~graves/icml_2006.pdf)
  2. Olah, C., & Carter, S. (2017). Distill: Sequence Modeling With CTC. (https://distill.pub/2017/ctc/)
  3. Bluche, T. (N.D.). The intriguing blank label in CTC. [Blog post]. (https://www.tbluche.com/ctc_and_blank.html)

Chapter 7: Transformer

  1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762. (https://arxiv.org/abs/1706.03762)
  2. Olah, C. (2015). Understanding LSTM Networks. [Blog post]. (https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
  3. Stanford University. (2022). CS224N: Natural Language Processing with Deep Learning: Lecture 9 – Transformers. (https://web.stanford.edu/class/cs224n/slides/cs224n-2022-lecture09-transformers.pdf)
  4. Jurafsky, D., & Martin, J. H. (2022). Speech and Language Processing: Chapter 9 – RNNs and LSTMs. (https://web.stanford.edu/~jurafsky/slp3/)
  5. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR. (https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html)
  6. Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv preprint arXiv:1607.06450. (https://arxiv.org/abs/1607.06450)
  7. Alammar, J. (N.D.). The Illustrated Transformer. [Blog post]. (https://jalammar.github.io/illustrated-transformer/)
  8. Bloem, P. (N.D.). TRANSFORMERS FROM SCRATCH. [Blog post]. (https://peterbloem.nl/blog/transformers)
  9. Gulati, A., Qin, J., Chiu, C. C., et al. (2020). Conformer: Convolution-augmented Transformer for Speech Recognition. arXiv preprint arXiv:2005.08100. (https://arxiv.org/abs/2005.08100)

Chapter 8

  1. Hinton, G., Deng, L., Yu, D., et al. (2012). Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine. (https://www.cs.toronto.edu/~hinton/absps/DNN-2012-proof.pdf)
  2. Young, S. J., & Woodland, P. C. (1994). Tree-Based State Tying for High Accuracy Acoustic Modelling. Proc. ARPA Spoken Language Systems Technology Workshop. (https://aclanthology.org/H94-1062.pdf)
  3. Barsky, M. (N.D.). Vancouver Island University: Data Mining Lab 3 – Classifiers: Toy Example of a Decision Tree. (http://csci.viu.ca/~barskym/teaching/DM_LABS/LAB_3/Lab3_decisiontreeexample.pdf)
  4. Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy Layer-Wise Training of Deep Networks. Advances in Neural Information Processing Systems 19. (https://proceedings.neurips.cc/paper/2006/file/5da713a690c067105aeb2fae32403405-Paper.pdf)
  5. Graves, A., Mohamed, A. R., & Hinton, G. (2013). Speech Recognition with Deep Recurrent Neural Networks. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. (http://www.cs.toronto.edu/~hinton/absps/RNN13.pdf)

Chapter 9

  1. American National Corpus (ANC). (N.D.). ANC Second Release frequency data. (https://anc.org/data/anc-second-release/frequency-data/)
  2. Sekine, S. (2010). On-Demand Information Extraction and Linguistic Knowledge Acquisition. New York University. (https://nlp.cs.nyu.edu/sekine/papers/10spring.pdf)
  3. Gillick, L., & Cox, S. J. (1991). Some statistical issues in the comparison of speech recognition algorithms. ICASSP-91: International Conference on Acoustics, Speech, and Signal Processing. (https://www.researchgate.net/publication/2360210_Comparison_Of_Part-Of-Speech_And_Automatically_Derived_Category-Based_Language_Models_For_Speech_Recognition)

Chapter 10

  1. Mohri, M., Pereira, F., & Riley, M. (2008). Speech Recognition with Weighted Finite-State Transducers. Springer Handbook of Speech Processing. (https://cs.nyu.edu/~mohri/pub/hbka.pdf)
  2. Panayotov, V. (2012). Decoding graph construction in Kaldi: A visual walkthrough. [Blog post]. (http://vpanayotov.blogspot.com/2012/06/kaldi-decoding-graph-construction.html)
  3. Brno University of Technology. (N.D.). Lecture on Weighted Finite State Transducers in Automatic Speech Recognition. (http://www.fit.vutbr.cz/study/courses/ZRE/public/pred/13_wfst_sid_lid/zre_lecture_asr_wfst.pdf)


Languages

Jupyter Notebook 99.9%, Python 0.1%