This project aims to build a system that automatically transcribes speech to text from sources such as YouTube videos and audio files. The system is built with the NeMo toolkit, a toolkit for building state-of-the-art conversational AI models.
Supported functions:
- Collect data from YouTube
- Process data
- Automatic Speech Recognition (ASR)
- Speaker Diarization
- Pronunciation/Grammar Assessment
I recommend using Anaconda to create the environment:
conda create -n asr python=3.10
conda activate asr
Clone the repository
git clone https://github.com/Foxxy-HCMUS/automatic-speech-recognition.git
sudo apt-get install ffmpeg
pip install git+https://github.com/m-bain/whisperX.git@78dcfaab51005aa703ee21375f81ed31bc248560
pip install dora-search lameenc openunmix wget Cython
pip install --no-build-isolation "nemo_toolkit[asr]==1.23.0"
pip install --no-deps git+https://github.com/facebookresearch/demucs#egg=demucs
pip install git+https://github.com/oliverguhr/deepmultilingualpunctuation.git
pip install ctranslate2==3.24.0
Install requirements
# Install in editable mode to avoid constant re-installation
# Also include all optional dependencies
python -m pip install -e .[all]
# Install pre-commit hooks to automatically check/format code on commits
pre-commit install
In case PyTorch was not compiled with CUDA support, run the following command:
pip install torch==1.13.1+cu116 torchaudio==0.13.1 torchvision==0.14.1+cu116 --extra-index-url=https://download.pytorch.org/whl/cu116
- Please open the notebook task_1.ipynb and run all cells to see the full pipeline for ASR and Speaker Diarization.
- Currently, the system is in development and will be available soon. Code for this task is in the task_2.ipynb notebook.
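Before running the notebooks, it may help to confirm that the installation steps took effect. The following is a minimal sketch, not part of the repository: the function name `check_environment` is hypothetical, and it only checks two things mentioned in this README, whether `ffmpeg` is on the PATH and whether the installed PyTorch build has CUDA support.

```python
import importlib.util
import shutil


def check_environment():
    """Report whether ffmpeg and a CUDA-enabled PyTorch are available.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    report = {
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_cuda_build": None,  # CUDA version PyTorch was compiled with, if any
        "cuda_available": False,   # True only if a usable GPU is visible at runtime
    }
    if report["torch_installed"]:
        # Imported lazily so the check still runs when PyTorch is missing.
        import torch

        report["torch_cuda_build"] = torch.version.cuda  # None for CPU-only builds
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

If `torch_cuda_build` is reported as `None`, the installed PyTorch is a CPU-only build, which is the case the reinstall command above is meant to address.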
- Collect
python /src/asr/collect_data.py
- Preprocess
python /src/asr/parser.py
- Clean up
./clean_up.sh
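The three steps above can also be chained from a single script. This is a sketch under stated assumptions rather than a tool shipped with the repository: `build_steps` and `run_steps` are hypothetical names, the script paths are taken to be relative to the repository root, the scripts are assumed to need no command-line arguments (the README shows none), and `clean_up.sh` is assumed to be executable.

```python
import subprocess
import sys
from pathlib import Path


def build_steps(repo_root="."):
    """Return the (name, command) pairs for Collect, Preprocess, Clean up.

    Assumption: paths are relative to the repository root.
    """
    root = Path(repo_root).resolve()
    return [
        ("Collect", [sys.executable, str(root / "src" / "asr" / "collect_data.py")]),
        ("Preprocess", [sys.executable, str(root / "src" / "asr" / "parser.py")]),
        ("Clean up", [str(root / "clean_up.sh")]),
    ]


def run_steps(steps, dry_run=True):
    """Print each step and, unless dry_run is set, execute it in order.

    check=True stops the sequence at the first failing step.
    """
    executed = []
    for name, command in steps:
        print(f"[{name}] {' '.join(command)}")
        if not dry_run:
            subprocess.run(command, check=True)
        executed.append(name)
    return executed


if __name__ == "__main__":
    # Dry run by default: only prints the commands that would be executed.
    run_steps(build_steps(), dry_run=True)
```

Calling `run_steps(build_steps(), dry_run=False)` from the repository root would execute the steps for real. Stopping at the first failure keeps the clean-up step from running after a failed collect or preprocess step.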