Data Engineering (Speech To Text)
Setup
pip install -r requirements.txt
Additional tools required
Introduction
This project builds a data engineering pipeline for curating a Speech-To-Text dataset. The data comes from publicly available lectures on NPTEL: for each lecture, the audio and its corresponding transcript are collected from the website.
Stages of pipeline
-
Downloading Audio and Transcript
- NPTEL uploads all of its lectures to YouTube. The audio from these videos can be extracted directly using a tool called 'yt-dlp'.
- The transcripts of the videos are saved as PDF files and stored in Google Drive. A Python package called 'gdown' is used to download the PDFs from Google Drive.
- All the links to these resources can be scraped from the NPTEL course page.
from data_collector import DataCollector

collector = DataCollector(course_id)
collector.execute()
# The audio will be stored in Data/Audio and the transcripts in Data/Transcripts
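The download step described above can be sketched as follows. This is a minimal illustration and not the actual DataCollector code: the helper names build_audio_command and build_transcript_command are hypothetical, and the sketch assumes the 'yt-dlp' and 'gdown' command-line tools are installed. It only builds the commands, so nothing is downloaded until they are passed to subprocess.run.

```python
def build_audio_command(video_url, audio_dir="Data/Audio"):
    """Build a yt-dlp command that saves only the best audio stream.

    Hypothetical helper: the output template names each file after the
    YouTube video id, which is an assumption, not the project's scheme.
    """
    output_template = f"{audio_dir}/%(id)s.%(ext)s"
    return ["yt-dlp", "-f", "bestaudio", "-o", output_template, video_url]


def build_transcript_command(drive_url, pdf_path):
    """Build a gdown command that downloads one transcript PDF.

    --fuzzy lets gdown extract the file id from a full Google Drive link.
    """
    return ["gdown", "--fuzzy", drive_url, "-O", str(pdf_path)]


# Usage (VIDEO_URL is a placeholder for a link scraped from the course page):
#   import subprocess
#   subprocess.run(build_audio_command(VIDEO_URL), check=True)
```

Building the commands separately from running them keeps the scraping logic easy to test without network access.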
-
Preprocessing audio
- The audio downloaded from YouTube is in .webm format. It is converted to .wav format with a 16 kHz sampling rate and a single (mono) channel.
- A shell script called 'audio_preprocessor.sh' is used for conversion.
- To speed up the conversion, a tool called 'GNU parallel' is used to run it across n CPUs in parallel.
- Every audio file has a 10-second intro and 32 seconds of end credits. These portions are removed because they contain no speech.
- To perform audio preprocessing, run the following bash command.
bash audio_preprocessor.sh audio_directory_path output_directory_path
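The conversion and trimming rules above can be illustrated with a short Python sketch. The contents of 'audio_preprocessor.sh' are not shown here, so this is an assumption about one way to do it: it assumes ffmpeg is the converter, and build_ffmpeg_command is a hypothetical helper. The total duration of the input must be known beforehand (for example from ffprobe) so that the end credits can be cut.

```python
INTRO_SECONDS = 10     # intro at the start of every lecture
CREDITS_SECONDS = 32   # end credits at the end of every lecture


def build_ffmpeg_command(src, dst, total_seconds):
    """Build an ffmpeg command for 16 kHz mono .wav output without intro/credits.

    Hypothetical helper; total_seconds is the full length of the input audio.
    """
    keep = total_seconds - INTRO_SECONDS - CREDITS_SECONDS
    if keep <= 0:
        raise ValueError("audio is shorter than the intro plus the credits")
    return [
        "ffmpeg",
        "-ss", str(INTRO_SECONDS),  # seek past the intro before decoding
        "-i", str(src),
        "-t", str(keep),            # stop before the end credits begin
        "-ar", "16000",             # 16 kHz sampling rate
        "-ac", "1",                 # mono
        str(dst),
    ]
```

For a 600-second lecture this keeps the 558 seconds between the intro and the credits.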
-
Preprocessing text
- The PDF transcript files are converted into .txt files in this stage of the pipeline.
- The text is then preprocessed by removing punctuation and converting it to lowercase.
- Numbers in the text are converted to words (e.g. 10 -> ten).
from text_preprocessor import TextPreprocessor

preprocessor = TextPreprocessor(transcript_directory)
preprocessor.execute()
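The text normalization steps listed above can be sketched with the standard library alone. This is an illustration and not the TextPreprocessor implementation: the function names are hypothetical, PDF extraction is left out, and the number speller is a simplification that only handles whole numbers below one million (a decimal such as 3.5 would be read as two separate numbers).

```python
import re
import string

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = "" if rest == 0 else " " + number_to_words(rest)
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = "" if rest == 0 else " " + number_to_words(rest)
    return number_to_words(thousands) + " thousand" + tail


def normalize(text):
    """Spell out numbers, lowercase the text, and strip punctuation."""
    def spell(match):
        n = int(match.group())
        # Numbers outside the supported range are left as digits.
        return number_to_words(n) if n < 1_000_000 else match.group()

    text = re.sub(r"\d+", spell, text).lower()
    # Replace punctuation with spaces so that "week-2" does not become "weektwo".
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return " ".join(text.translate(table).split())
```

Numbers are spelled out before punctuation is removed so that digits attached to punctuation are still found as separate tokens.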
-
Create a training manifest file
- The output of the data pipeline is a JSON lines file that contains details like audio_filepath, duration, and text.
from manifest_generator import ManifestGenerator

generator = ManifestGenerator(preprocessed_audio_directory, preprocessed_transcript_directory)
generator.execute()
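As a rough sketch of what such a manifest looks like, the snippet below writes one JSON object per line with the audio_filepath, duration, and text fields. It is not the ManifestGenerator code: write_manifest is a hypothetical helper, and it assumes each .wav file is paired with a .txt transcript of the same base name, which the project may handle differently.

```python
import json
import wave
from pathlib import Path


def wav_duration(path):
    """Return the duration of a .wav file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_manifest(audio_dir, transcript_dir, manifest_path):
    """Write a JSON Lines manifest and return the number of entries.

    Assumption: lecture.wav is matched with lecture.txt by file stem.
    Audio files without a matching transcript are skipped.
    """
    count = 0
    with open(manifest_path, "w", encoding="utf-8") as out:
        for wav_path in sorted(Path(audio_dir).glob("*.wav")):
            txt_path = Path(transcript_dir) / (wav_path.stem + ".txt")
            if not txt_path.exists():
                continue
            entry = {
                "audio_filepath": str(wav_path),
                "duration": round(wav_duration(wav_path), 2),
                "text": txt_path.read_text(encoding="utf-8").strip(),
            }
            out.write(json.dumps(entry) + "\n")
            count += 1
    return count
```

Each line of the resulting file can be parsed independently with json.loads, which makes the manifest easy to stream for large datasets.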
-
Create a dashboard
- The dashboard shows visualizations and metrics for the text and audio contents.
- To see the dashboard, run the following command.
streamlit run dashboard.py
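Which metrics 'dashboard.py' actually displays is not listed here. As a hedged illustration, the sketch below computes a few summary numbers that a dashboard of this kind might show from the manifest entries. The function summarize and the chosen metrics are assumptions made for the example.

```python
def summarize(entries):
    """Compute simple dataset metrics from manifest entries.

    Each entry is a dict with at least "duration" (seconds) and "text".
    The selection of metrics is illustrative, not the project's own.
    """
    total_seconds = sum(entry["duration"] for entry in entries)
    words = [word for entry in entries for word in entry["text"].split()]
    return {
        "num_clips": len(entries),
        "total_hours": round(total_seconds / 3600, 2),
        "num_words": len(words),
        "vocabulary_size": len(set(words)),
    }
```

The resulting dictionary could then be passed to whatever display layer the dashboard uses.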
Note
Please refer to the 'testing.py' file to understand how to run the programs.