el2e10 / Speech-To-Text


Data Engineering (Speech-To-Text)

Setup

pip install -r requirements.txt

In addition to the Python packages, the pipeline uses the command-line tools 'yt-dlp' and 'GNU parallel' (see the pipeline stages below).

Introduction

The project aims to build a data engineering pipeline for curating a Speech-To-Text dataset. The data is collected from publicly available lectures on NPTEL: the audio and the corresponding transcripts are collected from the website to create the dataset.

Stages of pipeline

  • Downloading Audio and Transcript

    • NPTEL has all of their lectures uploaded to YouTube. The audio from these videos can be extracted directly using a tool called 'yt-dlp'.
    • The transcripts of the videos are saved as PDF files stored on Google Drive. To download the PDFs from Google Drive, a Python package called 'gdown' is used.
    • All the links to these resources can be scraped from the course page.
    from data_collector import DataCollector
    
    collector = DataCollector(course_id)
    collector.execute()
    
    # The audio will be stored in Data/Audio and transcript in Data/Transcripts
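    • The internals of DataCollector are not shown here; as a rough illustration only (not the repository's actual code), the two downloads could be performed with yt-dlp's Python API and gdown as sketched below. The URLs and output paths are placeholders.
    import yt_dlp
    import gdown

    def download_audio(youtube_url, output_dir="Data/Audio"):
        """Extract the audio track of a lecture video with yt-dlp."""
        options = {
            "format": "bestaudio",                        # keep only the audio stream
            "outtmpl": f"{output_dir}/%(title)s.%(ext)s"  # name files after the video title
        }
        with yt_dlp.YoutubeDL(options) as downloader:
            downloader.download([youtube_url])

    def download_transcript(drive_url, output_path):
        """Fetch a transcript PDF shared via Google Drive."""
        gdown.download(drive_url, output_path, quiet=False, fuzzy=True)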
  • Preprocessing audio

    • The audio downloaded from YouTube is in .webm format; it is converted to .wav with a 16 kHz sampling rate and a single (mono) channel.
    • A shell script called 'audio_preprocessor.sh' is used for the conversion.
    • To speed up the conversion, a tool called 'GNU parallel' is used to parallelize the work across n CPUs.
    • Every audio file has a 10-second intro and 32-second end credits; these portions are removed as they contain no speech.
    • To perform the audio preprocessing, run the following bash command.
    bash audio_preprocessor.sh audio_directory_path output_directory_path
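    • As a rough illustration of the conversion that audio_preprocessor.sh performs (not the script itself), and assuming ffmpeg and ffprobe are installed, a single file could be processed in Python as sketched below.
    import subprocess

    INTRO_SECONDS = 10   # intro with no speech
    OUTRO_SECONDS = 32   # end credits with no speech

    def convert_one(webm_path, wav_path):
        # Query the total duration so the end credits can be trimmed.
        probe = subprocess.run(
            ["ffprobe", "-v", "error", "-show_entries", "format=duration",
             "-of", "default=noprint_wrappers=1:nokey=1", webm_path],
            capture_output=True, text=True, check=True,
        )
        duration = float(probe.stdout.strip())

        # Skip the intro, stop before the credits, resample to 16 kHz mono WAV.
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(INTRO_SECONDS), "-i", webm_path,
             "-t", str(duration - INTRO_SECONDS - OUTRO_SECONDS),
             "-ar", "16000", "-ac", "1", wav_path],
            check=True,
        )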
  • Preprocessing text

    • The PDF transcript files are converted into .txt files at this stage of the pipeline.
    • The text undergoes preprocessing such as removal of punctuation and conversion to lowercase.
    • Numerical data in the text is converted into words (e.g. 10 -> ten).
    from text_preprocessor import TextPreprocessor
    
    preprocessor = TextPreprocessor(transcript_directory)
    preprocessor.execute()
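    • A minimal sketch of these normalization steps is shown below; the actual TextPreprocessor may differ, and 'num2words' is an assumed helper package rather than a confirmed project dependency.
    import re
    import string
    from num2words import num2words

    def normalize(text):
        text = text.lower()
        # Strip punctuation characters.
        text = text.translate(str.maketrans("", "", string.punctuation))
        # Spell out integers, e.g. "10" -> "ten" (hyphens from num2words become spaces).
        text = re.sub(r"\d+", lambda m: num2words(int(m.group())).replace("-", " "), text)
        # Collapse repeated whitespace.
        return " ".join(text.split())

    print(normalize("Welcome to Lecture 10."))  # -> "welcome to lecture ten"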
  • Create a training manifest file

    • The output of the data pipeline is a JSON lines file that contains details like audio_filepath, duration, and text.
    from manifest_generator import ManifestGenerator
    
    generator = ManifestGenerator(preprocessed_audio_directory, preprocessed_transcript_directory)
    generator.execute()
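    • A minimal sketch of how such a manifest could be assembled is shown below; the real ManifestGenerator may differ, and the output filename 'train_manifest.json' is a placeholder.
    import json
    import wave
    from pathlib import Path

    def write_manifest(audio_dir, text_dir, out_path="train_manifest.json"):
        with open(out_path, "w") as manifest:
            for wav_path in sorted(Path(audio_dir).glob("*.wav")):
                # Duration in seconds, read from the WAV header.
                with wave.open(str(wav_path)) as wav:
                    duration = wav.getnframes() / wav.getframerate()
                # Pair each audio file with the transcript of the same name.
                transcript = (Path(text_dir) / f"{wav_path.stem}.txt").read_text().strip()
                entry = {
                    "audio_filepath": str(wav_path),
                    "duration": round(duration, 2),
                    "text": transcript,
                }
                manifest.write(json.dumps(entry) + "\n")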
  • Create a dashboard

    • The dashboard shows visualizations and metrics of the text and audio content.
    • To view the dashboard, run the following command.
    streamlit run dashboard.py
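    • A minimal sketch of what dashboard.py might contain, assuming it reads the generated manifest; the manifest filename and the metrics shown are assumptions, not the project's actual dashboard.
    import pandas as pd
    import streamlit as st

    st.title("Speech-To-Text Dataset Dashboard")

    # Load the JSON Lines training manifest produced by the pipeline.
    manifest = pd.read_json("train_manifest.json", lines=True)

    st.metric("Utterances", len(manifest))
    st.metric("Total audio (hours)", round(manifest["duration"].sum() / 3600, 2))
    st.bar_chart(manifest["duration"])  # per-utterance duration distribution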
    

Note

Please refer to the 'testing.py' file to understand how to run the programs.
