3amory99 / Podcust-Summary-Data-Pipeline-Using-Airflow

Creating a data pipeline using Airflow. The pipeline will download podcast episodes and automatically transcribe them using speech recognition. We'll store our results in a SQLite database that we can easily query.

Podcast Transcription Data Pipeline using Apache Airflow

Project Overview

In this project, we'll create a data pipeline using Apache Airflow to download podcast episodes and automatically transcribe them using speech recognition. The results will be stored in a SQLite database, making it easy to query and analyze the transcribed podcast content.

(pipeline diagram)

While this project doesn't strictly require the use of Apache Airflow, it offers several advantages:

  • We can schedule the project to run on a daily basis.
  • Each task can run independently, and we receive error logs for troubleshooting.
  • Tasks can be easily parallelized, and the project can run in the cloud if needed.
  • It provides extensibility for future enhancements, such as adding more advanced speech recognition or summarization.

By the end of this project, you'll have a solid understanding of how to utilize Apache Airflow and a practical project that can serve as a foundation for further development.
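The four steps below could be wired together roughly as follows. This is a minimal sketch using the TaskFlow API of Airflow 2.4+, not the project's actual DAG file (which lives in the code directory); the DAG id, task names, and daily schedule are illustrative assumptions:

```python
# Sketch of the pipeline as an Airflow DAG (requires Apache Airflow 2.4+).
# Names are illustrative, not taken from the project code.
from airflow.decorators import dag, task
import pendulum

@dag(
    dag_id="podcast_summary",              # hypothetical DAG id
    schedule="@daily",                     # run the pipeline once a day
    start_date=pendulum.datetime(2024, 1, 1),
    catchup=False,
)
def podcast_summary():
    @task()
    def get_episodes():
        """Download the podcast metadata XML and parse it into a list of dicts."""
        ...

    @task()
    def load_episodes(episodes):
        """Insert any new episodes into the SQLite database."""
        ...

    @task()
    def download_episodes(episodes):
        """Fetch the audio file for each episode."""
        ...

    @task()
    def transcribe_episodes():
        """Run Vosk speech recognition over the downloaded audio."""
        ...

    episodes = get_episodes()
    load_episodes(episodes)
    download_episodes(episodes)
    transcribe_episodes()

podcast_summary()
```

Because each step is its own task, Airflow can retry or rerun one step without repeating the others.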

Project Steps

  1. Download Podcast Metadata XML and Parse

    • Obtain the metadata for podcast episodes by downloading and parsing an XML file.
  2. Create a SQLite Database for Podcast Metadata

    • Set up a SQLite database to store podcast metadata efficiently.
  3. Download Podcast Audio Files Using Requests

    • Download the podcast audio files from their sources using the Python requests library.
  4. Transcribe Audio Files Using Vosk

    • Implement audio transcription using the Vosk speech recognition library.
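Steps 1 and 2 can be sketched with the standard library alone. The feed below is a hypothetical minimal RSS document (real podcast feeds carry many more fields per item), and the table schema is an assumption for illustration:

```python
# Sketch of steps 1-2: parse an episodes XML feed and load it into SQLite.
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical minimal RSS feed; real feeds have many more fields.
SAMPLE_FEED = """<rss><channel>
  <item><title>Episode 1</title><link>https://example.com/ep1</link></item>
  <item><title>Episode 2</title><link>https://example.com/ep2</link></item>
</channel></rss>"""

def parse_episodes(xml_text):
    """Return a list of {title, link} dicts from an RSS-style feed."""
    root = ET.fromstring(xml_text)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]

def load_episodes(conn, episodes):
    """Create the table if needed and insert episodes, skipping duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS episodes (link TEXT PRIMARY KEY, title TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO episodes (link, title) VALUES (:link, :title)",
        episodes,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # the real pipeline would use a file path
load_episodes(conn, parse_episodes(SAMPLE_FEED))
count = conn.execute("SELECT COUNT(*) FROM episodes").fetchone()[0]
```

Using the episode link as a primary key with `INSERT OR IGNORE` makes the load step idempotent, so the daily schedule can safely reprocess a feed that mostly contains episodes it has already seen.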

Getting Started

Local Setup

Before you begin, ensure that Apache Airflow is installed locally. Please follow the Airflow installation guide to install it successfully.

Data

During the project, we'll download the required data, including a language model for Vosk and podcast episodes. If you wish to explore the podcast metadata, you can find it here.
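The download step can be sketched as a small idempotent helper: fetch a URL to disk, skipping files that a previous run already produced. The project uses the requests library; the stdlib urllib is used here only so the sketch has no third-party dependencies, and the function name is illustrative:

```python
# Sketch of the episode-download step: stream a URL to disk, skipping files
# that already exist so daily re-runs don't re-download everything.
import os
import shutil
import urllib.request

def download_episode(url, dest_path):
    """Download url to dest_path unless it already exists; return the path."""
    if os.path.exists(dest_path):
        return dest_path  # already fetched on a previous run
    with urllib.request.urlopen(url) as resp, open(dest_path, "wb") as out:
        shutil.copyfileobj(resp, out)  # stream in chunks, not all in memory
    return dest_path
```

The existence check is what makes the task safe to re-run on Airflow's daily schedule.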

Code

You can access the project code in the code directory.

Project Screenshots

  • Airflow SQLite database connection

    (screenshot: sqlite connection)

  • DAG

    (screenshot: dag)

  • Get Episodes task output

    (screenshot: get episodes)

Project Usage

To run the data pipeline, follow the steps provided in the steps.md file.


Languages

Language: Python 100.0%