Jupiter Search

Complete set of tools for making your favourite podcast searchable.

Originally created for Jupiter network podcasts, with Meilisearch as the search engine.

Overview

The project contains two main modules:

podcast2text: a CLI tool for downloading an RSS feed and transcribing its podcast episodes
search-load: a CLI tool for loading the resulting transcriptions into a Meilisearch instance

Getting started

To build, you need the following packages on your system:

cargo
pkg-config
openssl
ffmpeg
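
A quick way to confirm these are available before building is a small shell check. This is only a sketch and not part of the repository: the helper name `missing_tools` is made up for illustration, and it assumes the OpenSSL development files are registered with pkg-config under the name `openssl`, which may differ on your distribution.

```shell
#!/usr/bin/env bash
# Print the name of every given command that is not found on PATH.
# `missing_tools` is a hypothetical helper, not something the project ships.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Prints nothing when all three binaries are installed.
missing_tools cargo pkg-config ffmpeg

# openssl is needed as a library, so ask pkg-config instead of checking PATH.
# Assumption: the .pc file is named "openssl".
if command -v pkg-config >/dev/null 2>&1 && ! pkg-config --exists openssl; then
    echo "openssl development files not found by pkg-config"
fi
```

Any name printed by `missing_tools` still needs to be installed before running the build.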

A Nix flake is configured to provide the build dependencies. To use it, run direnv allow and then:

cargo build --release

To appease the gods of good taste, please add the following pre-commit hook:

git config --local core.hooksPath .githooks

Usage

Download and transcribe podcasts

Process audio from RSS feed

Create cache directories and download the whisper model

mkdir -p {models,output}
# this might be one of:
# "tiny.en" "tiny" "base.en" "base" "small.en" "small" "medium.en" "medium" "large"
model=tiny.en

curl -L --output models/$model.bin https://huggingface.co/datasets/ggerganov/whisper.cpp/resolve/main/ggml-$model.bin
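
Because the model name is interpolated straight into the URL, a typo produces a bad download instead of a clear error. The sketch below validates the name against the list above before building the URL. The helper `model_url` is hypothetical and only covers the model names listed in this README.

```shell
#!/usr/bin/env bash
# Build the download URL for a whisper model name, rejecting names that are
# not in the list above. `model_url` is a hypothetical helper.
model_url() {
    case "$1" in
        tiny.en|tiny|base.en|base|small.en|small|medium.en|medium|large)
            printf 'https://huggingface.co/datasets/ggerganov/whisper.cpp/resolve/main/ggml-%s.bin\n' "$1"
            ;;
        *)
            echo "unknown model: $1" >&2
            return 1
            ;;
    esac
}

# Example: print the URL for the smallest English-only model.
model_url tiny.en
```

It can be combined with the download step as `curl -L --fail --output "models/$model.bin" "$(model_url "$model")"`, where `--fail` makes curl exit with an error on an HTTP failure instead of saving the error page as the model file.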

Run the inference on the RSS feed

# get information about the cli
docker run flakm/podcast2text rss --help

docker run \
    -v $PWD/models:/data/models \
    -v $PWD/output:/data/output \
    flakm/podcast2text \
    rss \
    --num-of-episodes 2 \
    https://feed.jupiter.zone/allshows 

# or using cargo
cargo run --bin podcast2text --release -- \
    --model-path=models/tiny.en.bin \
    --output-dir=output/ \
    --threads-per-worker=4 \
    --download-dir=catalog \
    rss \
    --worker-count=6 \
    https://feed.jupiter.zone/allshows

The output directory should now contain JSON files with each episode's transcription and metadata. Note that results are cached, so if you restart the job it will not re-download or re-process RSS entries it has already seen.
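
One way to watch progress, or to confirm that a restarted job reuses its cache, is to count the transcripts before and after a run. This is a minimal sketch: `count_transcripts` is a hypothetical helper, and it assumes the files carry a `.json` suffix and live somewhere under the output directory.

```shell
#!/usr/bin/env bash
# Count the .json files under a directory (recursively).
# `count_transcripts` is a hypothetical helper; the ".json" suffix is assumed.
count_transcripts() {
    find "$1" -type f -name '*.json' | wc -l | tr -d '[:space:]'
}

# Example: report how many transcripts exist so far, if output/ is present.
if [ -d output ]; then
    echo "transcripts so far: $(count_transcripts output)"
fi
```

If the count is unchanged after rerunning with the same feed and episode limit, the previously processed entries were skipped.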

Create search engine

Install meilisearch

The project uses Meilisearch as the back end for its search functionality.

docker pull getmeili/meilisearch:v0.29
docker run -it --rm \
    -p 7700:7700 \
    -e MEILI_MASTER_KEY='MASTER_KEY' \
    -v $(pwd)/meili_data:/meili_data \
    getmeili/meilisearch:v0.29 \
    meilisearch --env="development"
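
Before loading data it can help to wait until the container answers. Meilisearch exposes a `/health` endpoint for this. The sketch below polls it once per second. `wait_for_meili` is a hypothetical helper, and the default URL assumes the port mapping from the command above.

```shell
#!/usr/bin/env bash
# Poll Meilisearch's /health endpoint until it answers or attempts run out.
# `wait_for_meili` is a hypothetical helper; URL and attempt count are arguments.
wait_for_meili() {
    local base_url="${1:-http://localhost:7700}"
    local attempts="${2:-30}"
    local i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f makes curl return non-zero on HTTP errors; -sS hides progress output.
        if curl -fsS "$base_url/health" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}
```

For example, `wait_for_meili http://localhost:7700 30 && echo "meilisearch is up"` waits up to roughly thirty seconds before giving up.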

Run index creation and data loading

About

Convert a podcast RSS feed to transcriptions using the whisper model

Apache License 2.0