A complete set of tools for making your favourite podcast searchable.
Originally created for the Jupiter network podcasts, using Meilisearch.
The project contains two main modules:
- podcast2text: a CLI tool for downloading an RSS feed and transcribing podcast episodes
- search-load: a CLI tool for loading the obtained transcriptions into a Meilisearch instance
To build the project you need the following packages on your system:
- cargo
- pkg-config
- openssl
- ffmpeg
There is a Nix flake configured to provide the build dependencies;
just run direnv allow
and then build with:
cargo build --release
To appease the gods of good taste, please enable the following pre-commit hook:
git config --local core.hooksPath .githooks
- Create cache directories and download the whisper model
mkdir -p {models,output}
# this might be one of:
# "tiny.en" "tiny" "base.en" "base" "small.en" "small" "medium.en" "medium" "large"
model=tiny.en
curl -L --output models/$model.bin https://huggingface.co/datasets/ggerganov/whisper.cpp/resolve/main/ggml-$model.bin
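A typo in the model name only shows up as a failed or bogus download, so it can help to check the name first. The following is a minimal sketch, not part of this project: `is_known_model` is a hypothetical helper that accepts only the model names listed in the comment above.

```shell
# Hypothetical helper (not shipped with this project):
# succeeds only for the model names listed in the download step.
is_known_model() {
  case "$1" in
    tiny.en|tiny|base.en|base|small.en|small|medium.en|medium|large) return 0 ;;
    *) return 1 ;;
  esac
}

model=tiny.en
if is_known_model "$model"; then
  # Same target path as used by the curl command in the download step.
  echo "will download to models/$model.bin"
else
  echo "unknown model: $model" >&2
fi
```

Running the check before curl avoids saving an error page as models/$model.bin.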
- Run the inference on the RSS feed
# get information about the cli
docker run flakm/podcast2text rss --help
docker run \
-v $PWD/models:/data/models \
-v $PWD/output:/data/output \
flakm/podcast2text \
rss \
--num-of-episodes 2 \
https://feed.jupiter.zone/allshows
# or using cargo
cargo run --bin podcast2text --release -- \
--model-path=models/tiny.en.bin \
--output-dir=output/ \
--threads-per-worker=4 \
--download-dir=catalog \
rss \
--worker-count=6 \
https://feed.jupiter.zone/allshows
The output directory should now contain JSON files with each episode's transcription and metadata. Note that the results are cached, so if you restart the job it will not re-download or re-process RSS entries it has already seen.
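A quick way to confirm that a run produced something is to count the files in the output directory. This is only a sketch: `count_json` is a hypothetical helper, and it assumes the transcriptions are written directly into the directory with a .json extension, which may not match the tool's actual layout.

```shell
# Hypothetical helper: count the *.json files directly inside a directory.
# Assumption: transcriptions use a .json extension and are not nested in subdirectories.
count_json() {
  find "$1" -maxdepth 1 -type f -name '*.json' | wc -l | tr -d ' '
}

# "output" is the directory created in the first step.
if [ -d output ]; then
  echo "transcriptions so far: $(count_json output)"
fi
```

Because results are cached, the count should stay the same when the job is restarted without new RSS entries.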
The project uses Meilisearch as the back-end engine for its search functionality:
docker pull getmeili/meilisearch:v0.29
docker run -it --rm \
-p 7700:7700 \
-e MEILI_MASTER_KEY='MASTER_KEY' \
-v $(pwd)/meili_data:/meili_data \
getmeili/meilisearch:v0.29 \
meilisearch --env="development"
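Before loading any transcriptions it is worth checking that the instance is reachable. The sketch below assumes the container started above is running and that port 7700 is published on localhost; the `meili_url` variable is introduced here for illustration only.

```shell
# Assumption: the container above is running and port 7700 is published on localhost.
meili_url="http://localhost:7700"

# GET /health is Meilisearch's health endpoint; it does not require an API key.
if curl -fsS --max-time 5 "$meili_url/health"; then
  echo
  echo "Meilisearch is up"
else
  echo "Meilisearch is not reachable at $meili_url" >&2
fi
```

If the check fails, confirm that the container is still running and that nothing else is bound to port 7700.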