paramsingh / transcribe

Home Page:https://transcribe.param.codes

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

transcribe

whisper experiments

notes: https://docs.google.com/document/d/17_RQSM3_GgVwC3HanT350tm9OaY-WZ9K4y6ZPJfiLOc/edit

trello: https://trello.com/b/523mepi1/transcribe

How to run

pip install -r requirements.txt
pip install -r requirements_dev.txt

Create config file, and fill in values as needed

cp transcribe/config.py.sample transcribe/config.py

Now, run the Flask API server.

python -m transcribe.api

Run the frontend

cd frontend
npm install
npm run dev

There's background scripts that you might want to run.

python -m transcribe.processor.transcribe # transcribes videos
python -m transcribe.processor.improve # improves transcriptions and creates embeddings

API

  • Create a request to transcribe a URL: POST /api/v1/transcribe

Production:

curl --header "Content-Type: application/json" \
  --request POST \
  --data '{"link":"https://www.youtube.com/watch?v=0lJKucu6HJc"}' \
  https://transcribe.param.codes/api/v1/transcribe

Localhost:

curl --header "Content-Type: application/json" \
  --request POST \
  --data '{"link":"https://www.youtube.com/watch?v=0lJKucu6HJc"}' \
  http://localhost:6550/api/v1/transcribe
  • Get results for url: GET /api/v1/transcription/<token>/details

https://transcribe.param.codes/api/v1/transcription/331118e7-8b4a-4e7f-bfe0-0cc5cec2a974/details

http://localhost:6550/api/v1/transcription/b59803b8-ef45-47fc-9c5a-8343df208179/details

Frontend dev

cd frontend
npm run dev

Run tests

First, install dev requirements.

pip install -r requirements_dev.txt

Then, just run:

pytest transcribe

Very important: Tests are not necessary for every single feature. Prioritize velocity over code coverage. However, we should try to keep all tests passing. pytest transcribe should be green.

Helpful stuff

Query to figure out how many transcriptions remain to be improved.

SELECT count(*)
          FROM transcription t
     LEFT JOIN embedding e
            ON t.id = e.transcription_id
         WHERE (t.improvement is NULL OR t.summary IS NULL OR e.id IS NULL)
           AND t.result is NOT NULL
           AND t.improvement_failed = 0
           -- HUGE HACK: not improving old transcriptions
           -- TODO (param): eventually create embeddings for these
           -- and remove this hack
           AND CAST(strftime('%s', date(t.created)) as integer) > CAST(strftime('%s', '2023-02-16') as integer)

About

https://transcribe.param.codes


Languages

Language:Python 50.5%Language:TypeScript 28.1%Language:Jupyter Notebook 16.4%Language:CSS 4.8%Language:Shell 0.1%Language:JavaScript 0.1%