Chat with 100+ YouTube videos from any creator in less than 10 minutes. This project combines basic Python scripting, vector embeddings, OpenAI, Pinecone, and Langchain into a modern chat interface, so you can quickly reference any content your favorite YouTuber covers. Ask questions in natural language and get detailed answers (1) in the style and tone of your YouTuber and (2) with the top 2-3 relevant videos hyperlinked.
The example used in this repo is tech content creator Marques Brownlee, also known as MKBHD.
Note: these instructions are for macOS; adjust accordingly for Windows / Linux.
Clone and install dependencies:
git clone https://github.com/vdutts7/yt-ai-chat
cd yt-ai-chat
npm i
Copy .env.example and rename it to .env in the root directory. Fill out the API keys:
ASSEMBLY_AI_API_TOKEN=""
OPENAI_API_KEY=""
PINECONE_API_KEY=""
PINECONE_ENVIRONMENT=""
PINECONE_INDEX=""
Get API keys:
- AssemblyAI - ~ $3.50 per 100 vids
- OpenAI
- Pinecone
IMPORTANT: Verify that .gitignore contains .env so your API keys are never committed.
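Both checks can be automated before running anything else. The following is a minimal sketch, not part of the repo: the helper names (parse_env, missing_keys, env_is_ignored) are hypothetical, and it assumes .env uses plain KEY="value" lines as in the template above.

```python
from pathlib import Path

# Keys listed in .env.example above.
REQUIRED_KEYS = [
    "ASSEMBLY_AI_API_TOKEN",
    "OPENAI_API_KEY",
    "PINECONE_API_KEY",
    "PINECONE_ENVIRONMENT",
    "PINECONE_INDEX",
]


def parse_env(text):
    """Parse KEY="value" lines into a dict, skipping blanks and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env_text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or left empty."""
    values = parse_env(env_text)
    return [key for key in required if not values.get(key)]


def env_is_ignored(gitignore_text):
    """True if .gitignore has a line that ignores .env exactly."""
    lines = [line.strip() for line in gitignore_text.splitlines()]
    return ".env" in lines or "/.env" in lines


if __name__ == "__main__":
    env_path, ignore_path = Path(".env"), Path(".gitignore")
    if env_path.exists():
        print("Missing or empty keys:", missing_keys(env_path.read_text()))
    if ignore_path.exists():
        print(".env ignored by git:", env_is_ignored(ignore_path.read_text()))
```

Run it from the root directory. It only looks for an exact .env entry, so a broader pattern in .gitignore that also covers .env would not be detected by this sketch.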
Outline:
- Export metadata (.csv) of YouTube videos ⬇️
- Download the audio files
- Transcribe audio files
Navigate to the scripts folder, which will host all of the data from the YouTube videos.
cd scripts
Set up the Python environment:
conda env list
conda activate youtube-chat
pip install -r requirements.txt
Scrape the YouTube channel. Replace @mkbhd with the channel of your choice and 100 with the number of videos you want included (the script works backwards from the most recent upload). A new file, mkbhd.csv, will be created at the path given in the command below:
python scripts/scrape_vids.py https://www.youtube.com/@mkbhd 100 scripts/vid_list/mkbhd.csv
Refer to example_mkbhd.csv inside the folder and verify that your output matches its format.
Download audio files:
python scripts/download_yt_audios.py scripts/vid_list/mkbhd.csv scripts/audio_files/
We use AssemblyAI's API wrapper class for OpenAI's Whisper API. Their script provides step-by-step directions for faster, more efficient speech-to-text conversion, since Whisper on its own is much slower and costs more. Transcribing the 100 MKBHD videos cost me about $3.50.
python scripts/transcribe_audios.py scripts/audio_files/ scripts/transcripts
Upsert to Pinecone database:
python scripts/pinecone_helper.py scripts/vid_list/mkbhd.csv scripts/transcripts/
For the Pinecone index I used the P1 pod type, since it is optimized for speed, and a dimension of 1536. That is the size of the embedding vectors OpenAI returns, so the index has to match it for queries against the vectorstore to work.
Breaking down scripts/pinecone_helper.py:
- Chunk size of 1000 characters with a 500-character overlap. This worked for me, but experiment and adjust according to your content library's size, complexity, etc.
- Metadata: (1) video URL and (2) video title
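To make the chunking settings concrete, here is a minimal sketch of fixed-size character windows with overlap, each tagged with the two metadata fields. It is an illustration only: chunk_text and build_records are hypothetical names, and the actual script may split differently (for example on sentence or paragraph boundaries).

```python
def chunk_text(text, chunk_size=1000, overlap=500):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def build_records(transcript, url, title, chunk_size=1000, overlap=500):
    """Pair every chunk with the video URL and title as metadata."""
    return [
        {"text": chunk, "metadata": {"url": url, "title": title}}
        for chunk in chunk_text(transcript, chunk_size, overlap)
    ]
```

With a 500-character overlap, every window starts 500 characters after the previous one, so each stretch of transcript appears in two chunks. That roughly doubles the number of vectors stored, in exchange for fewer answers being cut off at a chunk boundary.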
With the Pinecone vectorstore loaded, we use Langchain's Conversational Retrieval QA to take a question, retrieve the most relevant transcript chunks along with their metadata, and deliver a packaged answer back to the user.
The relevant video titles are cited as hyperlinks pointing directly to the video URL.
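The citation step can be sketched as follows. This is an assumption about the shape of the data, not the repo's code (which lives in the TypeScript API route): format_sources is a hypothetical helper, and it assumes each retrieved match carries the url and title metadata stored at upsert time.

```python
def format_sources(matches, limit=3):
    """Return Markdown links for the first `limit` distinct videos."""
    seen = set()
    links = []
    for match in matches:
        meta = match["metadata"]
        # Several chunks can come from the same video; cite it once.
        if meta["url"] in seen:
            continue
        seen.add(meta["url"])
        links.append(f"[{meta['title']}]({meta['url']})")
        if len(links) == limit:
            break
    return links
```

Matches are expected in relevance order, so the first distinct videos are the ones cited. Deduplicating by URL matters here because overlapping chunks from one video often rank next to each other.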
The frontend is Next.js styled with Tailwind CSS. src/pages/index.tsx contains the base skeleton. src/pages/api/chat-chain.ts is the heart of the code, where the Langchain connections are outlined. You should now be able to type and ask questions. Done ✅
- Add sidebar of video links to reference
- User auth + DB backend to store chat history / log queries
- Improve bot personality: edit the prompt template in /src/pages/api/chat-chain.ts to fine-tune the output so it sounds more realistic.