This project contains two main Python files:
- `data_processing.py` - Handles PDF processing, chunking, and vector embedding.
- `app.py` - A Streamlit web interface for interacting with a chatbot based on the processed data.
`data_processing.py`:

- Loads and processes multiple PDFs listed in the `SOURCES` list.
- Extracts text from PDFs using PyMuPDF (`fitz`).
- Chunks text by token count using `tiktoken`.
- Embeds text into a Pinecone vector store using OpenAI embeddings (a sketch of this pipeline follows the list).
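The pipeline described above might look roughly like the following. This is a minimal sketch, not the actual contents of `data_processing.py`: the helper names (`pdf_to_text`, `chunk_by_tokens`, `embed_and_upsert`), the chunk size, the embedding model, and the metadata fields are all assumptions, and the snippet targets the current `openai` and `pinecone` Python clients.

```python
# Minimal sketch of the PDF -> chunks -> embeddings -> Pinecone pipeline.
# Helper names, chunk size, model name, and metadata keys are illustrative assumptions.
import os

import fitz  # PyMuPDF
import tiktoken
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
index = pc.Index(os.getenv("PINECONE_INDEX_NAME"))


def pdf_to_text(path: str) -> str:
    """Extract plain text from every page of a PDF with PyMuPDF."""
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def chunk_by_tokens(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into chunks of roughly chunk_size tokens using tiktoken."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def embed_and_upsert(source: str, doc_id: str) -> None:
    """Embed each chunk with OpenAI embeddings and upsert it into the Pinecone index."""
    for n, chunk in enumerate(chunk_by_tokens(pdf_to_text(source))):
        embedding = openai_client.embeddings.create(
            model="text-embedding-ada-002", input=chunk
        ).data[0].embedding
        index.upsert(vectors=[(f"{doc_id}-{n}", embedding, {"text": chunk, "source": source})])
```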
`app.py`:

- A Streamlit UI that allows users to interact with the embedded data.
- Uses LangChain's `RetrievalQA` for answering questions based on the Pinecone vector store (see the sketch after this list).
- Displays chat history and responses dynamically.
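A hedged sketch of how such an app can be wired together. The exact imports depend on your LangChain version (this assumes the `langchain-openai` and `langchain-pinecone` packages), and the titles, prompts, and session-state keys are placeholders rather than the real `app.py`:

```python
# Minimal Streamlit + LangChain RetrievalQA sketch (assumes langchain-openai and langchain-pinecone).
import os

import streamlit as st
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# Reuse the index that data_processing.py populated; API keys come from the environment.
vectorstore = PineconeVectorStore.from_existing_index(
    index_name=os.getenv("PINECONE_INDEX_NAME"),
    embedding=OpenAIEmbeddings(),
)
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    retriever=vectorstore.as_retriever(),
)

st.title("PDF Chatbot")  # placeholder title
if "history" not in st.session_state:
    st.session_state.history = []

question = st.text_input("Ask a question about the documents")
if question:
    answer = qa_chain.invoke({"query": question})["result"]
    st.session_state.history.append((question, answer))

# Render the running chat history.
for q, a in st.session_state.history:
    st.markdown(f"**You:** {q}")
    st.markdown(f"**Bot:** {a}")
```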
This project relies on environment variables for API keys and configuration. These should be stored in a .env file located in the project root.
.env Example:
OPENAI_API_KEY=your_openai_key_here
PINECONE_API_KEY=your_pinecone_key_here
PINECONE_ENV=your_pinecone_env
PINECONE_INDEX_NAME=your_index_name
⚠️ Important:
Make sure `.env` is added to your `.gitignore` file to avoid accidentally leaking your keys.
# .gitignore
.env

API keys are loaded using `os.getenv(...)`:
- In `data_processing.py`: `openai_api_key = os.getenv("OPENAI_API_KEY")`
- In `app.py`: `openai_api_key = os.getenv("OPENAI_API_KEY")`
If you want to switch environments or services, simply change the keys in your .env file; no code modification is needed.
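For `os.getenv(...)` to pick up values from the `.env` file, the scripts presumably load it first; a minimal sketch, assuming the `python-dotenv` package is used:

```python
# Minimal sketch of loading .env values into the environment (assumes python-dotenv).
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment

openai_api_key = os.getenv("OPENAI_API_KEY")
pinecone_api_key = os.getenv("PINECONE_API_KEY")

if not openai_api_key or not pinecone_api_key:
    raise RuntimeError("Missing API keys; check your .env file.")
```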
- Install Dependencies (a sketch of a possible requirements.txt follows these steps)

  pip install -r requirements.txt
- Prepare .env

  Create a `.env` file and insert your API keys as described above.
- Run the Streamlit App

  streamlit run app.py
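The repository's actual requirements.txt is not reproduced here; based on the libraries mentioned in this README, it presumably lists something like the following (package names and the LangChain/Pinecone split are assumptions, and older setups used `pinecone-client` instead of `pinecone`; pin versions as needed):

```
streamlit
openai
pinecone
langchain
langchain-openai
langchain-pinecone
pymupdf
tiktoken
python-dotenv
```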
- The `SOURCES` and `IDs` arrays in `data_processing.py` determine which PDFs are loaded. Add your PDFs there if needed (see the sketch below this list).
- Make sure the names in `SOURCES` match actual filenames in your project directory.
- Ensure `pinecone` and `openai` services are correctly set up before running the app.
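A hypothetical illustration of how `SOURCES` and `IDs` can line up; the file names and IDs below are placeholders, and `embed_and_upsert` refers to the helper sketched earlier in this README rather than a function known to exist in the project:

```python
# Hypothetical SOURCES / IDs arrays; replace the entries with your own PDFs.
SOURCES = [
    "annual_report_2023.pdf",
    "product_manual.pdf",
]
IDs = [
    "annual-report-2023",
    "product-manual",
]

# Each PDF in SOURCES is processed under the matching entry in IDs.
for source, doc_id in zip(SOURCES, IDs):
    embed_and_upsert(source, doc_id)
```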