Project Description:
Miscellaneous AI projects that prototype the use of AI and machine learning to improve predictive results.
You may find useful code or an approach here, but the README may not be well explained and the code likely requires refactoring; these are prototypes. However, I will specify whether each project works, and I will provide a video of the project running.
Projects included:
1. vectordb/gpt-embeddings: Vectorize documents in a Pinecone vector database for LLM queries:
- Works at the time of upload
- Downloads the Bank of England's Monetary Policy Report (November 2023) PDF
- Chunks the PDF so each page can be vectorized
- Vectorizes each page in Pinecone using the GPT Retrieval Plugin with the FastAPI web framework (https://blog.devgenius.io/getting-started-with-fast-api-c7e52e68685f). The GPT Retrieval Plugin, a tool released by OpenAI, serves as the database interface, handling all chunking, embedding-model calls, and vector-database interaction.
- The user asks gpt-3.5 a question about the document and specifies N embeddings (PDF pages) relevant to the question; the model uses these to contextualize its response
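The page-chunking step can be sketched roughly as follows. This is a minimal illustration only; in the project the real work is done by the GPT Retrieval Plugin, and `chunk_pages`, `max_chars`, and `overlap` are hypothetical names, not the plugin's API:

```python
def chunk_pages(pages, max_chars=1000, overlap=100):
    """Split each page's text into overlapping character chunks.

    pages: list of strings, one per PDF page.
    Returns a list of dicts ready to be sent for embedding.
    """
    chunks = []
    for page_num, text in enumerate(pages):
        start = 0
        while start < len(text):
            end = min(start + max_chars, len(text))
            chunks.append({"page": page_num, "text": text[start:end]})
            if end == len(text):
                break
            # step back by `overlap` so adjacent chunks share context
            start = end - overlap
    return chunks

# One fake 1800-character page stands in for a real PDF page.
chunks = chunk_pages(["inflation outlook " * 100])
```

Overlapping chunks help a retrieved page keep the sentences that straddle a chunk boundary.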
You can also see the app execution video here: pinecone-embeddings-to-gpt_.mp4
Special thanks to @Roulin for the clear instructions in the blog post linked above.
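Once the top-N pages are retrieved from Pinecone, stitching them into the chat prompt is straightforward. A minimal sketch; the function name and prompt wording are illustrative, not the project's actual code:

```python
def build_prompt(question, retrieved_chunks):
    """Combine the user question with retrieved page texts into one prompt
    that asks the model to answer only from the supplied context."""
    context = "\n\n".join(
        f"[page {c['page']}] {c['text']}" for c in retrieved_chunks
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# The resulting string would be sent as the user message in an
# OpenAI chat-completions call against gpt-3.5-turbo.
prompt = build_prompt(
    "What is the Bank's inflation forecast?",
    [{"page": 3, "text": "CPI inflation is projected to fall..."}],
)
```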
2. classification/ada_and_randomforest: Mail spam classification using OpenAI embeddings and a random forest classification model
- Vectorizes the mail dataset with OpenAI's text-embedding-ada-002
- Trains a random forest classification model on the embedding vectors (features) and labels (spam or ham)
- Tests the model and reports stats
```
(oai310env) sergio@Home-Win11:~/my-repos/tooling-ai/classification/ada_and_randomforest$ ./classify_ada_rndforest.py
Start to train the model.
Time elapsed to train the model for 50 mails: 0 minutes, 0 seconds, 48 milliseconds
              precision    recall  f1-score   support

           0       0.75      1.00      0.86         3
           1       1.00      0.86      0.92         7

    accuracy                           0.90        10
   macro avg       0.88      0.93      0.89        10
weighted avg       0.93      0.90      0.90        10
```
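The train/test flow can be sketched with scikit-learn. In this sketch, real ada-002 embeddings (1536-dimensional) are replaced by small synthetic vectors so it runs without an API key, and all names are illustrative rather than the project's actual code:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for ada-002 embeddings: 50 mails x 8 dims (real vectors are 1536-dim).
X = rng.normal(size=(50, 8))
# Toy labels: pretend dimension 0 separates spam (1) from ham (0).
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train the random forest on embedding vectors and labels.
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)

# Report per-class precision/recall/f1, as in the output above.
print(classification_report(y_test, model.predict(X_test)))
```

With real embeddings, only `X` and `y` change: each row becomes an ada-002 vector for a mail, and each label comes from the dataset's spam/ham column.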
Special thanks to Kaggle for the dataset and the GeeksforGeeks community for the clear instructions.