SemanTweet Search lets you search your Twitter archive using semantic similarity. It preprocesses your tweets, generates embeddings with OpenAI's `text-embedding-3-small` or `text-embedding-3-large` model, stores the data and embeddings in a LanceDB vector database, and provides a web interface to search and view the results.

You can also post-filter semantic search results by time, likes, or retweets, or restrict results to media-only or link-only tweets.
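Because LanceDB accepts SQL-style predicates alongside vector search, these post-filters can be composed into a single `where` clause. Below is a minimal, hypothetical sketch of how such a clause might be built from the filter options; the column names (`favorite_count`, `retweet_count`, `created_at`, `has_media`, `has_link`) are assumptions and may not match the project's actual schema.

```python
# Hypothetical sketch: compose the search filters into one SQL-style predicate,
# of the kind LanceDB's .where() accepts. Column names are assumptions and may
# differ from the project's actual schema.
def build_filter(min_likes=None, min_retweets=None,
                 after=None, before=None,
                 media_only=False, link_only=False):
    clauses = []
    if min_likes is not None:
        clauses.append(f"favorite_count >= {int(min_likes)}")
    if min_retweets is not None:
        clauses.append(f"retweet_count >= {int(min_retweets)}")
    if after is not None:
        clauses.append(f"created_at >= '{after}'")
    if before is not None:
        clauses.append(f"created_at <= '{before}'")
    if media_only:
        clauses.append("has_media = true")
    if link_only:
        clauses.append("has_link = true")
    return " AND ".join(clauses)

predicate = build_filter(min_likes=10, after="2023-01-01", media_only=True)
print(predicate)
# favorite_count >= 10 AND created_at >= '2023-01-01' AND has_media = true

# The predicate would then be combined with the vector search, e.g.:
# table.search(query_vector).where(predicate).limit(20).to_pandas()
```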
Uses:

- Twitter archive for data
- Semantic search using OpenAI embeddings
- LanceDB for vector search and SQL operations
- Flask for the server

Currently, only OpenAI embeddings are supported.
Requirements:

- Python 3.x
- OpenAI API key
- Twitter archive data

Setup:

1. Clone the repository:

   ```shell
   git clone https://github.com/sankalp1999/semantweet-search.git
   ```

2. Download your Twitter archive (takes about 2 days to become available) and extract it. Put the extracted folder at the root of this project and rename it to `twitter-archive`.

3. Create a virtual environment:

   ```shell
   python3 -m venv venv
   ```

4. Activate the virtual environment:

   - For Unix/Linux:

     ```shell
     source venv/bin/activate
     ```

   - For Windows:

     ```shell
     venv\Scripts\activate
     ```

5. Install the required dependencies:

   ```shell
   pip install -r requirements.txt
   ```

6. Set up your OpenAI API key as an environment variable:

   ```shell
   export OPENAI_API_KEY=your_api_key
   ```

7. Choose the desired OpenAI embedding model (small or large) in the `openai/async_openai_embedding_two.py` file.

8. Run the setup script:

   ```shell
   chmod +x run_scripts.sh
   ./run_scripts.sh
   ```

9. Start the application:

   ```shell
   python app.py
   ```

   or

   ```shell
   flask run
   ```

Enjoy!
```mermaid
graph TD
    A[Twitter Archive Data] --> B[preprocess_tweets_one.py]
    B --> C[Preprocessed Tweets CSV]
    C --> D[async_openai_embedding_two.py]
    D --> E[Embeddings CSV]
    E --> F[create_lance_db_table_openai_three.py]
    F --> G[LanceDB Database]
    G --> H[Web Interface]
    H --> I[Search Tweets]
    I --> J[View Results]
```
The OpenAI embedding flow consists of the following steps:

1. `preprocess_tweets_one.py`: preprocesses the tweets from the Twitter archive, extracting relevant information and saving it to a CSV file.
2. `async_openai_embedding_two.py`: reads the preprocessed tweets from the CSV file, generates embeddings using OpenAI's embedding model asynchronously, and saves the embeddings to a new CSV file.
3. `create_lance_db_table_openai_three.py`: reads the generated embeddings from the CSV file, creates a LanceDB table using the specified schema, and stores the data in the database.

The `run_scripts.sh` script automates the execution of these steps in the correct order.
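As an illustration of the preprocessing step: the Twitter archive stores tweets in `data/tweets.js` as a JavaScript assignment (`window.YTD.tweets.part0 = [...]`) rather than plain JSON, so the script must strip that prefix before parsing. The sketch below is a minimal, hypothetical version of that parsing; the actual field selection in `preprocess_tweets_one.py` may differ.

```python
import json

def load_archive_tweets(raw_js: str) -> list[dict]:
    """Strip the JS assignment prefix from tweets.js and parse the JSON array."""
    json_part = raw_js[raw_js.index("["):]  # drop "window.YTD.tweets.part0 = "
    return json.loads(json_part)

def extract_fields(entries: list[dict]) -> list[dict]:
    """Keep only the fields the search index needs (field choice is illustrative)."""
    rows = []
    for entry in entries:
        t = entry["tweet"]
        rows.append({
            "id": t["id_str"],
            "text": t["full_text"],
            "created_at": t["created_at"],
            "favorite_count": int(t["favorite_count"]),
            "retweet_count": int(t["retweet_count"]),
        })
    return rows

raw = ('window.YTD.tweets.part0 = [{"tweet": {"id_str": "1", "full_text": "hello", '
       '"created_at": "Mon Jan 01 00:00:00 +0000 2024", '
       '"favorite_count": "3", "retweet_count": "1"}}]')
rows = extract_fields(load_archive_tweets(raw))
print(rows[0]["text"])  # hello
```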
Notes:

- The project uses the `text-embedding-3-large` model by default. You can change the model by modifying the `MODEL_NAME` variable in the relevant scripts.
- The batch size for generating embeddings is set to 32 to stay within the token limit. Adjust the batch size if needed.
- The LanceDB database is stored in the `data/openai_db` directory.
- The project also includes a synchronous version of the OpenAI embedding generation script (`create_openai_embedding_sync_two.py`), which can be used as an alternative to the asynchronous version.
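The batching mentioned above can be sketched as follows: tweet texts are chunked into groups of 32 and each chunk is sent as one embeddings request. This is a minimal, hypothetical illustration; the request code in the actual scripts may differ.

```python
BATCH_SIZE = 32  # matches the project's default, chosen to stay within token limits

def batched(texts: list, size: int = BATCH_SIZE) -> list:
    """Split texts into consecutive chunks of at most `size` items."""
    return [texts[i:i + size] for i in range(0, len(texts), size)]

# Each batch would then be sent in a single embeddings request, e.g. (hypothetical):
# response = await client.embeddings.create(model=MODEL_NAME, input=batch)

texts = [f"tweet {i}" for i in range(70)]
batches = batched(texts)
print(len(batches), len(batches[-1]))  # 3 6  (batches of 32, 32, and 6)
```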