varepsilon123 / Mini-Search-Engine

A lightweight search engine designed to index and search programming documentation. This project features a Python-based web crawler (Scrapy), a high-performance indexing system (Tantivy), and a simple, user-friendly web interface for querying.

Repository on GitHub: https://github.com/varepsilon123/Mini-Search-Engine


The Product

The product is currently hosted on a VPS and is accessible at fungthedev.fun/index.html.

Installation

To install the required dependencies, run the following command:

pip install .

Precompiling Tantivy Package

To speed up the installation process, you can precompile the Tantivy package. First, ensure you have the wheel package installed:

pip install wheel

Then, precompile the packages (especially for Tantivy) and save the wheel file to a specified folder:

pip wheel . -w {folder path to put wheel}

Installing from Precompiled Wheels

If you have precompiled wheels, you can install the dependencies from the wheels directory:

pip install --find-links={folder path to put wheel} -r requirements.txt

Activating the Virtual Environment

To activate the virtual environment, run the following command:

source venv/bin/activate
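
If the virtual environment does not exist yet, it can be created first (this assumes it lives in a venv/ folder, as the activation path above implies):

python -m venv venv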

Running the Crawler

To run the crawler, use the following command from the root folder:

python main.py crawl

Running the Indexer

To build the search index, run the following command from the root folder:

python main.py index
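
For reference, the core of an indexing pass with the Tantivy Python bindings looks roughly like the sketch below. The field names (title, content, url) mirror the boosts discussed in the ranking section; the index path and the sample document are placeholders, not the project's actual code.

import tantivy

# Define the schema: title, content, and url are tokenized and stored.
schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("title", stored=True)
schema_builder.add_text_field("content", stored=True)
schema_builder.add_text_field("url", stored=True)
schema = schema_builder.build()

# Open (or create) the on-disk index; the directory must already exist.
index = tantivy.Index(schema, path="index_data")

# Add crawled pages and commit so they become searchable.
writer = index.writer()
writer.add_document(tantivy.Document(
    title="Example page",
    content="Example body text taken from the crawler output.",
    url="https://example.com/docs",
))
writer.commit()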

Database Connection

The database is currently hosted on CockroachDB. You can fork the repo and use any database.

To connect to the database, you need to download the CA certificates. Run the following command:

curl --create-dirs -o $HOME/.postgresql/root.crt 'https://cockroachlabs.cloud/clusters/928ba3a6-9973-40e6-883a-125edc5f29ae/cert'

Set up the database URL environment variable:

export DATABASE_URL="cockroachdb://<SQL-USER-NAME>:<ENTER-SQL-USER-PASSWORD>@paula-the-crawler-7529.j77.aws-us-west-2.cockroachlabs.cloud:26257/defaultdb?sslmode=verify-full"

Right now I am using a .env file for the SQL username and password.
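
As an illustration, the connection can then be made by loading those values from the .env file. The sketch below assumes python-dotenv and the sqlalchemy-cockroachdb dialect are installed, and that .env provides DATABASE_URL (or the credentials used to build it); it is not the project's exact code.

import os

from dotenv import load_dotenv
from sqlalchemy import create_engine, text

# Pull DATABASE_URL (and any SQL credentials) from the local .env file.
load_dotenv()

# The cockroachdb:// scheme requires the sqlalchemy-cockroachdb dialect package.
engine = create_engine(os.environ["DATABASE_URL"])

with engine.connect() as conn:
    print(conn.execute(text("SELECT now()")).scalar())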

API Service

The API service is built using Flask, a lightweight WSGI web application framework in Python. To run the API as a service, I use Gunicorn, a Python WSGI HTTP Server for UNIX. This combination allows for efficient handling of multiple requests and ensures the API is robust and scalable.

To start the API service, use the following command:

gunicorn -w 4 -b 0.0.0.0:8000 app:app

This command runs the Flask application (app) with Gunicorn, using 4 worker processes and binding to all IP addresses on port 8000.
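
For context, the app:app target refers to a Flask application object named app inside app.py. A minimal sketch of such a module is shown below; the /search endpoint and the way it calls into the index are illustrative assumptions, not the actual API surface.

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/search")
def search():
    # The query arrives as ?q=...; the index lookup is a placeholder for
    # the Tantivy search described in the ranking section below.
    query = request.args.get("q", "")
    results = []  # e.g. results = run_tantivy_search(query)
    return jsonify({"query": query, "results": results})

if __name__ == "__main__":
    app.run()

Gunicorn imports app from this module and serves it with the four worker processes specified above.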


Why Choose Tantivy Over Vespa for Indexing

Project Requirement

Our primary goal is to build a lightweight, low-latency search engine that can meet a latency requirement of 50ms. The system will be relatively simple and will not require advanced features like ML-based ranking.

Pros and Cons

Tantivy

Pros:

  • Lightweight and low-latency.
  • High performance for single-node or simple systems.
  • Easier to integrate and customize for specific needs.

Cons:

  • Requires manual optimization for scaling.
  • Lacks advanced features like ML-based ranking.

Vespa

Pros:

  • Built-in distributed architecture and fault tolerance.
  • Scalable for large datasets and complex ranking needs.
  • Supports real-time updates and analytics.

Cons:

  • More complex to set up and manage.
  • Higher resource consumption compared to Tantivy.

Why Tantivy Fits the Project Requirement

Given our goal of achieving a lightweight, low-latency search engine with a 50ms latency requirement, Tantivy is a suitable choice. It provides the necessary performance for our single-node system and allows for easier integration and customization. While Vespa offers advanced features and scalability, these are not required for our current project scope, making Tantivy the more efficient and straightforward option.

Challenges Encountered

Crawler

One of my biggest struggles was learning how to crawl from scratch and tune the crawler for optimal performance on both my PC and a low-spec VPS. Without proper optimization, the load was too heavy for such limited resources. I had to experiment with various settings and strategies to ensure efficient crawling without overwhelming the system.
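
For example, Scrapy's concurrency and AutoThrottle settings were the kind of knobs involved. The setting names below are standard Scrapy options, but the values are only illustrative for a low-spec VPS, not the exact configuration this project uses:

# settings.py (illustrative values for a resource-constrained host)
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 0.5

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

ROBOTSTXT_OBEY = True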

Tantivy Indexing

One of my biggest struggles was achieving better relevancy in search results. In the early stages of development, I encountered an issue where a simple word appearing too many times in the content would significantly increase the score. This led to less relevant documents being ranked higher, which was not ideal for the search engine's performance.

Ranking Optimization

The ranking strategy involves boosting certain fields to improve the relevancy of search results. Specifically, the following boosts are applied:

  • Title: Boosted by 1.5 to prioritize documents with matching titles.
  • Content: Boosted by 2.0 to give higher importance to documents with matching content.
  • URL: Boosted by 0.5 to give some weight to documents with matching URLs, but less than title and content.

I used a combination of disjunction max queries and boolean queries to create a complex query that balances these boosts. This approach ensures that documents with relevant content and titles are ranked higher, while still considering the URL matches.

The search process involves:

  1. Splitting the query string into individual terms.
  2. Creating disjunction max queries for each term with the specified boosts.
  3. Combining these queries into a boolean query with Occur.Must to ensure all terms are considered.
  4. Caching the complex query for future use to improve performance.

This strategy helps achieve high relevancy in search results by prioritizing important fields and balancing the boosts effectively.
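
A sketch of this query construction with the Tantivy Python bindings is shown below. It assumes a recent tantivy-py release that exposes Query.term_query, Query.boost_query, Query.disjunction_max_query, Query.boolean_query, and Occur; the caching layer from step 4 is omitted.

from tantivy import Occur, Query

FIELD_BOOSTS = {"title": 1.5, "content": 2.0, "url": 0.5}

def build_query(index, query_string):
    schema = index.schema
    subqueries = []
    # Step 1: split the query string into terms (lowercased to match the
    # default tokenizer's output).
    for term in query_string.lower().split():
        # Step 2: one boosted term query per field, combined with dis-max
        # so the best-matching field dominates that term's score.
        per_field = [
            Query.boost_query(Query.term_query(schema, field, term), boost)
            for field, boost in FIELD_BOOSTS.items()
        ]
        subqueries.append((Occur.Must, Query.disjunction_max_query(per_field)))
    # Step 3: every term must match somewhere in the document.
    return Query.boolean_query(subqueries)

The resulting query can then be executed with index.searcher().search(query, 10), and the built query object can be cached per query string (step 4).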

Using Proxy in Crawler

There are a lot of request_dropped signals in my Scrapy spider, due to various reasons such as robots.txt restrictions, throttling, and IP bans. To handle this, I would employ a proxy strategy as follows:

  • Identify Proxy Providers: Research and select reliable proxy providers that offer rotating proxies to avoid IP bans.
  • Configure Proxies in Scrapy: Update the Scrapy settings to include the proxy middleware and configure it to use the selected proxies.
  • Implement Proxy Middleware: Write a custom middleware to dynamically assign proxies to each request (see the sketch after this list).
  • Handle Proxy Failures: Implement logic to handle proxy failures and retry requests with a different proxy.
  • Monitor and Adjust: Continuously monitor the effectiveness of the proxies and adjust the strategy as needed to ensure successful crawling.
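
A minimal sketch of such a downloader middleware is below. The PROXY_LIST setting and the module path used to register it are assumptions for illustration; the retry handling is deliberately simplified.

import random

class RotatingProxyMiddleware:
    """Assigns a proxy from the (assumed) PROXY_LIST setting to each request."""

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST would hold entries like "http://user:pass@host:port".
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)

    def process_exception(self, request, exception, spider):
        # On a proxy failure, retry once with a different proxy.
        if self.proxies and not request.meta.get("proxy_retried"):
            request.meta["proxy"] = random.choice(self.proxies)
            request.meta["proxy_retried"] = True
            return request

The middleware would then be enabled in the Scrapy settings, for example:

DOWNLOADER_MIDDLEWARES = {
    "crawler.middlewares.RotatingProxyMiddleware": 610,
}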


License: MIT


Languages

  • Python 71.6%
  • JavaScript 20.3%
  • CSS 4.2%
  • HTML 3.9%