arya-vinayak / G2

VectorBlaze 🔥

VectorBlaze is a cron-driven website scraper that streams its results continuously to a backend product-analytics pipeline for G2, processing 1000+ different products in under 10 seconds.

The focus is on fast web scraping and fast product analysis, using the Qdrant vector database, vector similarity indexing, and fast API interaction.

Architecture 🗺️

[Image: architecture diagram]

Key Features 🔑

  • Web Scraping with Selenium: Utilizes Selenium to scrape data from websites periodically as a cron job, ensuring compatibility with dynamic web pages and periodic updates.
  • Lightning Fast Communication using Redis Pub/Sub: Integrates Redis Pub/Sub to efficiently communicate between components.
  • Data Analysis and Indexing: Employs similarity indexing to quickly check if scraped data already exists in the database.
  • Vector Database: Stores and retrieves data in the Qdrant vector database, with custom queries and a neural search engine for faster lookups.
  • Streamlit Frontend: Offers a user-friendly interface for visualizing products not present in the database, enhancing user interaction and data exploration.
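
As a rough illustration of the Redis Pub/Sub hand-off between components, the sketch below serialises a scraped product as JSON, publishes it on a channel, and decodes it again on the subscriber side. The channel name `scraped_products`, the helper names, and the payload fields are assumptions made for illustration and are not taken from the repository. `client` is expected to be a redis-py `redis.Redis` instance, or any object with a compatible `publish` method.

```python
import json

# Hypothetical channel name; the project's actual channel is not documented here.
CHANNEL = "scraped_products"


def publish_product(client, product, channel=CHANNEL):
    """Serialise one scraped product as JSON and publish it.

    `client` is assumed to expose redis-py's `publish(channel, message)`,
    which returns the number of subscribers that received the message.
    """
    return client.publish(channel, json.dumps(product))


def decode_message(message):
    """Turn a redis-py pub/sub message dict back into a product dict.

    redis-py delivers dicts with a "type" key; only "message" entries carry
    published data (others are subscribe/unsubscribe confirmations).
    """
    if message.get("type") != "message":
        return None
    data = message["data"]
    if isinstance(data, bytes):  # bytes unless the client uses decode_responses=True
        data = data.decode("utf-8")
    return json.loads(data)
```

With a Redis server listening on port 6379, the publishing side could call `publish_product(redis.Redis(host="localhost", port=6379), {"name": "Example"})`, while the consuming side subscribes through `client.pubsub()` and passes each item from `listen()` to `decode_message`.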

Performance Highlights 🏎

  • Speed: Processes 1000+ products in under 10 seconds, showcasing the framework's high-speed capabilities.
  • Efficiency: Utilizes similarity indexing to quickly identify existing data, reducing unnecessary processing.
  • Scalability: Designed with scalability in mind, allowing for easy expansion and integration of new features.
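
The duplicate check behind the similarity indexing can be sketched in plain Python. This is a stand-in only: a vector database such as Qdrant answers the same question with a nearest-neighbour index rather than the linear scan shown here. The function names and the 0.9 threshold are illustrative assumptions, not values from the project.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_known_product(candidate, known_vectors, threshold=0.9):
    """Return True if any stored vector is at least `threshold` similar.

    The threshold is an assumed value; a real deployment would tune it.
    """
    return any(cosine_similarity(candidate, v) >= threshold for v in known_vectors)
```

A scraped product whose embedding passes this check would be treated as already present and skipped; anything else is a candidate for the "not in the database" view.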

Prerequisites 😎

  • Python 3.6+
  • Redis Stack (local or cloud-based)
  • Selenium WebDriver (local or containerised)
  • Qdrant vector database
  • Drive to have fun!

Run Locally - start cookin... 💻

Setting Up Redis Stack Container

To start a Redis Stack container using the redis-stack image, follow these steps:

  1. Run the following command in your terminal:
docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest

This command launches a Redis Stack container and exposes RedisInsight on port 8001. You can access RedisInsight by opening your browser and navigating to localhost:8001.


Setting Up Qdrant

To set up Qdrant, follow these steps:

  1. Download the Qdrant image from DockerHub:
docker pull qdrant/qdrant
  2. Start Qdrant inside Docker with the following command:
docker run -p 6333:6333 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  qdrant/qdrant

Once Qdrant is running, you can access its Web UI by navigating to localhost:6333/dashboard.
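
Before moving on, it can help to confirm that the containers are actually accepting connections. The following is a minimal sketch using only the standard library; the port numbers come from the docker commands above, while the helper names are hypothetical.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports published by the docker commands in the setup steps.
SERVICES = {"Redis": 6379, "RedisInsight": 8001, "Qdrant": 6333}


def check_services(host="localhost"):
    """Map each service name to whether its port currently accepts connections."""
    return {name: port_is_open(host, port) for name, port in SERVICES.items()}
```

Calling `check_services()` returns a dictionary keyed by service name. A `False` entry usually means the corresponding container is not running or its port is mapped differently.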


Setting Up Environment Variables

Before proceeding further, create a .env file in your project directory and add the following line:

BEARER_TOKEN=your_bearer_token_here

Replace your_bearer_token_here with your actual bearer token.
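
One way a script might read this value is sketched below, using only the standard library. It assumes the token is sent as a standard HTTP `Authorization: Bearer` header, which this README does not confirm. The helper names are hypothetical, and the project itself may load the file differently (for example with a dotenv package).

```python
import os


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip("\"'")
    return values


def auth_headers(path=".env"):
    """Build an Authorization header from BEARER_TOKEN.

    A value already set in the process environment takes precedence over the file.
    """
    token = os.environ.get("BEARER_TOKEN") or load_env_file(path).get("BEARER_TOKEN")
    if not token:
        raise RuntimeError("BEARER_TOKEN is not set")
    return {"Authorization": f"Bearer {token}"}
```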


Generating Data

Now, generate the required data by following these steps:

  1. Run the ProductsCollector.py script to create G2_Products.json.
  2. Pre-process G2_Products.json to produce G2_Cleaned.json.
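
The README does not say what the pre-processing step involves, so the following is only a guess at a typical cleaning pass. It assumes `G2_Products.json` holds a list of objects with `name` and `description` fields; both field names and both helper functions are hypothetical and should be adapted to the file's real schema.

```python
import json


def clean_products(products):
    """Trim whitespace, drop entries without a name, and remove duplicate names."""
    seen = set()
    cleaned = []
    for product in products:
        name = str(product.get("name") or "").strip()
        if not name or name.lower() in seen:
            continue
        seen.add(name.lower())
        # Collapse runs of whitespace (including newlines) into single spaces.
        description = " ".join(str(product.get("description") or "").split())
        cleaned.append({"name": name, "description": description})
    return cleaned


def clean_file(src="G2_Products.json", dst="G2_Cleaned.json"):
    """Read the raw JSON list from `src` and write the cleaned list to `dst`."""
    with open(src, encoding="utf-8") as handle:
        products = json.load(handle)
    cleaned = clean_products(products)
    with open(dst, "w", encoding="utf-8") as handle:
        json.dump(cleaned, handle, ensure_ascii=False, indent=2)
    return len(cleaned)
```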

Building the Neural Search Engine

To build the neural search engine, follow these steps:

  1. Open the Qdrant_store.ipynb notebook.
  2. Sequentially run the cells in the notebook to vectorize the G2 Products data and prepare the neural search engine.

Running the Processor

Once everything is set up, run the processor.py script to perform high-speed processing:

python processor.py

This script will handle the heavy lifting tasks.


Running Selenium Web Scraper

To set up and run the Selenium web scraper, first start the Selenium container:

docker run -d -p 4000:4444 -v /dev/shm:/dev/shm selenium/standalone-chrome
  • Install the Python libraries:
pip install -r requirements.txt
  • Run the Selenium web scraper:
python3 scraper/fetchSourceForge.py
  • Visit http://localhost:4000/ to see the scraper in action.
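
To check that the Selenium container is ready before starting the scraper, its status endpoint can be queried. This sketch assumes a Selenium 4 image, which serves the WebDriver `/status` endpoint at the root of the mapped port (older images may use a different path). The helper names are hypothetical.

```python
import json
import urllib.request


def grid_is_ready(status_payload):
    """Interpret a WebDriver /status response of the form {"value": {"ready": bool}}."""
    return bool(status_payload.get("value", {}).get("ready", False))


def fetch_grid_status(base_url="http://localhost:4000", timeout=5.0):
    """GET <base_url>/status and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/status", timeout=timeout) as response:
        return json.load(response)
```

With the container running, `grid_is_ready(fetch_grid_status())` should return `True` once Chrome is available for new sessions.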

Now your environment should be set up and ready to go! If you encounter any issues, feel free to reach out for assistance.
