cryptobeaver / ADSCB

Another Document Scraper Chat Bot

LLM-BSD Project

This project is a Python application that uses Streamlit for its interactive web frontend, LanceDB for data storage and retrieval, and the Cohere API for natural language processing tasks.


Prerequisites

Before running the project, make sure you have the following:

  • Python 3.x installed
  • Poetry installed (for dependency management)
  • Cohere API key
  • LanceDB database set up

Project Structure

The project consists of the following main files, one for each stage of the pipeline:

  • Scrapes data from a specified website and saves it as HTML files.
  • Parses the scraped HTML files and extracts relevant content.
  • Processes the parsed data, generates embeddings, and stores them in a LanceDB table.
  • Implements a FastAPI endpoint that accepts user queries, searches the LanceDB table, and generates responses using the Cohere API.
  • Builds a Streamlit web application that allows users to interact with the FastAPI endpoint and view the generated responses.
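
The parsing stage in the list above can be pictured with a small standard-library sketch. It is not the project's parser: the names `TextExtractor`, `html_to_text`, and `parse_directory` are hypothetical, the `.html` suffix for scraped files is an assumption, and the real code may rely on a different HTML library or extract different content.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the text fragments of an HTML document, one per line."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return "\n".join(extractor.chunks)


def parse_directory(folder: Path) -> list[Path]:
    """Write a .parsed file next to every .html file (assumed suffix) in folder."""
    written = []
    for html_file in sorted(folder.glob("*.html")):
        parsed_file = html_file.with_suffix(".parsed")
        text = html_to_text(html_file.read_text(encoding="utf-8"))
        parsed_file.write_text(text, encoding="utf-8")
        written.append(parsed_file)
    return written
```

Writing each `.parsed` file beside its source mirrors the layout the usage steps describe, so the later processing stage only needs a single folder path.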


Installation

  1. Clone the repository:

    git clone <repository-url>
  2. Install Poetry if you haven't already:

    pip install poetry
  3. Install the project dependencies using Poetry:

    poetry install
  4. Set up the Cohere API key:

    • Create a .env file in the project root directory.
    • Add the following line to the .env file:
    • Replace your-api-key with your actual Cohere API key.
  5. Set up the LanceDB database:

    • Make sure you have LanceDB installed and running.
    • Update the database connection details in the relevant files if necessary.
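
As a rough illustration of step 4, the sketch below reads an API key from the environment or from a `.env` file using only the standard library. The variable name `COHERE_API_KEY` is an assumption, as are the helper names `load_env_file` and `get_cohere_api_key`; use whichever variable name the project's code actually reads.

```python
import os
from pathlib import Path


def load_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=value lines, ignoring blank lines and # comments."""
    values = {}
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_cohere_api_key(env_path: Path = Path(".env")) -> str:
    """Return the key from the environment, falling back to the .env file."""
    # COHERE_API_KEY is an assumed name, not confirmed by the project.
    key = os.environ.get("COHERE_API_KEY")
    if not key and env_path.exists():
        key = load_env_file(env_path).get("COHERE_API_KEY")
    if not key or key == "your-api-key":
        raise RuntimeError("Cohere API key is missing or still set to the placeholder")
    return key
```

Failing early on the `your-api-key` placeholder gives a clearer error than an authentication failure later in the pipeline.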


Usage

  1. Run the scraper to fetch data from the desired website:

    poetry run python
    • Modify the base_url, output_dir, filter_pattern, and sitemap_filename variables in the scraper script according to your requirements.
  2. Parse the scraped HTML files:

    poetry run python
    • The parsed data will be saved as .parsed files in the same directory as the scraped HTML files.
  3. Process the parsed data and store it in the LanceDB table:

    poetry run python
    • Update the base_url_for_docs, table_name, and folder_path variables in the processing script to match your setup.
  4. Start the FastAPI endpoint:

    poetry run uvicorn run_api_endpoint:app --reload
    • The API endpoint will be accessible at http://localhost:8000/ask.
  5. Run the Streamlit frontend:

    poetry run streamlit run
    • The Streamlit web application will be accessible at http://localhost:8501.
  6. Interact with the web application by entering your queries and viewing the generated responses.
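
Once the endpoint from step 4 is running, it can also be queried without the Streamlit frontend. The sketch below is a guess at the request shape: it assumes the endpoint accepts a POST with a JSON body of the form `{"query": "..."}`, which the steps above do not confirm, and the helper names `build_ask_request` and `ask` are hypothetical. Check the endpoint code for the actual method and field name.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/ask"  # address given in step 4


def build_ask_request(question: str, url: str = API_URL) -> urllib.request.Request:
    """Build the HTTP request without sending it."""
    # Assumption: POST with a JSON body {"query": "..."}; the real schema may differ.
    body = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, url: str = API_URL, timeout: float = 30.0) -> str:
    """Send the question and return the raw response body as text."""
    request = build_ask_request(question, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server running, calling `ask("your question")` returns the raw response body; if the assumed payload is wrong, FastAPI typically answers with a 422 validation error that describes the expected fields.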

Static Changes

  • In the data-processing script (step 3 above), update the following variables to match your setup:

    • base_url_for_docs: Set it to the base URL of the documentation website you scraped.
    • table_name: Specify the name of the LanceDB table where you want to store the processed data.
    • folder_path: Set it to the directory path where the parsed data files (.parsed) are located.
  • In the API endpoint file, ensure that the table_name variable matches the name of the LanceDB table you specified in the data-processing script.
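
To show why the table name has to agree between the writing and the reading side, here is a toy sketch. It deliberately swaps in an in-memory dictionary for LanceDB and a bag-of-words vector for Cohere embeddings, so none of it reflects the project's actual storage or embedding code; the names `TABLES`, `embed`, `cosine`, `store`, and `search` are all hypothetical.

```python
import math
from collections import Counter

# In-memory stand-in for a database: table name -> list of (text, vector) rows.
TABLES: dict[str, list[tuple[str, Counter]]] = {}


def embed(text: str) -> Counter:
    """Toy bag-of-words vector, a stand-in for real embeddings."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def store(table_name: str, texts: list[str]) -> None:
    """Writing side: embed each text and append it to the named table."""
    TABLES.setdefault(table_name, []).extend((text, embed(text)) for text in texts)


def search(table_name: str, query: str, limit: int = 1) -> list[str]:
    """Reading side: return the stored texts most similar to the query."""
    if table_name not in TABLES:
        raise KeyError(f"table {table_name!r} not found; both sides must use the same table_name")
    query_vector = embed(query)
    ranked = sorted(TABLES[table_name], key=lambda row: cosine(query_vector, row[1]), reverse=True)
    return [text for text, _ in ranked[:limit]]
```

If the processing side calls `store("docs", ...)` and the query side calls `search("documents", ...)`, the lookup fails immediately, which is the mismatch the note above warns about.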


Contributing

Contributions to this project are welcome. If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.


License

This project is licensed under the MIT License.



