Appointat / Chat-with-Document-s-using-ChatGPT-API-and-Text-Embedding

A Data Science Expert Chatbot: Chat with Document(s) using the ChatGPT API and Text Embeddings

Chat-with-Document-s-using-ChatGPT-API-and-Text-Embedding

Main idea

The short answer: the system converts documents that are hundreds or even thousands of pages long into a numeric representation of the data and its context (vector embeddings) and stores them in a vector search engine. When a user chats with the document (i.e., asks a question), the system finds and returns text similar to the question using an algorithm called Approximate Nearest Neighbor (ANN) search. It looks something like this:

Fig. 1: What is vector search? Resources (images, documents, audio) are converted to vector representations, a nearest-neighbor search finds the closest matches, and those matches become the output/results.

The program then includes the returned text in a prompt and asks the question again, this time via the OpenAI GPT-3 API. This returns the kind of polished response you are used to from ChatGPT. The benefits of this approach are that the prompt stays small, not 1,000 pages of documents, and the user gets the answer they are looking for.
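Conceptually, the whole loop fits in a few lines. Here is a minimal sketch, assuming a hypothetical vector_store object with a search() method (any vector search engine would do); the helper names are placeholders for illustration, not the repo's actual API:

import openai

def answer_from_documents(question, vector_store):
    # Embed the question with the same model used to embed the documents.
    question_embedding = openai.Embedding.create(
        model="text-embedding-ada-002", input=question
    )["data"][0]["embedding"]

    # ANN search: fetch only the few chunks most similar to the question.
    similar_chunks = vector_store.search(question_embedding, top_k=3)
    context = "\n\n".join(chunk.text for chunk in similar_chunks)

    # Ask the question again with the retrieved chunks as context --
    # the prompt stays small instead of containing 1,000 pages.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response["choices"][0]["message"]["content"]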

On a side note, if you are worried about something like ChatGPT providing incorrect answers to a question, then you can point it to your organization’s knowledge base and provide answers from there. The accuracy of the answers will be as good as your knowledge base.

Product Introduction: TraceTalk 💬

Stepping into the future of conversational AI, we are thrilled to introduce TraceTalk, a cutting-edge solution that blends state-of-the-art artificial intelligence with interactive learning. By reshaping the traditional approach to knowledge acquisition, we are making learning as dynamic, engaging, and personable as conversing with a close friend. 🚀

Fig. 2: Roadmap and framework of TraceTalk.

The Novelty of TraceTalk 📚

Ditching the stereotype of conventional chatbots, TraceTalk is conceived as an AI-fueled conversational companion whose primary aim is to make learning exciting and immersive. Drawing on New Bing's innovative interactive approach, TraceTalk is engineered to be more than a mere provider of information: it becomes a personal, interactive learning assistant that transforms the way we learn and interact with information, seamlessly weaving in relevant references such as verified links, illustrations, video jump links, and code blocks. This integration provides a rich, context-driven learning experience that extends beyond the usual confines of traditional learning mechanisms.

The core of our AI taps into the prodigious knowledge repository available on Ocademy, an open-source educational initiative. As a result, TraceTalk is empowered to pull data from an incessantly evolving and expanding base of knowledge, ensuring your interactions are always current, informative, and enriching.

A Comprehensive Technological Framework 🏗️

At the heart of TraceTalk is an intricate, scalable architecture built on a harmonious blend of modern technologies. Our product focuses on leveraging the strengths of Large Language Models (LLMs), allowing us to cater to specific domain expertise without stifling innovation. It epitomizes the potential of cloud-centric, microservices-based, data-driven systems: it stores library-like knowledge in databases and presents it to the user, when needed, in an understandable and efficient way. This advanced structure amplifies the product's flexibility, scalability, and availability, assuring a smooth, uninterrupted user experience regardless of the volume of interactions.

Our frontend is bolstered by the efficient Next.js framework, which excels at server-side rendering. The backend, driven by Python, harnesses the language's power to manage complex data processing tasks and administer sophisticated business logic. The two are connected via a RESTful API, which also mediates communication with our high-performance Qdrant database hosted on the cloud. ☁️
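As an illustration, a backend search endpoint along these lines might look like the minimal Flask sketch below; the route, collection name, and payload shape are assumptions for illustration, not the repo's actual code:

import os

from flask import Flask, jsonify, request
from langchain.embeddings.openai import OpenAIEmbeddings
from qdrant_client import QdrantClient

app = Flask(__name__)
embeddings = OpenAIEmbeddings()

# Credentials come from the .env file described in the installation guide below.
client = QdrantClient(url=os.environ["QDRANT_URL"], api_key=os.environ["QDRANT_API_KEY"])

@app.route("/api/search", methods=["POST"])
def search():
    question = request.json["question"]
    hits = client.search(
        collection_name="ocademy-docs",  # hypothetical collection name
        query_vector=embeddings.embed_query(question),
        limit=5,
    )
    return jsonify([hit.payload for hit in hits])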

A Technological Marvel 🤖

The uniqueness of TraceTalk is deeply rooted in its technological infrastructure. Powering this conversational AI marvel is OpenAI's GPT-3.5 Turbo API, positioning it amongst the most sophisticated conversational AIs currently available. In addition, our engineering team has innovatively utilized Python's multithreading capabilities and queue systems to handle substantial traffic, thereby ensuring seamless interactions irrespective of the scale.
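The worker-queue pattern referred to above can be sketched with nothing but the standard library; this illustrates the idea rather than TraceTalk's exact implementation, with handle_chat standing in for the app's real chat logic:

import queue
import threading

request_queue = queue.Queue()

def handle_chat(question):
    ...  # placeholder for the app's actual chat logic

def worker():
    # Each worker thread pulls chat requests off the shared queue, so bursts
    # of traffic are buffered rather than overwhelming the backend.
    while True:
        question, reply_callback = request_queue.get()
        try:
            reply_callback(handle_chat(question))
        finally:
            request_queue.task_done()

# A small pool of worker threads processes queued requests concurrently.
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()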

In the realm of database management, TraceTalk leverages the substantial benefits of cloud technology. By opting to train our models and store our data on the cloud, we have geared TraceTalk to handle massive volumes of data and deliver the optimal conversational experience.

The Dawn of Conversational Evolution 💬

TraceTalk transcends the boundaries of a standard chatbot - it signifies the dawn of a new era in conversational interaction. As we continue to expand and evolve, we eagerly look forward to integrating more features and refining the user experience. For now, we are proud to present an AI conversational assistant that is not just technologically advanced, but also committed to delivering a smooth, intuitive, and engaging user experience. Welcome to the future of conversation, and welcome to TraceTalk! 🎉

Our Vision: The Future of Learning 🎯

At TraceTalk, we firmly believe that the potential of AI and machine learning extends beyond technological advancement - it holds the power to revolutionize the way we learn, interact, and communicate. Our vision is to harness this vast potential and shape a world where information is not just readily accessible but interactive, where learning transforms from a monotonous task to a fascinating conversation. We invite you to join us on this thrilling journey as we collectively shape the future of learning. Welcome aboard TraceTalk! 🌟

In addition, TraceTalk supports what we call 'selfish data': private, domain-specific knowledge. Our database currently hosts over 400 documents in private domains 🔐, and yet we're only scratching the surface of the model's capacity.

We also acknowledge that interactions may occasionally veer off-topic or include inappropriate content 🙅. To manage such situations, we've built in mechanisms to filter out these interactions and guide the conversation back to its intended course.
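One way such a guardrail can be built is with OpenAI's moderation endpoint; the sketch below is an illustration under that assumption, not necessarily TraceTalk's exact mechanism, and handle_chat is again a hypothetical stand-in:

import openai

def handle_chat(message):
    ...  # placeholder for the app's normal chat logic

def guarded_reply(message):
    # The moderation endpoint flags hate, violence, self-harm, and similar content.
    flagged = openai.Moderation.create(input=message)["results"][0]["flagged"]
    if flagged:
        return "Let's keep things friendly -- ask me anything about data science!"
    return handle_chat(message)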

In essence, TraceTalk is more than a chatbot—it's the next step in interactive learning. Our commitment to comprehensive implementation and meticulous project management ensures not only accurate and relevant responses but a seamless, informative, and engaging user experience. Welcome to the future of conversation, and welcome to TraceTalk 🚀.

Example output (screenshots): an answer from TraceTalk, and an answer that includes Python code.

Installation Guide for TraceTalk

This guide will walk you through the steps necessary to install and run the "Chat-with-Document-s-using-ChatGPT-API-and-Text-Embedding" application.

Prerequisites

  • Git
  • Python 3.8 or above
  • Node.js v14 or above
  • npm v6.14 or above

Installation Steps

1. Clone the GitHub repository

Clone the repository from GitHub using the following command, remember to include the "--recursive" option to clone the submodules as well:

git clone --recursive https://github.com/Appointat/Chat-with-Document-s-using-ChatGPT-API-and-Text-Embedding.git

2. Create a ".env" file

Navigate into the cloned repository and create a new file named ".env". This file will contain your configuration variables. At the moment, you need to fill in the following three:

OPENAI_API_KEY=<Your OpenAI API Key>
QDRANT_URL=<Qdrant URL>
QDRANT_API_KEY=<Qdrant API Key>

For QDRANT_URL and QDRANT_API_KEY, you need to contact the project administrator to obtain them.

3. Install and run the backend

Navigate to the TraceTalk directory:

cd TraceTalk

Install the required Python packages:

pip install -r requirements.txt

Start the backend service:

python app.py

4. Install and run the frontend

Navigate to the chatbot-ui directory:

cd chatbot-ui

Install the required Node.js packages:

npm install

Start the frontend service:

npm run dev

5. Access the application

Finally, open your web browser and navigate to http://localhost:3000.

You should now be able to see and interact with the "Chat-with-Document-s-using-ChatGPT-API-and-Text-Embedding" application. Enjoy exploring!

Troubleshooting

If you run into any issues during the installation process, please feel free to contact the project administrator or raise an issue on the GitHub repository.

Tutorial

Tools

We will be using three tools in this tutorial:

  • OpenAI GPT-3, specifically the new ChatGPT API (gpt-3.5-turbo). Not because this model is any better than other models, but because it is cheaper ($0.002 / 1K tokens) and good enough for this use case.
  • Chroma, the AI-native open-source embedding database (i.e., vector search engine). Chroma is an easy-to-use vector database when used in conjunction with LangChain; otherwise, it's kind of unusable. If you want to deploy these types of applications in production, I recommend using Elasticsearch, because it has wide adoption and has been around for years. Not because Elasticsearch is better than its competitors, but because not many organizations like adding a new technology stack.
  • LangChain, a library that aims to assist developers in building applications that use Large Language Models (LLMs) by allowing them to integrate these models with other sources of computation or knowledge.

Data

We will be using the data from Project Gutenberg’s “Romeo and Juliet by William Shakespeare”, which consists of 55,985 tokens. This makes it a nicely sized dataset.
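If you want to verify the token count yourself, the tiktoken package (installed in the next section) can do it; cl100k_base is the encoding used by gpt-3.5-turbo and text-embedding-ada-002, and the exact count may vary slightly with the edition of the text:

import urllib.request

import tiktoken

url = "https://www.gutenberg.org/cache/epub/1513/pg1513.txt"
text = urllib.request.urlopen(url).read().decode("utf-8")

encoding = tiktoken.get_encoding("cl100k_base")
print(len(encoding.encode(text)))  # roughly 56,000 tokens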

Python code

Installation of packages:

%%writefile requirements.txt
openai
chromadb
langchain
tiktoken

!pip install -r requirements.txt

Note: administrator privileges may be required to install these packages.

Import Python Packages

import os
import platform

import openai
import chromadb
import langchain

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import TokenTextSplitter
from langchain.llms import OpenAI
from langchain.chains import ChatVectorDBChain
from langchain.document_loaders import GutenbergLoader

print('Python: ', platform.python_version())

Mount Google Drive on Colab

from google.colab import drive
drive.mount('/content/drive')

OpenAI API Key

os.environ["OPENAI_API_KEY"] = 'your openai api key'

Configure Chroma

Chroma uses two of my favorite technologies for its backend, DuckDB and Apache Parquet, but by default it uses an in-memory database. This is fine for this tutorial, but I want to give you the option of storing the database files on disk so you can reuse the database without paying to re-embed the documents every time.

persist_directory="/content/drive/My Drive/Colab Notebooks/chroma/romeo"

Convert Document to Embedding

Convert the document, i.e., the book, to vector embedding and store it in a vector search engine, i.e., a vector database.

def get_gutenberg(url):
    loader = GutenbergLoader(url)
    data = loader.load()
    return data

romeoandjuliet_data = get_gutenberg('https://www.gutenberg.org/cache/epub/1513/pg1513.txt')

text_splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=0)
romeoandjuliet_doc = text_splitter.split_documents(romeoandjuliet_data)

embeddings = OpenAIEmbeddings()
vectordb = Chroma.from_documents(romeoandjuliet_doc, embeddings, persist_directory=persist_directory)
vectordb.persist()
  • The first step is a bit self-explanatory, but it involves using ‘from langchain.document_loaders import GutenbergLoader’ to load a book from Project Gutenberg.
  • The second step is more involved. To obtain an embedding, we send the text, i.e., the book, to OpenAI's embeddings API endpoint along with the ID of an embedding model, e.g., text-embedding-ada-002. The response contains an embedding. However, since the book consists of 55,985 tokens, which exceeds the model's input limit, we use the 'text_splitter' utility (from 'langchain.text_splitter import TokenTextSplitter') to split the book into manageable 1,000-token chunks. A sample embedding response from OpenAI is illustrated after this list. If you're wondering, the pricing for the embedding model is $0.0004 / 1K tokens.
  • The third step is pretty straightforward: we store the embedding in Chroma, our vector search engine, and persist it on a file system.
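Here is the sample response promised above. The shape matches what OpenAI's embeddings endpoint returns; the specific numbers are illustrative.

import openai

response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="But soft, what light through yonder window breaks?",
)

# The response is a JSON object of roughly this shape:
# {
#   "object": "list",
#   "data": [{"object": "embedding", "index": 0,
#             "embedding": [0.0023, -0.0094, ...]}],
#   "model": "text-embedding-ada-002",
#   "usage": {"prompt_tokens": 12, "total_tokens": 12}
# }
print(len(response["data"][0]["embedding"]))  # 1536 dimensions for ada-002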

Configure LangChain QA

To configure LangChain QA with Chroma, use the OpenAI gpt-3.5-turbo chat model (model_name='gpt-3.5-turbo') and ensure that the response includes the intermediary result from the vector search engine, i.e., Chroma (set return_source_documents=True).

romeoandjuliet_qa = ChatVectorDBChain.from_llm(
    OpenAI(temperature=0, model_name="gpt-3.5-turbo"),
    vectordb,
    return_source_documents=True,
)

Questions & Answers with “Romeo and Juliet” Book

Generating questions and answers from the book is a straightforward process. To assess the accuracy of the results, I will be comparing the answers with those from SparkNotes.

SparkNotes Editors. "Romeo and Juliet." SparkNotes.com, SparkNotes LLC, 2005.
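For example, a single question-and-answer round trip looks like this (the question is illustrative):

query = "Who does Romeo love before he meets Juliet?"
result = romeoandjuliet_qa({"question": query, "chat_history": []})

print(result["answer"])  # expected: Rosaline

# Because return_source_documents=True, the retrieved chunks are also available:
for doc in result["source_documents"]:
    print(doc.page_content[:100])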

I hope you have enjoyed this simple tutorial. If you have any questions or comments, please raise them on the GitHub repository.

Resources

Modules of LangChain

There are several main modules that LangChain provides support for. For each module we provide some examples to get started, how-to guides, reference docs, and conceptual guides. These modules are, in increasing order of complexity:

  • Models: LLMs, Chat models, Text embedding models
  • Prompts: LLM prompt templates, Chat prompt templates, Example selectors, Output parsers
  • Indexes: The primary index and retrieval types supported by LangChain are currently centered around vector databases, so much of the functionality dives deep into those topics.
  • Memory: LangChain provides a standard interface for memory, a collection of memory implementations, and examples of chains/agents that use memory.
  • Chains: Chains go beyond a single LLM call: they are sequences of calls (whether to an LLM or a different utility). LangChain provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications (see the short example after this list).
  • Agents: Tools, Agents, Toolkits, Agent executors
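As a quick taste of the Chains module, here is a minimal sketch using the same LangChain-era API as the tutorial above; the prompt text is illustrative:

from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["topic"],
    template="Explain {topic} to a data science beginner in two sentences.",
)
chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)

print(chain.run(topic="vector embeddings"))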


License: MIT License

