peterw / Chat-with-Github-Repo

This repository contains two Python scripts that demonstrate how to create a chatbot using Streamlit, OpenAI GPT-3.5-turbo, and Activeloop's Deep Lake.


Can't replicate the intended behavior

rnabirov opened this issue · comments

After installing the app and scraping the repo you referred to in the demo (https://github.com/peterw/Gumroad-Landing-Page-Generator) I can't get the chat to analyze the repo.

This is my chat interaction using the same questions as in the demo. Looks like the repo data embeddings are not used properly in inferences.
[screenshot of the chat interaction]

This is the logging output in the terminal, not sure if it's relevant.

2023-04-27 19:05:51.248 Deep Lake Dataset in my_test_dataset already exists, loading from the storage
Dataset(path='my_test_dataset', read_only=True, tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype    shape    dtype  compression
  -------   -------  -------  -------  ------- 
 embedding  generic   (0,)    float32   None   
    ids      text     (0,)      str     None   
 metadata    json     (0,)      str     None   
   text      text     (0,)      str     None   
2023-04-27 19:05:51.255 `label` got an empty value. This is discouraged for accessibility reasons and may be disallowed in the future by raising an exception. Please provide a non-empty label and hide it with label_visibility if needed.

@rnabirov Can you try deleting the my_test_dataset folder, then running github.py followed by chat.py?

Did it a few times, same result.

What does the output from github.py look like?

Cloning into './gumroad'...
remote: Enumerating objects: 27, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 27 (delta 6), reused 11 (delta 2), pack-reused 0
Unpacking objects: 100% (27/27), done.
Created a chunk of size 1525, which is longer than the specified 1000
Created a chunk of size 1020, which is longer than the specified 1000
Created a chunk of size 1540, which is longer than the specified 1000
/Users/rnabirov/opt/anaconda3/lib/python3.8/site-packages/deeplake/util/check_latest_version.py:32: UserWarning: A newer version of deeplake (3.3.2) is available. It's recommended that you update to the latest version using `pip install -U deeplake`.
  warnings.warn(
Your Deep Lake dataset has been successfully created!
The dataset is private so make sure you are logged in!
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/rnabirov/my_test_repo3
hub://rnabirov/my_test_repo3 loaded successfully.
Evaluating ingest: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:23<00:00
Dataset(path='hub://rnabirov/my_test_repo3', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype     shape      dtype  compression
  -------   -------   -------    -------  ------- 
 embedding  generic  (56, 1536)  float32   None   
    ids      text     (56, 1)      str     None   
 metadata    json     (56, 1)      str     None   
   text      text     (56, 1)      str     None   

You are ingesting into a cloud dataset with github.py, but chat.py seems to be loading a local dataset. Can I see your .env file (with the API keys removed)?

here it is

OPENAI_API_KEY=""
ACTIVELOOP_TOKEN=""
DEEPLAKE_USERNAME=rnabirov
DEEPLAKE_DATASET_PATH=my_test_dataset
DEEPLAKE_REPO_NAME=my_test_repo3
REPO_URL=https://github.com/peterw/Gumroad-Landing-Page-Generator
SITE_TITLE="Repo analysis chat"

Probably chat.py loads an empty dataset? The whole my_test_dataset folder is 9000 bytes, and the tensor_meta.json files in it are 400 bytes at most.

Probably a dumb question: what's the point of downloading a dataset to the local machine when it's already available at Activeloop? The script makes outside connections to OpenAI anyway, so it might as well work with the remote dataset at Activeloop.

I got it working by pointing DEEPLAKE_DATASET_PATH in .env to the remote dataset that github.py created at Activeloop. Having separate variables DEEPLAKE_DATASET_PATH and DEEPLAKE_REPO_NAME for the same dataset was confusing for me; I'd suggest combining them.
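For reference, the fix described above amounts to pointing the path variable at the cloud dataset that github.py reported creating (hub://rnabirov/my_test_repo3 in the log earlier in this thread). A .env along these lines should make chat.py load the ingested dataset rather than the empty local one:

```ini
OPENAI_API_KEY=""
ACTIVELOOP_TOKEN=""
DEEPLAKE_USERNAME=rnabirov
# point directly at the cloud dataset github.py ingested into
DEEPLAKE_DATASET_PATH=hub://rnabirov/my_test_repo3
DEEPLAKE_REPO_NAME=my_test_repo3
REPO_URL=https://github.com/peterw/Gumroad-Landing-Page-Generator
SITE_TITLE="Repo analysis chat"
```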

Hey @peterw, I think we can close the issue and document the approach suggested by @rnabirov; I've gotten some questions on this myself too.
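The suggested combination could be sketched as a small helper (hypothetical, not part of the repo) that derives a single Deep Lake path from the existing variables, preferring the cloud dataset when a username and repo name are both set, so chat.py always loads the same dataset github.py ingested into:

```python
def resolve_dataset_path(env: dict) -> str:
    """Hypothetical helper: collapse DEEPLAKE_USERNAME, DEEPLAKE_REPO_NAME,
    and DEEPLAKE_DATASET_PATH into one Deep Lake path. When username and
    repo name are both present, build the hub:// path for the cloud
    dataset; otherwise fall back to the local path (the empty local
    dataset was the source of this issue)."""
    username = env.get("DEEPLAKE_USERNAME")
    repo = env.get("DEEPLAKE_REPO_NAME")
    if username and repo:
        return f"hub://{username}/{repo}"
    return env.get("DEEPLAKE_DATASET_PATH", "my_test_dataset")
```

With a .env like the one posted above, this would resolve to `hub://rnabirov/my_test_repo3` for both scripts, removing the ambiguity between the two variables.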