peterw / Chat-with-Github-Repo

This repository contains two Python scripts that demonstrate how to create a chatbot using Streamlit, OpenAI GPT-3.5-turbo, and Activeloop's Deep Lake.


Can't replicate the intended behavior

rnabirov opened this issue · comments

After installing the app and scraping the repo you referred to in the demo (https://github.com/peterw/Gumroad-Landing-Page-Generator) I can't get the chat to analyze the repo.

This is my chat interaction using the same questions as in the demo. Looks like the repo data embeddings are not used properly in inferences.
[screenshot of the chat interaction]

This is the logging output in the terminal, not sure if it's relevant.

2023-04-27 19:05:51.248 Deep Lake Dataset in my_test_dataset already exists, loading from the storage
Dataset(path='my_test_dataset', read_only=True, tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype    shape    dtype  compression
  -------   -------  -------  -------  ------- 
 embedding  generic   (0,)    float32   None   
    ids      text     (0,)      str     None   
 metadata    json     (0,)      str     None   
   text      text     (0,)      str     None   
2023-04-27 19:05:51.255 `label` got an empty value. This is discouraged for accessibility reasons and may be disallowed in the future by raising an exception. Please provide a non-empty label and hide it with label_visibility if needed.

@rnabirov Can you try deleting the my_test_dataset folder, then running github.py followed by chat.py?

Did it a few times, same result.

What does the output from github.py look like?

Cloning into './gumroad'...
remote: Enumerating objects: 27, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 27 (delta 6), reused 11 (delta 2), pack-reused 0
Unpacking objects: 100% (27/27), done.
Created a chunk of size 1525, which is longer than the specified 1000
Created a chunk of size 1020, which is longer than the specified 1000
Created a chunk of size 1540, which is longer than the specified 1000
/Users/rnabirov/opt/anaconda3/lib/python3.8/site-packages/deeplake/util/check_latest_version.py:32: UserWarning: A newer version of deeplake (3.3.2) is available. It's recommended that you update to the latest version using `pip install -U deeplake`.
  warnings.warn(
Your Deep Lake dataset has been successfully created!
The dataset is private so make sure you are logged in!
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/rnabirov/my_test_repo3
hub://rnabirov/my_test_repo3 loaded successfully.
Evaluating ingest: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:23<00:00
Dataset(path='hub://rnabirov/my_test_repo3', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype     shape      dtype  compression
  -------   -------   -------    -------  ------- 
 embedding  generic  (56, 1536)  float32   None   
    ids      text     (56, 1)      str     None   
 metadata    json     (56, 1)      str     None   
   text      text     (56, 1)      str     None   

You are ingesting into a cloud dataset with github.py, but chat.py seems to be loading a local dataset. Can I see your .env file (with the API keys removed)?

here it is

OPENAI_API_KEY=""
ACTIVELOOP_TOKEN=""
DEEPLAKE_USERNAME=rnabirov
DEEPLAKE_DATASET_PATH=my_test_dataset
DEEPLAKE_REPO_NAME=my_test_repo3
REPO_URL=https://github.com/peterw/Gumroad-Landing-Page-Generator
SITE_TITLE="Repo analysis chat"

Probably chat.py loads an empty dataset? The whole my_test_dataset folder is 9000 bytes, and the tensor_meta.json files in it are 400 bytes at most.

Probably a dumb question: what's the point of downloading a dataset to the local machine when it's already available at Activeloop? The script makes outside connections to OpenAI anyway, so it might as well work with the remote dataset at Activeloop.

I got it working by pointing DEEPLAKE_DATASET_PATH in .env to the remote dataset that github.py created at Activeloop. Having separate variables DEEPLAKE_DATASET_PATH and DEEPLAKE_REPO_NAME for the same dataset was confusing for me; I'd suggest combining them.
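For reference, the fix described above amounts to pointing the path variable at the cloud dataset that github.py reported creating (hub://rnabirov/my_test_repo3 in the log earlier in this thread). A .env along these lines should make chat.py load the ingested dataset rather than the empty local one:

```ini
OPENAI_API_KEY=""
ACTIVELOOP_TOKEN=""
DEEPLAKE_USERNAME=rnabirov
# point directly at the cloud dataset github.py ingested into
DEEPLAKE_DATASET_PATH=hub://rnabirov/my_test_repo3
DEEPLAKE_REPO_NAME=my_test_repo3
REPO_URL=https://github.com/peterw/Gumroad-Landing-Page-Generator
SITE_TITLE="Repo analysis chat"
```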

Hey @peterw, I think we can close the issue and document the approach suggested by @rnabirov; I've gotten some questions on this myself too.
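The suggested combination could be sketched as a small helper (hypothetical, not part of the repo) that derives a single Deep Lake path from the existing variables, preferring the cloud dataset when a username and repo name are both set, so chat.py always loads the same dataset github.py ingested into:

```python
def resolve_dataset_path(env: dict) -> str:
    """Hypothetical helper: collapse DEEPLAKE_USERNAME, DEEPLAKE_REPO_NAME,
    and DEEPLAKE_DATASET_PATH into one Deep Lake path. When username and
    repo name are both present, build the hub:// path for the cloud
    dataset; otherwise fall back to the local path (the empty local
    dataset was the source of this issue)."""
    username = env.get("DEEPLAKE_USERNAME")
    repo = env.get("DEEPLAKE_REPO_NAME")
    if username and repo:
        return f"hub://{username}/{repo}"
    return env.get("DEEPLAKE_DATASET_PATH", "my_test_dataset")
```

With a .env like the one posted above, this would resolve to `hub://rnabirov/my_test_repo3` for both scripts, removing the ambiguity between the two variables.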