Can't replicate the intended behavior
rnabirov opened this issue · comments
After installing the app and scraping the repo you referred to in the demo (https://github.com/peterw/Gumroad-Landing-Page-Generator) I can't get the chat to analyze the repo.
This is my chat interaction using the same questions as in the demo. It looks like the repo's embeddings are not being used during inference.
This is the logging output in the terminal; I'm not sure if it's relevant.
2023-04-27 19:05:51.248 Deep Lake Dataset in my_test_dataset already exists, loading from the storage
Dataset(path='my_test_dataset', read_only=True, tensors=['embedding', 'ids', 'metadata', 'text'])
tensor htype shape dtype compression
------- ------- ------- ------- -------
embedding generic (0,) float32 None
ids text (0,) str None
metadata json (0,) str None
text text (0,) str None
2023-04-27 19:05:51.255 `label` got an empty value. This is discouraged for accessibility reasons and may be disallowed in the future by raising an exception. Please provide a non-empty label and hide it with label_visibility if needed.
@rnabirov Can you try deleting the my_test_dataset folder, then run github.py followed by chat.py?
I did it a few times, same result.
What does the output from github.py look like?
Cloning into './gumroad'...
remote: Enumerating objects: 27, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 27 (delta 6), reused 11 (delta 2), pack-reused 0
Unpacking objects: 100% (27/27), done.
Created a chunk of size 1525, which is longer than the specified 1000
Created a chunk of size 1020, which is longer than the specified 1000
Created a chunk of size 1540, which is longer than the specified 1000
/Users/rnabirov/opt/anaconda3/lib/python3.8/site-packages/deeplake/util/check_latest_version.py:32: UserWarning: A newer version of deeplake (3.3.2) is available. It's recommended that you update to the latest version using `pip install -U deeplake`.
warnings.warn(
Your Deep Lake dataset has been successfully created!
The dataset is private so make sure you are logged in!
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/rnabirov/my_test_repo3
hub://rnabirov/my_test_repo3 loaded successfully.
Evaluating ingest: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:23<00:00
Dataset(path='hub://rnabirov/my_test_repo3', tensors=['embedding', 'ids', 'metadata', 'text'])
tensor htype shape dtype compression
------- ------- ------- ------- -------
embedding generic (56, 1536) float32 None
ids text (56, 1) str None
metadata json (56, 1) str None
text text (56, 1) str None
You are ingesting to a cloud dataset with github.py, but chat.py seems to be loading a local dataset. Can I see your .env file (after removing API keys)?
Here it is:
OPENAI_API_KEY=""
ACTIVELOOP_TOKEN=""
DEEPLAKE_USERNAME=rnabirov
DEEPLAKE_DATASET_PATH=my_test_dataset
DEEPLAKE_REPO_NAME=my_test_repo3
REPO_URL=https://github.com/peterw/Gumroad-Landing-Page-Generator
SITE_TITLE="Repo analysis chat"
Could it be that chat.py downloads an empty dataset? The whole my_test_dataset folder is 9000 bytes, and the tensor_meta.json files in it are 400 bytes at most.
Probably a dumb question, but what's the point of downloading the dataset to the local machine when it's already available at Activeloop? The script makes outside connections to OpenAI anyway, so it might as well work with the remote dataset at Activeloop.
I got it working by pointing DEEPLAKE_DATASET_PATH in .env to the remote dataset that github.py created at Activeloop. Having two separate variables, DEEPLAKE_DATASET_PATH and DEEPLAKE_REPO_NAME, for the same dataset was confusing; I'd suggest combining them.
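To make the suggestion concrete, here is a minimal sketch (not the project's actual code, just one way the two variables could be combined) of deriving a single dataset path from the .env values shown above. The `dataset_path` helper name is hypothetical; the `hub://<username>/<name>` format follows Deep Lake's cloud path convention.

```python
# Hypothetical sketch: resolve one dataset path from the .env variables,
# so github.py (which writes to the cloud) and chat.py (which reads)
# agree on the same location.

def dataset_path(env: dict) -> str:
    """Prefer an explicit hub:// path in DEEPLAKE_DATASET_PATH;
    otherwise build one from DEEPLAKE_USERNAME and DEEPLAKE_REPO_NAME."""
    path = env.get("DEEPLAKE_DATASET_PATH", "")
    if path.startswith("hub://"):
        return path
    user = env.get("DEEPLAKE_USERNAME")
    repo = env.get("DEEPLAKE_REPO_NAME")
    if user and repo:
        # Combine the two variables into the cloud path github.py used.
        return f"hub://{user}/{repo}"
    return path  # fall back to whatever local path was configured

env = {
    "DEEPLAKE_USERNAME": "rnabirov",
    "DEEPLAKE_DATASET_PATH": "my_test_dataset",
    "DEEPLAKE_REPO_NAME": "my_test_repo3",
}
print(dataset_path(env))  # hub://rnabirov/my_test_repo3
```

In .env terms, the fix I applied amounts to setting DEEPLAKE_DATASET_PATH=hub://rnabirov/my_test_repo3 so chat.py loads the same dataset github.py ingested to.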