transformerlab / transformerlab-app

Open Source Application for Advanced LLM Engineering: interact, train, fine-tune, and evaluate large language models on your own computer.

Home Page: https://transformerlab.ai/


Importing external dataset: system shows 0 data size even though it is imported and the data is 13 MB

delfireinoso opened this issue

After downloading a public dataset, databricks/databricks-dolly-15k, the Imported Datasets pane shows Size = 0, but on disk the data is complete.

All imported datasets are showing 0 size: Open-Orca/OpenOrca and tatsu-lab/alpaca from the Dataset Store, and databricks/databricks-dolly-15k, imported via the Hugging Face dialog.
In .cache these folders exist:

- Downloads, 4 GB
- tatsu-lab___alpaca, 46 MB
- Open-Orca___open_orca, 7.22 GB
- databricks___databricks-dolly-15k, 12.3 MB
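
A quick way to confirm those on-disk numbers is to sum the cache folders directly. This is a minimal sketch, assuming the default `datasets` cache location under `~/.cache/huggingface/datasets`; adjust the path if your cache lives elsewhere:

```python
import os

# Default Hugging Face datasets cache; adjust if yours is elsewhere.
CACHE = os.path.expanduser("~/.cache/huggingface/datasets")

def dir_size(path: str) -> int:
    """Total bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total

for folder in ("tatsu-lab___alpaca",
               "Open-Orca___open_orca",
               "databricks___databricks-dolly-15k"):
    path = os.path.join(CACHE, folder)
    if os.path.isdir(path):
        print(f"{folder}: {dir_size(path) / 1e6:.1f} MB")
```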

So I understand the data has downloaded to your computer, but I'm wondering if the app is not reading the dataset for some reason. Can you click on the Info button and confirm that there's valid data? When you click on Preview for the dataset, do you see sample data from the dataset?
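
The same check can also be done outside the app: if the files are already in the cache, the `datasets` library loads them without re-downloading, and printing a couple of rows is roughly what the Preview pane should show. A minimal sketch:

```python
from datasets import load_dataset

# Loads from the local cache if the files are already there.
ds = load_dataset("databricks/databricks-dolly-15k", split="train")
print(len(ds), "rows")

# Print the first two rows -- roughly what Preview displays.
for row in ds.select(range(2)):
    print(row)
```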

I can reproduce the issue where, if you enter a Hugging Face repo ID to download, it shows as 0 bytes even though it looks like the repo has downloaded correctly. Will update once I dig in more.

databricks/databricks-dolly-15k

| instruction | context | response | category |
| --- | --- | --- | --- |
| When did Virgin Australia start operating? | Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia’s domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney. | Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. | closed_qa |
| Which is a species of fish? Tope or Rope | | Tope | classification |
| Why can camels survive for long without water? | | Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time. | open_qa |

Open-Orca/OpenOrca


| id | system_prompt | question | response |
| --- | --- | --- | --- |
| niv.242684 | You will be given a definition of a task first, then some input of the task. | This task is about using the specified sentence and converting the sentence to Resource Description Framework (RDF) triplets of the form (subject, predicate object). The RDF triplets generated must be such that the triplets accurately capture the structure and semantics of the input sentence. The input is a sentence and the output is a list of triplets of the form [subject, predicate, object] that capture the relationships present in the sentence. When a sentence has more than 1 RDF triplet possible, the output must contain all of them.<br><br>AFC Ajax (amateurs)'s ground is Sportpark De Toekomst where Ajax Youth Academy also play.<br><br>Output: | [ [“AFC Ajax (amateurs)”, “has ground”, “Sportpark De Toekomst”], [“Ajax Youth Academy”, “plays at”, “Sportpark De Toekomst”] ] |
| flan.564327 | You are an AI assistant. You will be given a task. You must generate a detailed and long answer. | Generate an approximately fifteen-word sentence that describes all this data: Midsummer House eatType restaurant; Midsummer House food Chinese; Midsummer House priceRange moderate; Midsummer House customer rating 3 out of 5; Midsummer House near All Bar One | Midsummer House is a moderately priced Chinese restaurant with a 3/5 customer rating, located near All Bar One. |

tatsu-lab/alpaca


| instruction | input | output | text |
| --- | --- | --- | --- |
| Give three tips for staying healthy. | | 1. Eat a balanced diet and make sure to include plenty of fruits and vegetables.<br>2. Exercise regularly to keep your body active and strong.<br>3. Get enough sleep and maintain a consistent sleep schedule. | Below is an instruction that describes a task. Write a response that appropriately completes the request.<br><br>### Instruction:<br>Give three tips for staying healthy.<br><br>### Response:<br>1. Eat a balanced diet and make sure to include plenty of fruits and vegetables.<br>2. Exercise regularly to keep your body active and strong.<br>3. Get enough sleep and maintain a consistent sleep schedule. |
| What are the three primary colors? | | The three primary colors are red, blue, and yellow. | Below is an instruction that describes a task. Write a response that appropriately completes the request.<br><br>### Instruction:<br>What are the three primary colors?<br><br>### Response:<br>The three primary colors are red, blue, and yellow. |

These are the first rows of the preview, which listed only some rows, not the complete dataset. The first one is from Hugging Face, the other two from the Dataset Store.

OK, I've reproduced that when you download via the download-from-HuggingFace button, it is not saving the size correctly in the DB. The dataset should still work. But when I downloaded from the Dataset Store, it showed up with the correct size for me after the download was complete. You're saying you downloaded Orca and Alpaca from the store and they show a size of 0?

Yes, the two datasets from the Dataset Store show 0 size too.

It's not always this way.

I had a more accurate result on an earlier test. I have now run a test after wiping the data completely.

Perhaps if you download them after downloading the Hugging Face dataset, they go wrong.

I've re-downloaded databricks twice. Once it worked and once it did not. It seems the huggingface code we call to get the size sometimes doesn't have the data yet. I will see if I can find a pattern and either compensate or post something on the HF site.

I am trying to pull the number from huggingface but I guess sometimes huggingface doesn't supply it:
https://docs.databricks.com/en/_extras/notebooks/source/deep-learning/hugging-face-dataset-download.html

I don't know why it works sometimes and not others, so will have to build in a backup or alternative way to check!
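
For reference, a minimal sketch of that kind of fallback, assuming the `datasets` library: trust the size the Hub reports when it is present, and measure the cache folder on disk when it isn't. The function name and cache-path handling here are illustrative, not the app's actual code:

```python
import os
from datasets import load_dataset_builder

def dataset_size_bytes(repo_id: str, cache_dir: str | None = None) -> int:
    """Best-effort dataset size in bytes."""
    # 1) Size from Hub metadata -- sometimes None, which would explain
    #    the Size=0 rows in this issue.
    info = load_dataset_builder(repo_id).info
    if info.dataset_size:
        return info.dataset_size

    # 2) Fallback: measure the local cache folder. The folder name
    #    follows the builder's org___name layout and may be lower-cased
    #    (e.g. Open-Orca___open_orca), so this simple mapping is only
    #    an approximation.
    cache_dir = cache_dir or os.path.expanduser("~/.cache/huggingface/datasets")
    folder = os.path.join(cache_dir, repo_id.replace("/", "___"))
    total = 0
    for root, _dirs, files in os.walk(folder):
        total += sum(os.path.getsize(os.path.join(root, f)) for f in files)
    return total
```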

Added a check after the dataset is downloaded. Closing this, but will keep an eye open, as we are going to fix a few things related to datasets, and also add more datasets.