transformerlab / transformerlab-app

Open Source Application for Advanced LLM Engineering: interact, train, fine-tune, and evaluate large language models on your own computer.

Home Page: https://transformerlab.ai/


Importing external dataset: system shows 0 data size even though it is imported and the data is 13 MB

delfireinoso opened this issue

After downloading a public dataset, databricks/databricks-dolly-15k, the Imported Datasets pane shows Size = 0, but on disk the data is complete.

All imported datasets are showing 0 size: Open-Orca/OpenOrca and tatsu-lab/alpaca from the Dataset Store, and databricks/databricks-dolly-15k, imported via the Hugging Face dialog.
In .cache these folders exist:

- Downloads, 4 GB
- tatsu-lab___alpaca, 46 MB
- Open-Orca___open_orca, 7.22 GB
- databricks___databricks-dolly-15k, 12.3 MB
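
A quick way to confirm those on-disk numbers is to sum the cache folders directly. This is a minimal sketch, assuming the default `datasets` cache location under `~/.cache/huggingface/datasets`; adjust the path if your cache lives elsewhere:

```python
import os

# Default Hugging Face datasets cache; adjust if yours is elsewhere.
CACHE = os.path.expanduser("~/.cache/huggingface/datasets")

def dir_size(path: str) -> int:
    """Total bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total

for folder in ("tatsu-lab___alpaca",
               "Open-Orca___open_orca",
               "databricks___databricks-dolly-15k"):
    path = os.path.join(CACHE, folder)
    if os.path.isdir(path):
        print(f"{folder}: {dir_size(path) / 1e6:.1f} MB")
```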

So I understand the data has downloaded to your computer, but I'm wondering if the app is not reading the dataset for some reason. Can you click on the Info button and confirm that there's valid data? When you click on Preview for the dataset, do you see sample data from the dataset?
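
The same check can also be done outside the app: if the files are already in the cache, the `datasets` library loads them without re-downloading, and printing a couple of rows is roughly what the Preview pane should show. A minimal sketch:

```python
from datasets import load_dataset

# Loads from the local cache if the files are already there.
ds = load_dataset("databricks/databricks-dolly-15k", split="train")
print(len(ds), "rows")

# Print the first two rows -- roughly what Preview displays.
for row in ds.select(range(2)):
    print(row)
```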

I can reproduce the issue where, if you enter a Hugging Face repo ID to download, it shows as 0 bytes even though it looks like the repo has downloaded correctly. Will update once I dig in more.

databricks/databricks-dolly-15k

| instruction | context | response | category |
| --- | --- | --- | --- |
| When did Virgin Australia start operating? | Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia’s domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney. | Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. | closed_qa |
| Which is a species of fish? Tope or Rope | | Tope | classification |
| Why can camels survive for long without water? | | Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time. | open_qa |

Open-Orca/OpenOrca


| id | system_prompt | question | response |
| --- | --- | --- | --- |
| niv.242684 | You will be given a definition of a task first, then some input of the task. | This task is about using the specified sentence and converting the sentence to Resource Description Framework (RDF) triplets of the form (subject, predicate object). The RDF triplets generated must be such that the triplets accurately capture the structure and semantics of the input sentence. The input is a sentence and the output is a list of triplets of the form [subject, predicate, object] that capture the relationships present in the sentence. When a sentence has more than 1 RDF triplet possible, the output must contain all of them.<br><br>AFC Ajax (amateurs)'s ground is Sportpark De Toekomst where Ajax Youth Academy also play.<br><br>Output: | [ [“AFC Ajax (amateurs)”, “has ground”, “Sportpark De Toekomst”], [“Ajax Youth Academy”, “plays at”, “Sportpark De Toekomst”] ] |
| flan.564327 | You are an AI assistant. You will be given a task. You must generate a detailed and long answer. | Generate an approximately fifteen-word sentence that describes all this data: Midsummer House eatType restaurant; Midsummer House food Chinese; Midsummer House priceRange moderate; Midsummer House customer rating 3 out of 5; Midsummer House near All Bar One | Midsummer House is a moderately priced Chinese restaurant with a 3/5 customer rating, located near All Bar One. |

tatsu-lab/alpaca


| instruction | input | output | text |
| --- | --- | --- | --- |
| Give three tips for staying healthy. | | 1. Eat a balanced diet and make sure to include plenty of fruits and vegetables.<br>2. Exercise regularly to keep your body active and strong.<br>3. Get enough sleep and maintain a consistent sleep schedule. | Below is an instruction that describes a task. Write a response that appropriately completes the request.<br><br>### Instruction:<br>Give three tips for staying healthy.<br><br>### Response:<br>1. Eat a balanced diet and make sure to include plenty of fruits and vegetables.<br>2. Exercise regularly to keep your body active and strong.<br>3. Get enough sleep and maintain a consistent sleep schedule. |
| What are the three primary colors? | | The three primary colors are red, blue, and yellow. | Below is an instruction that describes a task. Write a response that appropriately completes the request.<br><br>### Instruction:<br>What are the three primary colors?<br><br>### Response:<br>The three primary colors are red, blue, and yellow. |

These are the first rows of the preview, which listed only some rows, not the complete dataset. The first one is from Hugging Face, the other two from the Dataset Store.

OK, I've reproduced that when you download via the download-from-HuggingFace button, it is not saving the size correctly in the DB. The dataset should still work. But when I downloaded from the Dataset Store, it showed up with the correct size for me after the download was complete. You're saying you downloaded Orca and Alpaca from the store and they show a size of 0?

Yes, the two datasets from the Dataset Store show 0 size too.

It's not always this way.

I had a more accurate result on an earlier test. I have now run a test after wiping the data completely.

Perhaps if you download them after downloading the Hugging Face dataset, they go wrong.

I've re-downloaded databricks twice. Once it worked and once it did not. It seems the huggingface code we call to get the size sometimes doesn't have the data yet. I will see if I can find a pattern and either compensate or post something on the HF site.

I am trying to pull the number from huggingface but I guess sometimes huggingface doesn't supply it:
https://docs.databricks.com/en/_extras/notebooks/source/deep-learning/hugging-face-dataset-download.html

I don't know why it works sometimes and not others, so will have to build in a backup or alternative way to check!
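
For reference, a minimal sketch of that kind of fallback, assuming the `datasets` library: trust the size the Hub reports when it is present, and measure the cache folder on disk when it isn't. The function name and cache-path handling here are illustrative, not the app's actual code:

```python
import os
from datasets import load_dataset_builder

def dataset_size_bytes(repo_id: str, cache_dir: str | None = None) -> int:
    """Best-effort dataset size in bytes."""
    # 1) Size from Hub metadata -- sometimes None, which would explain
    #    the Size=0 rows in this issue.
    info = load_dataset_builder(repo_id).info
    if info.dataset_size:
        return info.dataset_size

    # 2) Fallback: measure the local cache folder. The folder name
    #    follows the builder's org___name layout and may be lower-cased
    #    (e.g. Open-Orca___open_orca), so this simple mapping is only
    #    an approximation.
    cache_dir = cache_dir or os.path.expanduser("~/.cache/huggingface/datasets")
    folder = os.path.join(cache_dir, repo_id.replace("/", "___"))
    total = 0
    for root, _dirs, files in os.walk(folder):
        total += sum(os.path.getsize(os.path.join(root, f)) for f in files)
    return total
```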

Added a check after the dataset is downloaded. Closing this, but will keep an eye open, as we are going to fix a few things related to datasets, and also add more datasets.