huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Home Page: https://huggingface.co/docs/datasets

Support the deserialization of json lines files comprised of lists

umarbutler opened this issue

Feature request

I manage a somewhat large and popular Hugging Face dataset known as the Open Australian Legal Corpus. I recently updated my corpus so that it is stored in a json lines file where each line is an array and each element represents the value of a particular column. Previously, my corpus was stored as a json lines file where each line was a dictionary whose keys were the field names.

Essentially, a line in my json lines file used to look like this:

{"version_id":"","type":"","jurisdiction":"","source":"","citation":"","url":"","when_scraped":"","text":""}

And now it looks like this:

["","","","","","","",""]

This saves 65 bytes per document and allows me to serialise and deserialise documents very quickly via msgspec.
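As a sketch of that round trip (the field names come from the example line above; the placeholder values and variable names are purely illustrative), msgspec can encode a record as an array and decode it back against a known column order:

import msgspec

# Column order matching the example line above.
FIELDS = ["version_id", "type", "jurisdiction", "source", "citation",
          "url", "when_scraped", "text"]

encoder = msgspec.json.Encoder()
decoder = msgspec.json.Decoder(list[str])

# Encode one document (a dict keyed by field name) as a single json lines row.
doc = {field: "" for field in FIELDS}  # placeholder values
line = encoder.encode([doc[field] for field in FIELDS])  # b'["","","",...]'

# Decode the row and zip it back onto the column names.
record = dict(zip(FIELDS, decoder.decode(line)))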

After making this change, I found that datasets was incapable of deserialising my corpus without a custom loading script, even though I ensured that the dataset_info field in my dataset card contained the desired names of my features.

I would like to request that functionality be added to support this format, which is more memory-efficient and faster than using dictionaries.
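For what it is worth, the only workaround I can see without a loading script is to rebuild the named columns myself from the array rows, for example with a generator. A rough sketch, assuming a local copy of the file at corpus.jsonl (the path and helper name are hypothetical):

import json

from datasets import Dataset

FIELDS = ["version_id", "type", "jurisdiction", "source", "citation",
          "url", "when_scraped", "text"]

def rows(path="corpus.jsonl"):  # hypothetical local copy of the corpus
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Each line is a JSON array; map it back onto the column names.
            yield dict(zip(FIELDS, json.loads(line)))

# Builds a Dataset with named features, without a dataset loading script.
ds = Dataset.from_generator(rows)

This works locally but does not help users who simply call load_dataset on the Hub repository, which is why built-in support would still be needed.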

Motivation

The documentation for creating dataset loading scripts asserts that:

In the next major release, the new safety features of 🤗 Datasets will disable running dataset loading scripts by default, and you will have to pass trust_remote_code=True to load datasets that require running a dataset script.

I would rather not require my users to pass trust_remote_code=True, which means I will need built-in support for this format.
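To illustrate, if my corpus were backed by a loading script, every user would have to opt in along these lines (assuming the Hub repository id is umarbutler/open-australian-legal-corpus):

from datasets import load_dataset

# With a dataset loading script, every user must explicitly opt in to
# running remote code; built-in support for array rows would avoid this.
ds = load_dataset(
    "umarbutler/open-australian-legal-corpus",  # assumed repository id
    trust_remote_code=True,
)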

Your contribution

I would be happy to submit a PR for this if it is something you would be willing to incorporate into datasets and if someone can point me to where the code would need to go.

Update: I ended up deciding to go back to using lines of dictionaries instead of arrays. This was not because of this issue, since my users would still be able to download my corpus without datasets, but because the speed and storage savings are not currently worth breaking my API and harming the backwards compatibility of each new revision.

With that said, for a static dataset that is not regularly updated (unlike mine), and particularly for extremely large datasets with millions or billions of rows, using arrays could have a meaningful impact, so there is probably still value in supporting this structure, provided the effort involved is not too great.