huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Home Page: https://huggingface.co/docs/datasets

Unable to load JSON saved using `to_json`

DarshanDeshpande opened this issue

Describe the bug

Datasets saved to a JSON file with Dataset.to_json cannot be loaded using json.load().

Steps to reproduce the bug

import json
from datasets import load_dataset

dataset = load_dataset("squad")
train_dataset, test_dataset = dataset["train"], dataset["validation"]
test_dataset.to_json("full_dataset.json")

# This works
loaded_test = load_dataset("json", data_files="full_dataset.json")

# This fails with json.JSONDecodeError
with open("full_dataset.json", "r") as f:
    loaded_test = json.load(f)

Expected behavior

The JSON should be correctly formatted when writing so that it can be loaded using json.load().

Environment info

Colab: https://colab.research.google.com/drive/1st1iStFUVgu9ZPvnzSzL4vDeYWDwYpUm?usp=sharing

Hi @DarshanDeshpande,

Please note that the default format of the method Dataset.to_json is JSON Lines: it passes orient="records", lines=True to pandas.DataFrame.to_json. This format is especially useful for large datasets: unlike a regular JSON file, which must be parsed as a single document, a JSON Lines file stores one JSON object per line, so it can be read iteratively, in batches, without loading all the data into memory at once.
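As a rough illustration of the point above, here is a minimal standard-library sketch. The records, the file name `sample.jsonl`, and the helper `read_batches` are hypothetical and not part of the datasets library; the sketch only shows why `json.load` rejects a JSON Lines file and how such a file can be consumed a few rows at a time.

```python
import json
import os
import tempfile
from itertools import islice

# Hypothetical records standing in for dataset rows.
records = [{"id": i, "question": f"q{i}"} for i in range(5)]

# Write them in JSON Lines form: one JSON object per line.
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# json.load expects a single JSON document, so the second line is "extra data".
try:
    with open(path, "r", encoding="utf-8") as f:
        json.load(f)
    load_failed = False
except json.JSONDecodeError:
    load_failed = True


def read_batches(file_path, batch_size):
    """Yield lists of at most batch_size parsed records, reading lazily."""
    with open(file_path, "r", encoding="utf-8") as f:
        while True:
            batch = [json.loads(line) for line in islice(f, batch_size)]
            if not batch:
                return
            yield batch


# Only one batch of rows is held in memory at any time.
batch_sizes = [len(batch) for batch in read_batches(path, batch_size=2)]
```

With five rows and a batch size of two, the generator produces two full batches and one partial batch.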

To read this file with the json library, parse it line by line:

with open("full_dataset.json", "r") as f:
    data = [json.loads(line) for line in f]
len(data)
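If a single regular JSON document is really needed (so that `json.load` works), one possible approach is to rewrite the JSON Lines file as a JSON array. The sketch below uses only the standard library; the helper name `jsonl_to_json_array` and the file names are hypothetical, and it assumes the whole file fits in memory.

```python
import json
import os
import tempfile


def jsonl_to_json_array(src_path, dst_path):
    """Rewrite a JSON Lines file as one JSON array; return the row count."""
    with open(src_path, "r", encoding="utf-8") as src:
        # Skip blank lines so a trailing newline does not break parsing.
        rows = [json.loads(line) for line in src if line.strip()]
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(rows, dst)
    return len(rows)


# Small self-contained demo with hypothetical rows.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "rows.jsonl")
dst_path = os.path.join(workdir, "rows.json")
with open(src_path, "w", encoding="utf-8") as f:
    f.write('{"id": 0}\n{"id": 1}\n')

count = jsonl_to_json_array(src_path, dst_path)

# The converted file is a single JSON document, so json.load accepts it.
with open(dst_path, "r", encoding="utf-8") as f:
    loaded = json.load(f)
```

This gives up the main benefit of JSON Lines, since the array must be parsed in one piece, so it is probably only sensible for small files.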

Maybe we should explain this better in our docs.

Now we explain this better in our docs: