Unable to load JSON saved using `to_json`

Question

DarshanDeshpande opened this issue 3 months ago

Describe the bug

Datasets saved with Dataset.to_json cannot be loaded using json.load().

Steps to reproduce the bug

import json
from datasets import load_dataset

dataset = load_dataset("squad")
train_dataset, test_dataset = dataset["train"], dataset["validation"]
test_dataset.to_json("full_dataset.json")

# This works
loaded_test = load_dataset("json", data_files="full_dataset.json")

# This fails with json.JSONDecodeError
with open("full_dataset.json", "r") as f:
    loaded_test = json.load(f)

Expected behavior

The JSON should be correctly formatted when writing so that it can be loaded using json.load().

Environment info

Colab: https://colab.research.google.com/drive/1st1iStFUVgu9ZPvnzSzL4vDeYWDwYpUm?usp=sharing

Albert Villanova del Moral · Answer 1 · Sun May 12 2024 14:39:48 GMT+0800 (China Standard Time)

Hi @DarshanDeshpande,

Please note that Dataset.to_json writes JSON Lines by default: it passes orient="records", lines=True to pandas.DataFrame.to_json. This format is especially useful for large datasets because, unlike a regular JSON file, it does not have to be loaded into memory all at once; it can be read iteratively, in batches.
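
As a rough illustration of the batch-wise reading mentioned above, here is a minimal sketch that uses only the standard library. The helper name iter_jsonl_batches and the sample records are made up for this example and are not part of the datasets API.

```python
import io
import itertools
import json


def iter_jsonl_batches(lines, batch_size):
    """Hypothetical helper: yield lists of parsed records, batch_size at a time."""
    # Parse lazily, skipping blank lines, so only one batch is held in memory.
    parsed = (json.loads(line) for line in lines if line.strip())
    while True:
        batch = list(itertools.islice(parsed, batch_size))
        if not batch:
            return
        yield batch


# Small in-memory stand-in for a JSON Lines file (made-up records).
sample = io.StringIO('{"id": 1}\n{"id": 2}\n{"id": 3}\n')
batch_sizes = [len(batch) for batch in iter_jsonl_batches(sample, batch_size=2)]
print(batch_sizes)
```

The same helper should work on a real file by passing an open file object instead of the io.StringIO stand-in.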

To read this file with the json library, parse it line by line:

import json

# Each line of the file is a separate JSON object
with open("full_dataset.json", "r") as f:
    data = [json.loads(line) for line in f]
len(data)
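
If a file that json.load() can read directly is really needed, one option is to rewrite the JSON Lines file as a single JSON array. The following is only a sketch: jsonl_to_json_array is a hypothetical helper, and the demo uses a small temporary file with made-up records rather than full_dataset.json.

```python
import json
import os
import tempfile


def jsonl_to_json_array(src_path, dst_path):
    """Hypothetical helper: rewrite a JSON Lines file as one JSON array."""
    with open(src_path, "r", encoding="utf-8") as src:
        records = [json.loads(line) for line in src if line.strip()]
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(records, dst)
    return len(records)


# Demo on a small temporary file (made-up records).
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "sample.jsonl")
dst_path = os.path.join(tmp_dir, "sample.json")
with open(src_path, "w", encoding="utf-8") as f:
    f.write('{"id": 1}\n{"id": 2}\n')

count = jsonl_to_json_array(src_path, dst_path)

# The converted file is a regular JSON document, so json.load() works on it.
with open(dst_path, "r", encoding="utf-8") as f:
    loaded = json.load(f)
```

Note that this conversion holds every record in memory at once, which gives up the main advantage of JSON Lines for large datasets.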

Maybe we should explain this better in our docs.

Albert Villanova del Moral · Answer 2 · Thu May 16 2024 22:32:54 GMT+0800 (China Standard Time)

We now explain this better in our docs:

#6895