Unable to load JSON saved using `to_json`
DarshanDeshpande opened this issue
Describe the bug
Datasets saved in JSON format with Dataset.to_json cannot be loaded using json.load().
Steps to reproduce the bug
import json
from datasets import load_dataset
dataset = load_dataset("squad")
train_dataset, test_dataset = dataset["train"], dataset["validation"]
test_dataset.to_json("full_dataset.json")
# This works
loaded_test = load_dataset("json", data_files="full_dataset.json")
# This fails
loaded_test = json.load(open("full_dataset.json", "r"))
Expected behavior
The JSON should be correctly formatted when writing so that it can be loaded using json.load().
Environment info
Colab: https://colab.research.google.com/drive/1st1iStFUVgu9ZPvnzSzL4vDeYWDwYpUm?usp=sharing
Please note that the default format of the method Dataset.to_json is JSON Lines: it passes orient="records", lines=True to pandas.DataFrame.to_json. This format is especially useful for large datasets: unlike a regular JSON file, it does not have to be loaded into memory all at once and can instead be read iteratively, in batches.
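To illustrate why json.load() rejects such a file, here is a minimal sketch that uses only the standard library. The sample records and the temporary file path are made up for illustration and are not taken from SQuAD. The file is written by hand, one JSON object per line, to mimic the JSON Lines layout described above.

```python
import json
import os
import tempfile

# Hypothetical sample records standing in for dataset rows.
records = [{"id": 1, "question": "Who?"}, {"id": 2, "question": "When?"}]

# Write one JSON object per line, which is the JSON Lines layout.
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# json.load() expects exactly one JSON document in the whole file,
# so a second object on the next line makes it raise JSONDecodeError.
try:
    with open(path) as f:
        json.load(f)
    whole_file_parsed = True
except json.JSONDecodeError:
    whole_file_parsed = False

# Parsing each line as its own document recovers every record.
with open(path) as f:
    parsed = [json.loads(line) for line in f]
```

Each line is valid JSON on its own, but the file as a whole is not a single JSON document, which is what json.load() requires.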
In order to read this file using the json library, you should parse it line by line:
with open("full_dataset.json", "r") as f:
    # Each line is one complete JSON document (one dataset row).
    data = [json.loads(line) for line in f]
len(data)
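If the file is too large to hold in a single list, the same idea can be applied in batches. The following is only a sketch: iter_json_lines_batches and its batch_size parameter are hypothetical names introduced here, not part of the datasets library.

```python
import json
from itertools import islice


def iter_json_lines_batches(path, batch_size=1000):
    """Yield lists of up to batch_size parsed records from a JSON Lines file.

    Hypothetical helper for illustration; not part of the datasets library.
    """
    with open(path, "r") as f:
        while True:
            # islice pulls at most batch_size lines without reading the rest.
            lines = list(islice(f, batch_size))
            if not lines:
                break
            # Skip blank lines, such as a trailing empty line at end of file.
            yield [json.loads(line) for line in lines if line.strip()]
```

For example, calling iter_json_lines_batches("full_dataset.json", batch_size=500) in a for loop would process 500 rows at a time, so only one batch is kept in memory at once.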
Maybe we should explain this better in our docs.
Now we explain this better in our docs: