huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Home Page: https://huggingface.co/docs/datasets

List of dictionary features get standardized

sohamparikh opened this issue

Describe the bug

Hi, I'm trying to create a HF dataset from a list using Dataset.from_list.

Each sample in the list is a dict with the same keys (which will be my features). The value of each feature is a list of dictionaries, and each of those dictionaries has a different set of keys. However, the datasets library standardizes all dictionaries under a feature: every dictionary ends up with every key that appears in any dictionary under that feature, and the keys it did not originally have are filled with None.
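The effect described above can be modeled in plain Python. This is only a sketch of the observed behavior, not the library's actual code; the helper name `standardize` is hypothetical, and it looks at a single list of dictionaries, whereas the report says the library considers all dictionaries under the feature.

```python
def standardize(dicts):
    """Model of the reported behavior: every dict gets the union of all keys,
    and keys it did not originally have are filled with None."""
    # Collect the union of keys, preserving first-seen order.
    all_keys = []
    for d in dicts:
        for key in d:
            if key not in all_keys:
                all_keys.append(key)
    # Rebuild each dict over the full key set; missing keys become None.
    return [{key: d.get(key) for key in all_keys} for d in dicts]


feature_1 = [{"key1": "value1"}, {"key2": "value2"}]
standardized = standardize(feature_1)
print(standardized)
```

One consequence of this model is that a key that was never present cannot be told apart from a key that was explicitly set to None.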

How can I keep the same set of keys as in the original list for each dictionary under a feature?

Steps to reproduce the bug

from datasets import Dataset

# Build one sample whose "feature_1" value is a list of dicts with different keys
def generate_sample():
    sample_data = {
        "text": "Sample text",
        "feature_1": []
    }

    # Each dictionary in feature_1 has its own set of keys
    feature_1 = [{"key1": "value1"}, {"key2": "value2"}]
    sample_data["feature_1"].extend(feature_1)

    return sample_data

# Generate multiple samples
num_samples = 10
samples = [generate_sample() for _ in range(num_samples)]

# Create a Hugging Face Dataset
dataset = Dataset.from_list(samples)
dataset[0]

{'text': 'Sample text', 'feature_1': [{'key1': 'value1', 'key2': None}, {'key1': None, 'key2': 'value2'}]}

Expected behavior

{'text': 'Sample text', 'feature_1': [{'key1': 'value1'}, {'key2': 'value2'}]}
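One possible way to get this shape back, sketched here as an assumption rather than a confirmed fix, is to store each inner dictionary as a JSON string so that the feature becomes a list of plain strings, then decode it when reading rows. The helper names `encode_feature` and `decode_feature` are hypothetical and are not part of the datasets API.

```python
import json


def encode_feature(sample, feature="feature_1"):
    """Return a copy of the sample with each inner dict serialized to a JSON string."""
    encoded = dict(sample)
    encoded[feature] = [json.dumps(d) for d in sample[feature]]
    return encoded


def decode_feature(row, feature="feature_1"):
    """Return a copy of the row with each JSON string parsed back into a dict."""
    decoded = dict(row)
    decoded[feature] = [json.loads(s) for s in row[feature]]
    return decoded


sample = {
    "text": "Sample text",
    "feature_1": [{"key1": "value1"}, {"key2": "value2"}],
}

# After encoding, the feature holds only strings, so there are no
# per-dictionary keys left to be unified.
encoded = encode_feature(sample)

# Decoding restores each dictionary with exactly its original keys.
restored = decode_feature(encoded)
print(restored)
```

The encoded samples would be what gets passed to Dataset.from_list, with decode_feature applied to each row on access. The trade-off is that the inner keys are no longer visible as typed columns.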

Environment info

  • datasets version: 2.19.1
  • Platform: Linux-5.15.0-1040-nvidia-x86_64-with-glibc2.35
  • Python version: 3.10.13
  • huggingface_hub version: 0.23.0
  • PyArrow version: 15.0.0
  • Pandas version: 2.2.0
  • fsspec version: 2023.10.0