huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Home Page: https://huggingface.co/docs/datasets


`push_to_hub()` - Prevent Automatic Generation of Splits

jetlime opened this issue

Describe the bug

I currently have a dataset that has not been split. When I push it to my Hugging Face dataset repository, it is split into a training set and a test set. How can I prevent this split from happening?

Steps to reproduce the bug

  1. Have an unsplit dataset:

     ```
     Dataset({
         features: ['input', 'output', 'Attack', '__index_level_0__'],
         num_rows: 944685
     })
     ```

  2. Push it to the Hugging Face Hub:

     ```python
     dataset.push_to_hub(dataset_name)
     ```

  3. On the Hugging Face dataset repo, the dataset then appears to be split:

(screenshot: the Hub dataset viewer shows `train` and `test` splits)

  4. Indeed, when loading the dataset from this repo, it is split into training and test sets:

     ```python
     from datasets import load_dataset

     dataset = load_dataset("Jetlime/NF-CSE-CIC-IDS2018-v2", streaming=True)
     dataset
     ```

output:

```
IterableDatasetDict({
    train: IterableDataset({
        features: ['input', 'output', 'Attack', '__index_level_0__'],
        n_shards: 2
    })
    test: IterableDataset({
        features: ['input', 'output', 'Attack', '__index_level_0__'],
        n_shards: 1
    })
})
```
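
For anyone hitting the same behavior, here is a minimal sketch of a possible workaround, not a confirmed fix: `Dataset.push_to_hub()` accepts a `split` argument, so the whole dataset can be pushed under a single, explicit split name. The repo id and stand-in rows below are hypothetical.

```python
from datasets import Dataset

# Minimal sketch, assuming the unexpected "test" split comes from how the
# data was pushed. `split` is a parameter of Dataset.push_to_hub(); pushing
# under an explicit name should leave the repo with a single "train" split.
dataset = Dataset.from_dict(
    {"input": ["x"], "output": ["y"], "Attack": ["z"]}  # stand-in rows
)
dataset.push_to_hub("your-username/your-dataset", split="train")  # hypothetical repo id
```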

Expected behavior

The dataset should not be split, since no split was requested.
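
On the loading side, a single split can also be selected explicitly as a stopgap, using the standard `split` argument of `load_dataset()`:

```python
from datasets import load_dataset

# Stopgap sketch: stream only the train split and ignore the unexpected
# "test" split that appeared on the Hub.
dataset = load_dataset("Jetlime/NF-CSE-CIC-IDS2018-v2", split="train", streaming=True)
```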

Environment info

  • datasets version: 2.19.1
  • Platform: Linux-6.2.0-35-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • huggingface_hub version: 0.23.0
  • PyArrow version: 15.0.2
  • Pandas version: 2.2.2
  • fsspec version: 2024.3.1