huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Home Page: https://huggingface.co/docs/datasets


`push_to_hub()` - Prevent Automatic Generation of Splits

jetlime opened this issue

Describe the bug

I currently have a dataset that has not been split. When I push it to my Hugging Face dataset repository, it is split into a training set and a test set. How can I prevent this split from happening?

Steps to reproduce the bug

  1. Have an unsplit dataset:

     ```
     Dataset({
         features: ['input', 'output', 'Attack', '__index_level_0__'],
         num_rows: 944685
     })
     ```

  2. Push it to the Hugging Face Hub:

     ```python
     dataset.push_to_hub(dataset_name)
     ```

  3. On the Hugging Face dataset repo, the dataset then appears to be split:

(screenshot: the Hub dataset viewer shows `train` and `test` splits)

  4. Indeed, when loading the dataset from this repo, it is split into training and test sets:

     ```python
     from datasets import load_dataset

     dataset = load_dataset("Jetlime/NF-CSE-CIC-IDS2018-v2", streaming=True)
     dataset
     ```

output:

```
IterableDatasetDict({
    train: IterableDataset({
        features: ['input', 'output', 'Attack', '__index_level_0__'],
        n_shards: 2
    })
    test: IterableDataset({
        features: ['input', 'output', 'Attack', '__index_level_0__'],
        n_shards: 1
    })
})
```
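
For anyone hitting the same behavior, here is a minimal sketch of a possible workaround, not a confirmed fix: `Dataset.push_to_hub()` accepts a `split` argument, so the whole dataset can be pushed under a single, explicit split name. The repo id and stand-in rows below are hypothetical.

```python
from datasets import Dataset

# Minimal sketch, assuming the unexpected "test" split comes from how the
# data was pushed. `split` is a parameter of Dataset.push_to_hub(); pushing
# under an explicit name should leave the repo with a single "train" split.
dataset = Dataset.from_dict(
    {"input": ["x"], "output": ["y"], "Attack": ["z"]}  # stand-in rows
)
dataset.push_to_hub("your-username/your-dataset", split="train")  # hypothetical repo id
```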

Expected behavior

The dataset should not be split, since no split was requested.
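
On the loading side, a single split can also be selected explicitly as a stopgap, using the standard `split` argument of `load_dataset()`:

```python
from datasets import load_dataset

# Stopgap sketch: stream only the train split and ignore the unexpected
# "test" split that appeared on the Hub.
dataset = load_dataset("Jetlime/NF-CSE-CIC-IDS2018-v2", split="train", streaming=True)
```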

Environment info

  • datasets version: 2.19.1
  • Platform: Linux-6.2.0-35-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • huggingface_hub version: 0.23.0
  • PyArrow version: 15.0.2
  • Pandas version: 2.2.2
  • fsspec version: 2024.3.1