`push_to_hub()` - Prevent Automatic Generation of Splits
jetlime opened this issue
Describe the bug
I currently have a dataset that has not been split. When I push it to my Hugging Face dataset repository, it gets split into a training and a testing set. How can I prevent this split from happening?
Steps to reproduce the bug
- Have an unsplit dataset:
  ```
  Dataset({
      features: ['input', 'output', 'Attack', '__index_level_0__'],
      num_rows: 944685
  })
  ```
- Push it to the Hugging Face Hub:
  ```python
  dataset.push_to_hub(dataset_name)
  ```
- On the Hugging Face dataset repo, the dataset then appears split into train and test sets.
- Indeed, when loading the dataset from this repo, it comes back split into a training and a testing set:
```python
from datasets import load_dataset, Dataset

dataset = load_dataset("Jetlime/NF-CSE-CIC-IDS2018-v2", streaming=True)
dataset
```
Output:
```
IterableDatasetDict({
    train: IterableDataset({
        features: ['input', 'output', 'Attack', '__index_level_0__'],
        n_shards: 2
    }),
    test: IterableDataset({
        features: ['input', 'output', 'Attack', '__index_level_0__'],
        n_shards: 1
    })
})
```
Expected behavior
The dataset should not be split, since no split was requested.
Environment info
- `datasets` version: 2.19.1
- Platform: Linux-6.2.0-35-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- `huggingface_hub` version: 0.23.0
- PyArrow version: 15.0.2
- Pandas version: 2.2.2
- `fsspec` version: 2024.3.1