A tool that creates random data splits from text files in which each line is one data sample. A split is defined either as a percentage of the lines (--split_percentage) or as an absolute number of samples (--split_samples). The input can be a local text file or a URL to a text file. If a URL is provided, the file is downloaded unless it is already present in the cache directory; optionally, the cached copy can be ignored so that the file is downloaded again. For reproducibility, a random seed is set before the random numbers that select the lines are drawn.
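The selection step described above can be sketched as follows. This is a minimal illustration and not the code in create_split.py: the function name sample_lines, its parameters, the rounding rule, and the choice to keep the selected lines in their original order are all assumptions. The default seed of 48 is inferred from the "seed48" suffix in the example output names.

```python
import random


def sample_lines(lines, percentage=None, n_samples=None, seed=48):
    """Return a reproducible random subset of `lines` (hypothetical helper).

    Exactly one of `percentage` (0-100) or `n_samples` must be given.
    """
    if (percentage is None) == (n_samples is None):
        raise ValueError("specify exactly one of percentage or n_samples")
    if percentage is not None:
        # Assumption: the percentage is rounded to the nearest whole line.
        n_samples = round(len(lines) * percentage / 100)
    n_samples = min(n_samples, len(lines))

    # A dedicated generator seeded up front makes the draw repeatable.
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(lines)), n_samples))
    return [lines[i] for i in chosen]
```

Calling sample_lines twice with the same input and seed returns the same lines, which is the property the seed is meant to guarantee.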
conda install --yes --file requirements.txt
or
pip install -r requirements.txt
python create_split.py --input_file_or_path https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/resolve/main/wiki1m_for_simcse.txt --split_percentage 1
The output is written to a file named:
wiki1m_for_simcse_001.00percent_seed48.txt
python create_split.py --input_file_or_path wikipedia-en --split_samples 1000
The output is written to a file named:
wikipedia-en_1K_samples_seed48.txt
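The two output names shown above suggest a naming pattern: the input's base name, a zero-padded percentage or a sample count, and the seed. The sketch below reproduces those two names only. The function output_name is hypothetical, and how the tool names counts that are not a multiple of 1000 is not documented here, so that branch is a guess.

```python
from pathlib import Path
from urllib.parse import urlparse


def output_name(input_file_or_path, seed=48, percentage=None, n_samples=None):
    """Build an output file name matching the two documented examples."""
    # Works for local names and URLs alike: take the last path component.
    stem = Path(urlparse(input_file_or_path).path).stem
    if percentage is not None:
        tag = f"{percentage:06.2f}percent"  # 1 -> "001.00percent"
    elif n_samples % 1000 == 0:
        tag = f"{n_samples // 1000}K_samples"  # 1000 -> "1K_samples"
    else:
        tag = f"{n_samples}_samples"  # guess: format is undocumented
    return f"{stem}_{tag}_seed{seed}.txt"
```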
python create_split.py --input_file_or_path https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/resolve/main/wiki1m_for_simcse.txt \
--split_percentage 1 \
--ignore_cache
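The caching behaviour behind --ignore_cache can be illustrated with a short sketch. It assumes the cache is a plain directory keyed by the file name in the URL; the function fetch_cached and the default directory name "cache" are hypothetical and may differ from what create_split.py does.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def fetch_cached(url, cache_dir="cache", ignore_cache=False):
    """Download `url` into `cache_dir` unless a cached copy can be reused."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the cached file keeps the last component of the URL path.
    target = cache_dir / Path(urlparse(url).path).name
    if ignore_cache or not target.exists():
        urllib.request.urlretrieve(url, str(target))
    return target
```

With ignore_cache=False a second call returns the existing file without touching the network; with ignore_cache=True the file is fetched again and overwritten.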