buffer is too small for requested array
MLlove0402 opened this issue · comments
Hi, thanks for the great work!
I'm facing a problem. When I run the "Test the Dataloader" step, I hit the issue below:
Convert C4 dataset to StreamingDataset format
python3 scripts/data_prep/convert_dataset_hf.py --dataset c4 --data_subset en --out_root my-copy-c4 --splits train_small val_small --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>' --compression zstd
Test the Dataloader
python3 llmfoundry/data/text_data.py --local_path my-copy-c4/ --split train_small
Hey @MLlove0402, I just ran the same commands locally and they completed successfully. First, I converted the dataset to MDS, then iterated over it.
$ python3 llmfoundry/data/text_data.py --local_path my-copy-c4/ --split train_small
/usr/lib/python3/dist-packages/pydantic/_internal/_fields.py:149: UserWarning: Field "model_server_url" has conflict with protected namespace "model_".
You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
warnings.warn(
/usr/lib/python3/dist-packages/pydantic/_internal/_config.py:321: UserWarning: Valid config keys have changed in V2:
* 'schema_extra' has been renamed to 'json_schema_extra'
warnings.warn(message, UserWarning)
Reading train_small split from my-copy-c4/
/mnt/workdisk/karan/karan_streaming/streaming/base/dataset.py:392: UserWarning: Because `predownload` was not specified, it will default to 8*batch_size if batch_size is not None, otherwise 64. Prior to Streaming v0.7.0, `predownload` defaulted to max(batch_size, 256 * batch_size // num_canonical_nodes).
warnings.warn(f'Because `predownload` was not specified, it will default to ' +
/mnt/workdisk/karan/karan_streaming/streaming/base/dataset.py:677: UserWarning: Because `num_canonical_nodes` was not specified, and `shuffle_algo` is py1e, it will default to be equal to physical nodes. Prior to Streaming v0.7.0, `num_canonical_nodes` defaulted to 64 * physical nodes.
warnings.warn(f'Because `num_canonical_nodes` was not specified, and ' +
/mnt/workdisk/karan/karan_streaming/streaming/base/dataset.py:650: UserWarning: Because `shuffle_block_size` was not specified, it will default to max(4_000_000 // num_canonical_nodes, 1 << 18) if num_canonical_nodes is not None, otherwise 262144. Prior to Streaming v0.7.0, `shuffle_block_size` defaulted to 262144.
warnings.warn(f'Because `shuffle_block_size` was not specified, it will default to ' +
#################### Batch 0 ####################
input_ids torch.Size([2, 32]) torch.int64
labels torch.Size([2, 32]) torch.int64
-------------------- Sample 0 --------------------
Beginners BBQ Class Taking Place in Missoula!
Do you want to get better at making delicious BBQ? You will have the opportunity, put
-------------------- Sample 1 --------------------
to let HIM fulfill HIS plan in your life.
SOMETHING NEW THAT WILL HELP YOU HEAR GODS VOICE!
How to Under
#################### Batch 1 ####################
input_ids torch.Size([2, 32]) torch.int64
labels torch.Size([2, 32]) torch.int64
-------------------- Sample 0 --------------------
more easily and move it around securely.
The pan can withstand heat of up to 450 F, and I think that this should be sufficient for most baking jobs
-------------------- Sample 1 --------------------
Genisys was [spoiler alert] ….. indeed half of the inspiration for this convo. I saw that article about Alan Taylor’s
#################### Batch 2 ####################
input_ids torch.Size([2, 32]) torch.int64
labels torch.Size([2, 32]) torch.int64
-------------------- Sample 0 --------------------
Quartz in massive form with Crocidolite inclusions.
Cat's Eye Quartz - Quartz with dense, tiny Rutile inclusions that cause a cat
-------------------- Sample 1 --------------------
time. It also could expose the entire home, office and other such places to unwanted risks. Further, you also might need to make some changes to your entire
#################### Batch 3 ####################
input_ids torch.Size([2, 32]) torch.int64
labels torch.Size([2, 32]) torch.int64
-------------------- Sample 0 --------------------
trial period has ended.
Is there any solution to this problem? we need it to be as secure as it can without being open to abuse and easy to
-------------------- Sample 1 --------------------
Arrangements by WOODLAWN FUNERAL HOME, 383-4754 Gallatin, TN MARTIN, Allen Palmer House- Age 95
#################### Batch 4 ####################
input_ids torch.Size([2, 32]) torch.int64
labels torch.Size([2, 32]) torch.int64
-------------------- Sample 0 --------------------
-on” laborers needed for take down of Assembly for Children props and décor.<|endoftext|>Jomsom Muktinath Trek is an exciting trek in the
-------------------- Sample 1 --------------------
kids.
Lendoiro is on war against the clubs that haven’t paid their obligations to Deportivo; there’s a well-known battle
What version of streaming-dataset are you using? Can you try cleaning the stale shared memory to see if it helps, using the command below?
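To check which version is installed, a quick stdlib-only sketch works; it assumes the package is distributed on PyPI as `mosaicml-streaming` (the import name is `streaming`):

```python
# Report the installed streaming version, as asked above.
# Uses only the standard library, so it works even if the
# import itself is failing for some reason.
from importlib import metadata

def get_streaming_version() -> str:
    try:
        # "mosaicml-streaming" is assumed to be the PyPI distribution name.
        return metadata.version("mosaicml-streaming")
    except metadata.PackageNotFoundError:
        return "not installed"

print(get_streaming_version())
```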
import streaming
# clean stale shared memory if any
streaming.base.util.clean_stale_shared_memory()
Thanks a lot @karan6181, I tried cleaning the shared memory and it worked!