mosaicml / llm-foundry

LLM training code for Databricks foundation models

Home Page: https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm

buffer is too small for requested array

MLlove0402 opened this issue · comments

Hi, thank you for the great work.
I'm facing a problem: when I run the "Test the Dataloader" step, I get the error below.
(screenshot of the "buffer is too small for requested array" error)

Convert C4 dataset to StreamingDataset format

python3 scripts/data_prep/convert_dataset_hf.py --dataset c4 --data_subset en --out_root my-copy-c4 --splits train_small val_small --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>' --compression zstd
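For reference, `--concat_tokens 2048` tokenizes each document, appends the EOS token, concatenates everything end-to-end, and slices the stream into fixed-length samples. A pure-Python sketch of that packing step (toy token ids and a hypothetical `pack_tokens` helper, not llm-foundry's actual implementation):

```python
from typing import Iterable, Iterator, List

def pack_tokens(docs: Iterable[List[int]], max_length: int,
                eos_id: int) -> Iterator[List[int]]:
    """Concatenate EOS-terminated tokenized docs into fixed-length samples."""
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc + [eos_id])       # append EOS between documents
        while len(buffer) >= max_length:    # emit full samples as they fill up
            yield buffer[:max_length]
            buffer = buffer[max_length:]

# Toy example: three short "documents", sample length 8, eos_id=0
# (the real script uses length 2048 and gpt-neox-20b's <|endoftext|>).
docs = [[1, 2, 3], [4, 5, 6, 7, 8, 9], [10, 11, 12, 13]]
samples = list(pack_tokens(docs, max_length=8, eos_id=0))
```

Packing this way means every training sample is exactly `max_length` tokens, so no padding is wasted; document boundaries survive only as EOS tokens inside the stream.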

Test the Dataloader

python3 llmfoundry/data/text_data.py --local_path my-copy-c4/ --split train_small

Hey @MLlove0402, I just ran the same commands locally and they completed successfully. First I converted the dataset to MDS, then iterated over it.

$ python3 llmfoundry/data/text_data.py --local_path my-copy-c4/ --split train_small
/usr/lib/python3/dist-packages/pydantic/_internal/_fields.py:149: UserWarning: Field "model_server_url" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(
/usr/lib/python3/dist-packages/pydantic/_internal/_config.py:321: UserWarning: Valid config keys have changed in V2:
* 'schema_extra' has been renamed to 'json_schema_extra'
  warnings.warn(message, UserWarning)
Reading train_small split from my-copy-c4/
/mnt/workdisk/karan/karan_streaming/streaming/base/dataset.py:392: UserWarning: Because `predownload` was not specified, it will default to 8*batch_size if batch_size is not None, otherwise 64. Prior to Streaming v0.7.0, `predownload` defaulted to max(batch_size, 256 * batch_size // num_canonical_nodes).
  warnings.warn(f'Because `predownload` was not specified, it will default to ' +
/mnt/workdisk/karan/karan_streaming/streaming/base/dataset.py:677: UserWarning: Because `num_canonical_nodes` was not specified, and `shuffle_algo` is py1e, it will default to be equal to physical nodes. Prior to Streaming v0.7.0, `num_canonical_nodes` defaulted to 64 * physical nodes.
  warnings.warn(f'Because `num_canonical_nodes` was not specified, and ' +
/mnt/workdisk/karan/karan_streaming/streaming/base/dataset.py:650: UserWarning: Because `shuffle_block_size` was not specified, it will default to max(4_000_000 // num_canonical_nodes, 1 << 18) if num_canonical_nodes is not None, otherwise 262144. Prior to Streaming v0.7.0, `shuffle_block_size` defaulted to 262144.
  warnings.warn(f'Because `shuffle_block_size` was not specified, it will default to ' +
[the two warnings above repeat several more times]


#################### Batch 0 ####################
input_ids torch.Size([2, 32]) torch.int64
labels torch.Size([2, 32]) torch.int64
--------------------  Sample 0  --------------------
Beginners BBQ Class Taking Place in Missoula!
Do you want to get better at making delicious BBQ? You will have the opportunity, put
--------------------  Sample 1  --------------------
 to let HIM fulfill HIS plan in your life.
SOMETHING NEW THAT WILL HELP YOU HEAR GODS VOICE!
How to Under


#################### Batch 1 ####################
input_ids torch.Size([2, 32]) torch.int64
labels torch.Size([2, 32]) torch.int64
--------------------  Sample 0  --------------------
 more easily and move it around securely.
The pan can withstand heat of up to 450 F, and I think that this should be sufficient for most baking jobs
--------------------  Sample 1  --------------------

Genisys was [spoiler alert] ….. indeed half of the inspiration for this convo. I saw that article about Alan Taylor’s


#################### Batch 2 ####################
input_ids torch.Size([2, 32]) torch.int64
labels torch.Size([2, 32]) torch.int64
--------------------  Sample 0  --------------------
 Quartz in massive form with Crocidolite inclusions.
Cat's Eye Quartz - Quartz with dense, tiny Rutile inclusions that cause a cat
--------------------  Sample 1  --------------------
 time. It also could expose the entire home, office and other such places to unwanted risks. Further, you also might need to make some changes to your entire


#################### Batch 3 ####################
input_ids torch.Size([2, 32]) torch.int64
labels torch.Size([2, 32]) torch.int64
--------------------  Sample 0  --------------------
 trial period has ended.
Is there any solution to this problem? we need it to be as secure as it can without being open to abuse and easy to
--------------------  Sample 1  --------------------

Arrangements by WOODLAWN FUNERAL HOME, 383-4754 Gallatin, TN MARTIN, Allen Palmer House- Age 95


#################### Batch 4 ####################
input_ids torch.Size([2, 32]) torch.int64
labels torch.Size([2, 32]) torch.int64
--------------------  Sample 0  --------------------
-on” laborers needed for take down of Assembly for Children props and décor.<|endoftext|>Jomsom Muktinath Trek is an exciting trek in the
--------------------  Sample 1  --------------------
 kids.
Lendoiro is on war against the clubs that haven’t paid their obligations to Deportivo; there’s a well-known battle

What version of streaming-dataset are you using? Also, can you try cleaning any stale shared memory with the command below to see if that helps?

import streaming

# clean stale shared memory if any
streaming.base.util.clean_stale_shared_memory()
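For context on why this helps: streaming coordinates workers through shared-memory segments, and a segment left behind by a crashed or interrupted run can be smaller than what the next run expects, which surfaces as numpy's "buffer is too small for requested array". A stdlib-only sketch of that failure mode (the segment name and sizes are illustrative only, not streaming's actual internals):

```python
from multiprocessing import shared_memory

# A crashed run leaves behind a small segment under a well-known name
# (name and sizes here are made up for illustration).
stale = shared_memory.SharedMemory(create=True, size=64, name="demo_stale_shm")

# The next run attaches by name and needs far more bytes than the
# stale segment holds -- the point where numpy would raise
# "buffer is too small for requested array".
attached = shared_memory.SharedMemory(name="demo_stale_shm")
needed = 1_000_000
too_small = len(attached.buf) < needed

# The fix mirrors clean_stale_shared_memory(): detach and unlink the
# leftover segment so a fresh run can recreate it at the right size.
attached.close()
stale.close()
stale.unlink()
```

Deleting the stale segment lets streaming allocate a new one sized for the current run, which is why the cleanup call resolves the error.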

Thanks very much @karan6181, I tried cleaning the stale shared memory and it works now.