mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training

Home Page: https://streaming.docs.mosaicml.com

Q: Can I load local files without uploading them to cloud storage?

Spico197 opened this issue

🚀 Feature Request

I'm wondering whether it's possible to load files directly from local folders containing many JSONL files, without uploading them to cloud storage.

Motivation

My dataset is not too large (though it is split into multiple files in different sub-folders), so it can be loaded locally. I'd like to load the files directly instead of uploading them to cloud storage first. Does this framework support that?

Thanks a lot for your response~

To stream your dataset in from the cloud:

dataset = StreamingDataset(remote='s3://path/to/dataset', local='/path/to/cache', batch_size=...)

To copy your dataset in from elsewhere on the filesystem:

dataset = StreamingDataset(remote='/path/to/dataset', local='/path/to/cache', batch_size=...)

To iterate over a dataset in-place:

dataset = StreamingDataset(local='/path/to/dataset', batch_size=...)

Speaking of JSONL, please note that Streaming requires its own vertically integrated "index + shards" serialization format. The key feature of this format is instant global random access to samples (when cached and decompressed, etc.), a foundation upon which we have built some nice things like powerful shuffling.
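For a folder tree of plain JSONL files like the one in the question, this means a one-time conversion step before StreamingDataset(local=...) can iterate over the data. Here is a rough sketch of what that could look like, not an official recipe. The helper names iter_jsonl_samples and convert_to_mds are made up for illustration, and the single string field named 'text' is an assumption about your data; adjust the columns mapping to match your real fields.

```python
import json
from pathlib import Path
from typing import Dict, Iterator


def iter_jsonl_samples(root: str) -> Iterator[Dict]:
    """Yield one dict per non-empty line from every *.jsonl file under root."""
    # rglob walks all sub-folders; sorting keeps the sample order reproducible.
    for path in sorted(Path(root).rglob('*.jsonl')):
        with path.open('r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if line:
                    yield json.loads(line)


def convert_to_mds(src_root: str, out_dir: str) -> None:
    """Write all samples under src_root into one Streaming dataset at out_dir."""
    # Imported here so the reader above can be tried without Streaming installed.
    from streaming import MDSWriter

    # Assumption: every sample has a single string field named 'text'.
    columns = {'text': 'str'}
    with MDSWriter(out=out_dir, columns=columns) as writer:
        for sample in iter_jsonl_samples(src_root):
            writer.write({'text': sample['text']})
```

After something like convert_to_mds('/path/to/jsonl_folders', '/path/to/dataset') has run once, the in-place form above, StreamingDataset(local='/path/to/dataset', batch_size=...), should be able to read the result.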

One of the supported shard types in Streaming is "Streaming JSONL", which pairs each normal JSONL file with a second metadata file that is essentially np.array(byte_offset_of_each_line, np.uint32). That offset array gives instant disk-based random access while still using JSONL.
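To make the offset idea concrete, here is a small standalone sketch of the technique. It is not Streaming's actual code or on-disk layout: build_offsets and read_sample are hypothetical names, the standard library's array type stands in for the NumPy uint32 array so the sketch has no dependencies, and the extra end-of-file entry is a convenience added here.

```python
import json
from array import array
from typing import Any


def build_offsets(jsonl_path: str) -> array:
    """Return the byte offset at which each line starts, plus the file size."""
    offsets = array('I')  # unsigned ints, standing in for np.uint32
    pos = 0
    with open(jsonl_path, 'rb') as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)  # byte length, including the trailing newline
    offsets.append(pos)  # sentinel: where the last line ends
    return offsets


def read_sample(jsonl_path: str, offsets: array, index: int) -> Any:
    """Seek straight to line `index` and decode only that line."""
    begin, end = offsets[index], offsets[index + 1]
    with open(jsonl_path, 'rb') as f:
        f.seek(begin)
        return json.loads(f.read(end - begin))
```

With the offsets built once, read_sample(path, offsets, i) costs one seek and one short read no matter how large the file is, which is what makes random access and shuffling cheap on top of a line-oriented format.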

Thank you very much for the examples and the thorough explanation! This is so great~