Q: Can I load local files without uploading them to cloud storage?
Spico197 opened this issue · comments
🚀 Feature Request
I'm wondering if it's possible to directly load files from a local folder containing many JSONL files, without uploading them to cloud storage first.
Motivation
My dataset is not too large (though it is split into multiple files across different sub-folders), so it can be loaded locally. I'd like to load it directly instead of uploading it to cloud storage first. Does this framework support that?
Thanks a lot for your response~
To stream your dataset in from the cloud:
dataset = StreamingDataset(remote='s3://path/to/dataset', local='/path/to/cache', batch_size=...)
To copy your dataset in from elsewhere on the filesystem:
dataset = StreamingDataset(remote='/path/to/dataset', local='/path/to/cache', batch_size=...)
To iterate over a dataset in-place:
dataset = StreamingDataset(local='/path/to/dataset', batch_size=...)
Speaking of JSONL, please note that Streaming requires its own vertically integrated "index + shards" serialization format. The key feature of this format is instant global random access to samples (when cached and decompressed, etc.), a foundation upon which we have built some nice things like powerful shuffling.
One of the supported shard types in Streaming is "Streaming JSONL", which pairs each normal JSONL file with a second metadata file that is essentially np.array(byte_offset_of_each_line, np.uint32), giving us instant disk-based random access while still using JSONL.
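To make the byte-offset idea concrete, here is a minimal stdlib-only sketch (not Streaming's actual implementation): build an offset table for a JSONL blob, pack it as uint32 the way the metadata file would, and use it to seek directly to any sample.

```python
# Sketch of the "JSONL + byte-offset index" idea, using only the stdlib.
# Names like build_index/read_sample are illustrative, not the library's API.
import io
import json
import struct

def build_index(jsonl_bytes: bytes) -> list:
    """Return the byte offset of the start of each line."""
    offsets = []
    pos = 0
    for line in io.BytesIO(jsonl_bytes):
        offsets.append(pos)
        pos += len(line)
    return offsets

def read_sample(jsonl_bytes: bytes, offsets: list, i: int) -> dict:
    """Random access: seek straight to line i and decode just that line."""
    f = io.BytesIO(jsonl_bytes)
    f.seek(offsets[i])
    return json.loads(f.readline())

data = b'{"id": 0}\n{"id": 1}\n{"id": 2}\n'
index = build_index(data)          # [0, 10, 20]
# On disk, the metadata file would store these offsets packed as uint32:
packed = struct.pack('<%dI' % len(index), *index)
print(read_sample(data, index, 2))  # jumps straight to the third sample
```

With the offsets cached, fetching sample `i` is a single seek plus one line read, regardless of file size, which is what enables global shuffling over shards.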
Thank you very much for the examples and the thorough explanation! This is so great~