Support large size index.json (20GB +)

andreamad8 opened this issue 3 months ago

🚀 Feature Request

Large index.json files are slow to load. I am currently trying to increase the shard size so that stream.py#L473 runs faster (hopefully).

Motivation

These two steps are very slow for large index.json files.

https://github.com/mosaicml/streaming/blob/main/streaming/base/stream.py#L461

and

https://github.com/mosaicml/streaming/blob/main/streaming/base/stream.py#L473

especially with large-scale datasets (e.g., a billion samples).

Alex Schneidman · Answer 1 · Fri Apr 26 2024 02:25:36 GMT+0800 (China Standard Time)

Some more context: we have a dataset with ~1.2 billion samples at roughly 1 MB per sample. The index.json file of the merged dataset will be in the tens of GBs, which makes the dataset prohibitively slow to initialize.
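For a rough sense of why shard size matters here, the sketch below estimates how many shards a dataset of this scale produces. It assumes index.json holds one entry per shard and that a shard is filled until it reaches a size limit. The function name `estimate_shard_count` and the shard sizes tried are hypothetical and chosen only for illustration.

```python
def estimate_shard_count(num_samples: int, bytes_per_sample: int, shard_size_limit: int) -> int:
    """Estimate the number of shards (assumed: one index.json entry per shard)."""
    # How many whole samples fit in one shard; at least one, even if a sample is oversized.
    samples_per_shard = max(1, shard_size_limit // bytes_per_sample)
    # Integer ceiling division, exact even for very large counts.
    return -(-num_samples // samples_per_shard)


if __name__ == "__main__":
    num_samples = 1_200_000_000   # ~1.2 billion samples, as in the comment above
    bytes_per_sample = 1_000_000  # ~1 MB per sample
    for limit in (64 * 2**20, 1 * 2**30):  # hypothetical shard size limits
        shards = estimate_shard_count(num_samples, bytes_per_sample, limit)
        print(f"shard limit {limit} bytes -> about {shards} index entries")
```

Under these assumptions, raising the shard size limit shrinks the number of index entries roughly in proportion, which is the effect the original poster is hoping for.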

Saaketh Narayan · Answer 2 · Thu May 09 2024 04:23:38 GMT+0800 (China Standard Time)

Hey, we have seen index.json load times be slow. I think that this is because we download the index file on every single rank, rather than downloading it on just one rank and then broadcasting its contents to other ranks. Downloading a file that's a few GB from cloud storage just on one rank should be relatively fast. This would be a good enhancement but isn't high priority for us right now -- if it's not too much of a hassle, mind submitting a PR?
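A minimal sketch of the "download on one rank, then broadcast" idea described above. This is not the library's implementation: `load_index_once`, `fetch`, and `broadcast` are hypothetical names, and the transport is injected so the logic can be tried without a cluster. `torch_broadcast` shows one possible transport using `torch.distributed.broadcast_object_list`, assuming a process group is already initialized.

```python
import json
from typing import Any, Callable, Optional


def load_index_once(
    rank: int,
    fetch: Callable[[], bytes],
    broadcast: Callable[[Optional[Any]], Any],
) -> Any:
    """Fetch and parse the index on rank 0 only, then share it with every rank."""
    # Only rank 0 touches remote storage and pays the JSON parsing cost.
    index = json.loads(fetch()) if rank == 0 else None
    # Every rank must call broadcast; each one gets rank 0's object back.
    return broadcast(index)


def torch_broadcast(obj: Optional[Any]) -> Any:
    """One possible `broadcast`, assuming torch.distributed is initialized."""
    # Imported lazily so the sketch still loads where PyTorch is not installed.
    import torch.distributed as dist

    box = [obj]  # broadcast_object_list fills this list in place on non-source ranks
    dist.broadcast_object_list(box, src=0)
    return box[0]
```

In a real job each rank would call something like `load_index_once(dist.get_rank(), fetch, torch_broadcast)`. Whether pickling a multi-GB object for the broadcast is actually faster than a per-rank download is untested here and would need measuring.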