mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training

Home Page: https://streaming.docs.mosaicml.com


[QUESTION] Some detailed questions about the shuffle algorithms described in the official documentation.

yanghua opened this issue

The documentation link: https://docs.mosaicml.com/projects/streaming/en/stable/dataset_configuration/shuffling.html

Regarding py1e:
why does "Samples from each shard are spread out across a range of maximum size shuffle_block_size." lead to "reducing the maximum needed cache limit and better balancing shard downloads"?

For comparison: py1b and py1br
"This algorithm is very similar to py1br, without randomizing shuffle block sizes, resulting in suboptimal download performance." How should this be understood?

Hey @yanghua, great question. The main reason is predownloading behavior. StreamingDataset allows workers to predownload shards for samples a bit ahead of the current batch, to make sure the shards are available locally when needed. With py1b and py1br, we use "shuffle-block"-based shuffling, which essentially means that we designate a group of shards to be in the same shuffle block and shuffle all the samples from those shards together. For example, if my shuffle block encompasses 4 shard files, each containing 100 samples, then my shuffle block size is 400 samples, and the samples from those 4 shard files will be shuffled together. Notice that at any point in training, a global batch can consist of samples from 4 shards at a time.
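For reference, these are the knobs involved on StreamingDataset. A minimal configuration sketch along the lines of the example above (the bucket path and values are illustrative, not recommendations) might look like this:

```python
from streaming import StreamingDataset

# Illustrative numbers: shards of ~100 samples, so a 400-sample shuffle
# block spans roughly 4 shard files.
dataset = StreamingDataset(
    remote='s3://my-bucket/my-dataset',  # hypothetical remote path
    local='/tmp/my-dataset',
    shuffle=True,
    shuffle_algo='py1br',                # or 'py1b' / 'py1e'
    shuffle_block_size=400,              # samples per shuffle block
    predownload=100,                     # samples each worker fetches ahead
)
```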

The py1e shuffle algorithm, in contrast, essentially distributes the samples from each shard over a range that's equal to the shuffle block size. So using the same example, each shard's samples will be scattered across a range of 400 samples. If we do this for every shard, at any point in training, we maintain that each global batch can consist of samples from ~4 shards at a time.

Now, consider what happens when predownloading samples with py1b or py1br. As training approaches the end of the current shuffle block, the predownload range spills over into the next shuffle block. The current shuffle block requires its 4 shards to be present, since global batches draw from 4 shards at a time, but the batches in the next shuffle block require 4 different shards to be present as well. The cache therefore has to hold up to 8 shard files at once.

In py1e, because we have distributed each shard's samples over a range, there are no hard boundaries where predownloading forces us to download 4 extra shards. The number of shards needed for training is always just ~4 shards, even with predownloading.
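If a concrete experiment helps, here is a toy, self-contained sketch of that difference. These are simplified stand-ins, not the library's actual shuffle implementations: it builds a block-shuffled ordering and a spread-out ordering, then measures the worst-case number of distinct shards touched by a fixed lookahead window, as a rough proxy for how many shard files a worker would need cached at once.

```python
import random

SHARD_SIZE = 100   # samples per shard
NUM_SHARDS = 8
BLOCK = 400        # shuffle block size in samples (~4 shards)
LOOKAHEAD = 100    # rough stand-in for the predownload distance, in samples

def shard_of(sample_id: int) -> int:
    return sample_id // SHARD_SIZE

def block_shuffle(num_samples: int) -> list:
    """Toy py1b/py1br-like ordering: shuffle samples within fixed blocks of BLOCK samples."""
    order = []
    for start in range(0, num_samples, BLOCK):
        block = list(range(start, min(start + BLOCK, num_samples)))
        random.shuffle(block)
        order.extend(block)
    return order

def spread_shuffle(num_samples: int) -> list:
    """Toy py1e-like ordering: jitter each sample's position by up to half a block, then sort."""
    keyed = [(i + random.uniform(-BLOCK / 2, BLOCK / 2), i) for i in range(num_samples)]
    return [i for _, i in sorted(keyed)]

def max_resident_shards(order: list) -> int:
    """Worst-case number of distinct shards touched by any LOOKAHEAD-sample window."""
    worst = 0
    for t in range(len(order) - LOOKAHEAD + 1):
        shards = {shard_of(s) for s in order[t:t + LOOKAHEAD]}
        worst = max(worst, len(shards))
    return worst

num_samples = SHARD_SIZE * NUM_SHARDS
print('block-shuffled ordering:', max_resident_shards(block_shuffle(num_samples)))
print('spread-out ordering:   ', max_resident_shards(spread_shuffle(num_samples)))
```

With these toy numbers, the block-shuffled ordering peaks near block boundaries (the window draws from both the current and the next block's shards), while the spread-out ordering has no hard boundaries and stays much closer to the per-block shard count.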

I hope this makes more sense! The diagrams in the documentation are useful to visualize this as well.

Hi @snarayan21, thanks for your kind explanation. I think I understand now.

Can we draw an analogy between a shuffle block and a window in data processing semantics?

py1e's range mode looks like a sliding window.

py1b and py1br look like a tumbling (fixed) window.

@yanghua I'm not familiar with those terms, but if that helps as an analogy, sure :).

Closing this issue now.