mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training

Home Page: https://streaming.docs.mosaicml.com

Shard maximum size should be 4GB for MDS

smspillaz opened this issue

To reproduce

Steps to reproduce the behavior:

  1. Pack more than 4GB of uncompressed data into an MDS shard without setting size_limit.
  2. Try to read the shard back later.
  3. The samples read back are corrupted because of unsigned integer overflow.

Expected behavior

MDS shards should be capped at 4GB.

Additional context

I haven't made an exact reproducer yet, but after some inspection of shards where the uncompressed size was around 5GB, I think the issue is here:

  1. When encoding an MDS shard, the offsets get written in the header of the file here: https://github.com/mosaicml/streaming/blob/main/streaming/base/format/mds/writer.py#L141 . There is no check for unsigned integer overflow, so if the cumulative sum of sizes exceeds 2 ** 32 - 1, the offsets overflow and wrap around.
  2. When reading, we blindly read those offsets (https://github.com/mosaicml/streaming/blob/main/streaming/base/format/mds/reader.py#L141) and trust them, then read the data. If the offsets have overflowed, they point back toward the beginning of the uncompressed data and we read invalid content.
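To illustrate the wraparound described in the two steps above, here is a minimal pure-Python sketch. It is not the library's code: `wrapped_offsets` is a hypothetical helper that simulates storing cumulative byte offsets in 32 unsigned bits by reducing them modulo 2 ** 32, and header bytes are ignored for simplicity.

```python
from itertools import accumulate

UINT32_MODULUS = 2 ** 32  # a uint32 holds values in [0, 2**32 - 1]
GIB = 2 ** 30


def wrapped_offsets(sample_sizes):
    """Return cumulative byte offsets as they would look if stored as uint32.

    Hypothetical helper for illustration; not part of the streaming library.
    """
    true_offsets = [0] + list(accumulate(sample_sizes))
    # Silent overflow is equivalent to keeping only the low 32 bits.
    return [offset % UINT32_MODULUS for offset in true_offsets]


# Five samples of 1 GiB each: 5 GiB in total, more than a uint32 can address.
offsets = wrapped_offsets([GIB] * 5)
print(offsets)
```

In this sketch the true offset of the fifth sample is 4 GiB, which wraps to 0, so a reader trusting the stored offsets would fetch bytes from the start of the data instead of the real sample.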

This can probably be fixed by just capping the maximum size of an MDS shard to 4GB. Perhaps the shard formats should have some kind of cap. Right now there is no warning if no size limit is given.
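A cap like this could be enforced with a check before the shard is written. The sketch below assumes offsets are stored as uint32 and measured from the start of the file; `check_shard_fits_uint32` and its parameters are hypothetical names, not an existing API in the streaming library.

```python
UINT32_MAX = 2 ** 32 - 1


def check_shard_fits_uint32(header_size, sample_sizes):
    """Return the shard's end offset, or raise if it would not fit in a uint32.

    Hypothetical guard for illustration; header_size is the number of bytes
    that precede the first sample in the shard file.
    """
    end_offset = header_size + sum(sample_sizes)
    if end_offset > UINT32_MAX:
        raise ValueError(
            f"shard would end at byte {end_offset}, which exceeds the "
            f"uint32 maximum of {UINT32_MAX}; set a smaller size limit"
        )
    return end_offset


small_end = check_shard_fits_uint32(100, [1000, 2000])  # fits comfortably

try:
    check_shard_fits_uint32(100, [2 ** 30] * 5)  # about 5 GiB of samples
    oversized_rejected = False
except ValueError:
    oversized_rejected = True
```

Failing loudly at write time seems preferable to a warning here, since an oversized shard cannot be read back correctly.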

Alternatively, offsets could be stored as uint64, which would raise the shard size cap to about 18 exabytes, but that would require bumping the file version.
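The difference between the two widths can be seen with Python's standard `struct` module. This is only a sketch of the encoding trade-off: the little-endian byte order is an assumption for the example, and `encode_offset` is a hypothetical helper rather than anything in the MDS writer.

```python
import struct


def encode_offset(offset, wide):
    """Pack one offset as little-endian uint64 (wide) or uint32 (not wide).

    Hypothetical helper for illustration only.
    """
    fmt = "<Q" if wide else "<I"
    return struct.pack(fmt, offset)


five_gib = 5 * 2 ** 30

# uint64 stores the offset losslessly, at the cost of 8 bytes per entry.
packed = encode_offset(five_gib, wide=True)
(round_tripped,) = struct.unpack("<Q", packed)

# struct refuses to pack an out-of-range uint32 instead of wrapping silently.
try:
    encode_offset(five_gib, wide=False)
    narrow_failed = False
except struct.error:
    narrow_failed = True
```

Since every offset entry would grow from 4 to 8 bytes, older readers would misparse the header, which is why a version bump would be needed.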

Ah, never mind, it's a dupe of a bug filed elsewhere: #671