Shard maximum size should be 4GB for MDS
smspillaz opened this issue
To reproduce
Steps to reproduce the behavior:
- Pack more than 4GB of uncompressed data into an MDS shard without `size_limit` being set.
- Try to read it later.
- Corruption due to unsigned integer overflow.
Expected behavior
The maximum size for MDS shards should be at most 4GB.
Additional context
I haven't made an exact reproducer yet, but after some inspection of shards where the uncompressed size was around 5GB, I think the issue is here:
- When encoding an MDS shard, the offsets get written in the header of the file here: https://github.com/mosaicml/streaming/blob/main/streaming/base/format/mds/writer.py#L141 . There is no check for unsigned integer overflow, so if the cumulative sum of `sizes` exceeds `2 ** 32`, then we get integer overflow and `offsets` wraps around.
- When reading, we blindly read those offsets https://github.com/mosaicml/streaming/blob/main/streaming/base/format/mds/reader.py#L141 and trust them, then read the data. If there is overflow, we will loop back to the beginning of the uncompressed data and read invalid content.
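The wraparound described above can be illustrated with a minimal sketch. This is not the actual writer code: `encode_offsets_uint32` is a hypothetical helper, and it only assumes that sample end offsets are a cumulative sum of sample sizes stored as 32-bit unsigned integers with no range check.

```python
import numpy as np

GIB = 2 ** 30


def encode_offsets_uint32(sizes):
    """Hypothetical helper: cumulative byte offsets stored as uint32, unchecked."""
    offsets = np.zeros(len(sizes) + 1, dtype=np.int64)
    # End offset of each sample is the running total of sample sizes.
    offsets[1:] = np.cumsum(np.asarray(sizes, dtype=np.int64))
    # Casting to uint32 keeps only the low 32 bits, so values >= 2 ** 32 wrap.
    return offsets.astype(np.uint32)


# Five samples of 1 GiB each: 5 GiB of uncompressed data in one shard.
sizes = [GIB] * 5
offsets = encode_offsets_uint32(sizes)

for index, (begin, end) in enumerate(zip(offsets[:-1], offsets[1:])):
    print(f"sample {index}: begin={int(begin)} end={int(end)}")
```

In this sketch the fourth sample ends up with an end offset smaller than its begin offset, and the fifth sample's range points back at the start of the data, which matches the "loop back to the beginning" symptom on read.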
This can probably be fixed by just capping the maximum size of an MDS shard to 4GB. Perhaps the shard formats should have some kind of cap. Right now there is no warning if no size limit is given.
Alternatively, offsets could be stored as uint64, which would raise the shard size limit to roughly 18 exabytes, but that would require bumping the file version.
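A rough sketch of the first option, the 4GB cap, might look like the following. `check_offsets_fit_uint32` and its `header_bytes` parameter are hypothetical names for illustration only, not part of the streaming library; the sketch assumes every offset in a shard must fit in a uint32.

```python
UINT32_MAX = 2 ** 32 - 1


def check_offsets_fit_uint32(sizes, header_bytes=0):
    """Hypothetical guard: fail loudly if any shard offset would exceed uint32."""
    # The largest offset is the header size plus the total size of all samples.
    total = header_bytes + sum(sizes)
    if total > UINT32_MAX:
        raise ValueError(
            f"shard would be {total} bytes, but uint32 offsets can address "
            f"at most {UINT32_MAX} bytes; set a smaller size limit"
        )
    return total


small_total = check_offsets_fit_uint32([10, 20], header_bytes=4)

try:
    check_offsets_fit_uint32([2 ** 30] * 5)  # 5 GiB of samples
    oversized_rejected = False
except ValueError:
    oversized_rejected = True
```

Raising at write time turns silent corruption on read into an immediate error, and it avoids a change to the on-disk format.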
Ah, never mind, it's a duplicate of a bug filed elsewhere: #671