mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training

Home Page: https://streaming.docs.mosaicml.com


Last entry in the dataset is causing "Relative sample index $x is not present" error

isidentical opened this issue · comments

Environment

  • OS: Ubuntu 20.04
  • Hardware (GPU, or instance type): H100

When I try to load a big dataset with thousands of shards (each shard is ~1 GB), I get the following error on some of those shards:

[rank5]:   File "/home/ubuntu/xxx/.venv/lib/python3.10/site-packages/streaming/base/array.py", line 90, in __getitem__
[rank5]:     return self.get_item(at)
[rank5]:   File "/home/ubuntu/xxx/.venv/lib/python3.10/site-packages/streaming/base/dataset.py", line 1235, in get_item
[rank5]:     sample = shard[shard_sample_id]
[rank5]:   File "/home/ubuntu/xxx/.venv/lib/python3.10/site-packages/streaming/base/array.py", line 90, in __getitem__
[rank5]:     return self.get_item(at)
[rank5]:   File "/home/ubuntu/xxx/.venv/lib/python3.10/site-packages/streaming/base/format/base/reader.py", line 319, in get_item
[rank5]:     data = self.get_sample_data(idx)
[rank5]:   File "/home/ubuntu/xxx/.venv/lib/python3.10/site-packages/streaming/base/format/mds/reader.py", line 145, in get_sample_data
[rank5]:     raise IndexError(
[rank5]: IndexError: Relative sample index 5205 is not present in the shard.01000.mds file.

But when looking into the actual data files, the shards themselves seem correct (~1 GB each, and everything can be indexed properly except the last item). Here is that shard's entry from index.json:

{'column_encodings': ['str', 'jpeg', 'str', 'np16', 'uint8', 'np16'],
 'column_names': ['caption',
                  'image',
                  'key',
                  'sscd_embeddings',
                  't5_xl_embeddings',
                  'vae_256x256_latents'],
 'column_sizes': [None, None, None, None, None, None],
 'compression': None,
 'format': 'mds',
 'hashes': [],
 'raw_data': {'basename': 'shard.01000.mds', 'bytes': 1073575555, 'hashes': {}},
 'samples': 5206,
 'size_limit': 1073741824,
 'version': 2,
 'zip_data': None}
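
One way to check that only the last sample fails is to map the shard back to its global sample range and probe those samples directly. A minimal sketch (paths are hypothetical, and it assumes global sample IDs follow shard order in index.json):

# Sketch: find the global sample range covered by the failing shard, then probe
# its first and last samples through StreamingDataset the same way training does.
import json
from pathlib import Path
from streaming import StreamingDataset

local = Path("/path/to/dataset")  # hypothetical local copy of the dataset
index = json.loads((local / "index.json").read_text())

# Assumption: global sample IDs are assigned shard by shard, in index order.
begin = 0
for shard in index["shards"]:
    end = begin + shard["samples"]
    if shard["raw_data"]["basename"] == "shard.01000.mds":
        break
    begin = end

ds = StreamingDataset(local=str(local), shuffle=False)
for i in (begin, end - 2, end - 1):  # first, second-to-last, and last sample of the shard
    try:
        ds[i]
        print(i, "ok")
    except IndexError as exc:
        print(i, "failed:", exc)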

As the index shows, there are 5206 samples, which makes sample index 5205 the last item. When I read the sample offsets manually, I see the following values:

>>> import numpy as np
>>> filename = "shard.01000.mds"
>>> offset = (1 + 5205) * 4
>>> with open(filename, 'rb', 0) as fp:
...     fp.seek(offset)
...     pair = fp.read(8)
...     begin, end = np.frombuffer(pair, np.uint32)
>>> begin
1073575555
>>> end
1868767867
>>> end - begin
795192312 (invalid value)

The problem is that there is nothing after byte 1073575555:

>>> with open(filename, 'rb', 0) as fp:
...     fp.seek(1073575555)
...     data = fp.read()
... 
1073575555
>>> data
b''

I am assuming this happened because the sample didn't fit within the size limit but still got counted in the index (since size_limit - 1073575555 is too small to fit anything), somehow? In either case, this seems to have made the dataset unusable. I will try to manually fix the index, but I'm just making you aware that this is a problem.
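
To sanity-check that theory on a single shard, the whole offsets table can be read and compared against the bytes actually on disk. A small sketch, assuming the layout implied by the reads above (a uint32 sample count followed by samples + 1 uint32 offsets):

# Sketch: compare the shard's recorded sample offsets with the actual file size.
# Layout assumed from the reads above: uint32 sample count, then (samples + 1) uint32 offsets.
import os
import numpy as np

filename = "shard.01000.mds"
with open(filename, "rb") as fp:
    num_samples = int(np.frombuffer(fp.read(4), np.uint32)[0])
    offsets = np.frombuffer(fp.read(4 * (num_samples + 1)), np.uint32)

file_size = os.path.getsize(filename)
print("samples declared:   ", num_samples)
print("last recorded offset:", offsets[-1])
print("actual file size:    ", file_size)
print("truncated samples:   ", int(np.sum(offsets[1:] > file_size)))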

This is how I was able to solve the issue:

import json
import numpy as np
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed


# Local dataset directories as Path objects (each contains index.json plus its shard files).
datasets = [...]


def check_shard(dataset, shard):
    """Return True if the last sample's end offset fits inside the shard file."""
    shard_f = dataset / shard["raw_data"]["basename"]
    # The begin/end offsets of the last sample start at byte (1 + samples - 1) * 4.
    offset = shard["samples"] * 4
    with open(shard_f, "rb", 0) as fp:
        fp.seek(offset)
        pair = fp.read(8)
        begin, end = np.frombuffer(pair, np.uint32)
    shard_size = shard["raw_data"]["bytes"]
    return shard_size >= end


with ThreadPoolExecutor(max_workers=64) as executor:
    for dataset in datasets:
        index_f = dataset / "index.json"
        index = json.loads(index_f.read_text())

        futures = []
        for shard in index["shards"]:
            futures.append(executor.submit(check_shard, dataset, shard))

        success = True
        for future in as_completed(futures):
            shard_id = futures.index(future)
            shard = index["shards"][shard_id]
            if not future.result():
                print(
                    f"Shard {dataset / shard['raw_data']['basename']} is not properly indexed"
                )
                # For some reason, dropping 1 sample is not enough, since some shards
                # are missing more than a couple of samples.
                if shard["samples"] <= 10:
                    print("ALERT!!!")
                    continue
                shard["samples"] -= 3
                success = False
                print(shard)

        if not success:
            print("Rewriting index...")
            index_f.write_text(json.dumps(index))
        else:
            print(f"Dataset {dataset} is OK")

Hmm... this does seem like a bug on our side, although I'm not sure why this would be the case. Do you have a simple script that can reproduce this shard-writing issue? Also, does increasing the size_limit help in this case?
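
In the meantime, something along these lines might be a starting point for a repro. This is just a sketch (output path, column schema, and sample sizes are made up), not a confirmed reproduction:

# Sketch of an attempted repro: write variable-sized samples so shard boundaries
# land close to size_limit, then read every sample back.
import os
from streaming import MDSWriter, StreamingDataset

out = "/tmp/mds_repro"                       # hypothetical output directory
columns = {"key": "str", "data": "bytes"}    # made-up schema

with MDSWriter(out=out, columns=columns, compression=None, size_limit=1 << 20) as writer:
    for i in range(200):
        # Vary the payload so some samples barely fit (or barely don't) in the current shard.
        payload = os.urandom(200_000 + (i % 7) * 100_000)
        writer.write({"key": str(i), "data": payload})

ds = StreamingDataset(local=out, shuffle=False)
for i in range(len(ds)):
    ds[i]                                    # should raise IndexError if a shard got truncated
print("read all", len(ds), "samples back")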

+1, I am also getting this issue, but the snippet posted above solves it 😄. The only hassle was that I had to download all the shards locally and then run the snippet (which kinda defeats the whole purpose of streaming 😢). My setup is the same as the OP described above.
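
For what it's worth, the check itself only needs index.json plus 8 bytes from the tail of each shard's offset table, so it should be possible to run it against remote storage with ranged reads instead of downloading every shard. A rough sketch using fsspec (assuming uncompressed shards like in the index above, that the relevant fsspec backend such as s3fs is installed, and a hypothetical remote path):

# Rough sketch: run the same last-sample check against remote shards via fsspec
# ranged reads (8 bytes per shard) instead of downloading everything locally.
import json
import numpy as np
import fsspec

remote = "s3://my-bucket/my-dataset"         # hypothetical remote dataset root

with fsspec.open(f"{remote}/index.json", "rt") as fp:
    index = json.load(fp)

for shard in index["shards"]:
    basename = shard["raw_data"]["basename"]
    with fsspec.open(f"{remote}/{basename}", "rb") as fp:
        fp.seek(shard["samples"] * 4)        # begin/end offsets of the last sample
        begin, end = np.frombuffer(fp.read(8), np.uint32)
    if end > shard["raw_data"]["bytes"]:
        print(f"{basename} looks truncated: last sample ends at {end}, "
              f"shard has {shard['raw_data']['bytes']} bytes")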