libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)

Home Page: https://ffcv.io

Large .beton files slow down or even freeze learning during loading [possible bug]

SerezD opened this issue · comments

Hello, I have created two different versions of the imagenet .beton file.

The code is the following:

import pathlib

from torch.utils.data import Dataset
from PIL import Image
from ffcv.fields import RGBImageField
from ffcv.writer import DatasetWriter

# custom torch Image Dataset object 
class ImageDataset(Dataset):

    def __init__(self, folder: str):
        """
        :param folder: path to images
        """

        self.samples = sorted(list(pathlib.Path(folder).rglob('*.png')) + list(pathlib.Path(folder).rglob('*.jpg')) +
                              list(pathlib.Path(folder).rglob('*.bmp')) + list(pathlib.Path(folder).rglob('*.JPEG')))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):

        # path to string
        image_path = self.samples[idx].absolute().as_posix()

        image = Image.open(image_path).convert('RGB')
        return (image,)

# main part
final_path = ...  # final path for beton file
data_folder = ...  # path to images

# create dataset
dataset = ImageDataset(folder=data_folder)

# create writer [VERSION 1]
writer = DatasetWriter(final_path, {
    'image': RGBImageField(write_mode='jpg', max_resolution=256),
}, num_workers=8)

# create writer [VERSION 2]
# writer = DatasetWriter(final_path, {
#     'image': RGBImageField(write_mode='jpg'),
# }, num_workers=8)

writer.from_indexed_dataset(dataset)

The only difference between the two versions is the "max_resolution" parameter.

The two datasets are created correctly: version1.beton is approximately 20 GB, while version2.beton is approximately 80 GB.

At loading time:

from ffcv.loader import Loader, OrderOption
from ffcv.fields.rgb_image import CenterCropRGBImageDecoder
from ffcv.transforms import ToTensor, ToTorchImage

path = ...  # path to beton file versionx.beton
batch_size = ...  # tried different values
os_cache = ...  # tried both
image_size = ...  # output crop size

loader = Loader(path,
                batch_size=batch_size,
                num_workers=8,
                order=OrderOption.RANDOM,
                pipelines={
                    'image':
                    [
                        CenterCropRGBImageDecoder((image_size, image_size), ratio=1.),
                        ToTensor(),
                        ToTorchImage()
                    ]
                },
                os_cache=os_cache)

With version 1 everything works fine. Depending on the batch size and os_cache params, training may be faster or slower, but everything seems ok.

With version 2 (80 GB), if I set the batch size as high as possible, training is very slow and may even freeze the machine completely.

By monitoring resources, I noticed high CPU RAM usage (up to 100% right before freezing). I have tried both os_cache=True and os_cache=False, with the latter freezing the machine even before training starts. With os_cache=True, usually a couple of batches are loaded and the first epoch steps are completed before freezing.
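A minimal sketch of this kind of monitoring, assuming psutil is installed (loader is the FFCV Loader built as above):

import psutil

for i, batch in enumerate(loader):
    mem = psutil.virtual_memory()
    print(f'batch {i}: RAM {mem.percent:.1f}% used '
          f'({mem.used / 2**30:.1f} GiB of {mem.total / 2**30:.1f} GiB)')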

I have reproduced the bug on two different machines, with different operating systems, GPUs, and hardware, so I don't think this is machine-related.

Using order=OrderOption.RANDOM will attempt to load all the data in memory to do "perfect" shuffling. Using OrderOption.QUASI_RANDOM will keep a much smaller amount of the data in memory while still allowing for shuffling, although it doesn't work (for now) in distributed settings. Does using this other option for the order resolve the issue?
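For reference, the suggested change only touches the order argument. A minimal sketch, reusing the same arguments as the Loader call above (pipelines here is shorthand for the same image pipeline dict, and os_cache=False is an assumption, since QUASI_RANDOM targets datasets that do not fit in RAM):

from ffcv.loader import Loader, OrderOption

loader = Loader(path,
                batch_size=batch_size,
                num_workers=8,
                order=OrderOption.QUASI_RANDOM,  # shuffle within a bounded in-memory window
                pipelines=pipelines,             # same image pipeline dict as above
                os_cache=False)                  # assumed setting for the larger-than-RAM case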

Hi @charlesjhill, thank you for your suggestion, and sorry for the late reply.

I have now run some tests in a new environment with the latest ffcv and torch versions. Surprisingly, I am not able to reproduce the behavior described above.

Here is the code to create beton files (the only difference is the max_resolution argument):

from torch.utils.data import Dataset
from PIL import Image
from ffcv.fields import RGBImageField
from ffcv.writer import DatasetWriter
import pathlib

# custom torch Image Dataset object 
class ImageDataset(Dataset):

    def __init__(self, folder: str):
        """
        :param folder: path to images
        """

        self.samples = sorted(list(pathlib.Path(folder).rglob('*.png')) + list(pathlib.Path(folder).rglob('*.jpg')) +
                              list(pathlib.Path(folder).rglob('*.bmp')) + list(pathlib.Path(folder).rglob('*.JPEG')))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):

        # path to string
        image_path = self.samples[idx].absolute().as_posix()

        image = Image.open(image_path).convert('RGB')
        return (image,)

# main part
final_path = '~/Documents/datasets/imagenet/ffcv/bugged_train.beton'  # final path for beton file
data_folder = '~/Documents/datasets/imagenet/train/'  # path to images 

# create dataset
dataset = ImageDataset(folder=data_folder)

# create writer [VERSION 1]
# writer = DatasetWriter(final_path, {
#     'image': RGBImageField(write_mode='jpg', max_resolution=256),
# }, num_workers=8)

# create writer [VERSION 2]
writer = DatasetWriter(final_path, {
    'image': RGBImageField(write_mode='jpg'),
}, num_workers=8)

writer.from_indexed_dataset(dataset)

Then, I am loading images (testing the two versions):

from ffcv.loader import Loader, OrderOption
from ffcv.transforms import ToTensor, ToDevice, ToTorchImage
from ffcv.fields.rgb_image import CenterCropRGBImageDecoder
import torch
import time


path = '~/Documents/datasets/imagenet/ffcv/bugged_train.beton'  # path for beton file versionx.beton
batch_size = 128  # tried different values
os_cache = False  # tried both
order = OrderOption.RANDOM  # try OrderOption.RANDOM or OrderOption.QUASI_RANDOM
image_size = 256

loader = Loader(path,
                batch_size=batch_size,
                num_workers=8,
                order=order,
                pipelines={
                    'image':
                    [
                        CenterCropRGBImageDecoder((image_size, image_size), ratio=1.),
                        ToTensor(),
                        ToDevice(torch.device(0), non_blocking=True),
                        ToTorchImage(),
                    ]
                },
                os_cache=os_cache)

print(f'Testing with Batch Size = {batch_size}')
print(f'Testing with Order = {order}')

start = time.time()
for i, batch in enumerate(loader):
    
    images = batch[0]
    print(f'{i}: {images.shape}')
    if i == 15:
        break

print(f'Duration: {time.time() - start}')

When running the version with max_resolution=256, this is the output:

Testing with Batch Size = 128
Testing with Order = OrderOption.RANDOM
0: torch.Size([128, 3, 256, 256])
1: torch.Size([128, 3, 256, 256])
2: torch.Size([128, 3, 256, 256])
3: torch.Size([128, 3, 256, 256])
4: torch.Size([128, 3, 256, 256])
5: torch.Size([128, 3, 256, 256])
6: torch.Size([128, 3, 256, 256])
7: torch.Size([128, 3, 256, 256])
8: torch.Size([128, 3, 256, 256])
9: torch.Size([128, 3, 256, 256])
10: torch.Size([128, 3, 256, 256])
11: torch.Size([128, 3, 256, 256])
12: torch.Size([128, 3, 256, 256])
13: torch.Size([128, 3, 256, 256])
14: torch.Size([128, 3, 256, 256])
15: torch.Size([128, 3, 256, 256])
Duration: 11.110697746276855

Previously, running the same code with the "bugged" beton would cause my machine to freeze. Now, it raises a RuntimeError:

Testing with Batch Size = 128
Testing with Order = OrderOption.RANDOM
Traceback (most recent call last):
  File "/.../ffcv_debug/run_loader.py", line 35, in <module>
    for i, batch in enumerate(loader):
                    ^^^^^^^^^^^^^^^^^
  File "/.../python3.11/site-packages/ffcv/loader/loader.py", line 226, in __iter__
    return EpochIterator(self, selected_order)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.../python3.11/site-packages/ffcv/loader/epoch_iterator.py", line 65, in __init__
    self.memory_allocations = self.loader.graph.allocate_memory(
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.../python3.11/site-packages/ffcv/pipeline/graph.py", line 370, in allocate_memory
    allocated_buffer = tuple(
                       ^^^^^^
  File "/.../python3.11/site-packages/ffcv/pipeline/graph.py", line 371, in <genexpr>
    allocate_query(q, batch_size, batches_ahead) for q in memory_allocation
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.../python3.11/site-packages/ffcv/pipeline/allocation_query.py", line 35, in allocate_query
    result = ch.empty(*final_shape,
             ^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [enforce fail at alloc_cpu.cpp:83] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 116988345600 bytes. Error code 12 (Cannot allocate memory)

Apparently, some behavior changed and there is now an extra check that prevents allocating too much RAM. However, I no longer have access to the previous versions of the code, so I can't double-check what specifically changed between the two versions.
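As a rough sanity check on the number in the traceback, the failed allocation is consistent with decode buffers being sized by the largest image stored in the .beton. The sketch below uses assumed constants (the figure of 5 buffered batches is a guess, not a value read from FFCV internals):

failed_alloc = 116_988_345_600      # bytes, from the traceback above

batch_size = 128
buffered_batches = 5                # assumed number of batches kept in flight
bytes_per_sample = failed_alloc / (batch_size * buffered_batches)
side = (bytes_per_sample / 3) ** 0.5   # square RGB image, 1 byte per channel

print(f'~{bytes_per_sample / 2**20:.0f} MiB per sample, '
      f'i.e. roughly a {side:.0f}x{side:.0f} px decode buffer')
# With max_resolution=256 the corresponding buffer is only 256*256*3 bytes
# (~0.2 MiB) per sample, which would explain why version 1 loads fine.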

Anyway, I guess the issue can be closed.