Large .beton files slow down or even freeze learning during loading [possible bug]
SerezD opened this issue
Hello, I have created two different versions of the imagenet .beton file.
The code is the following:
import pathlib

from PIL import Image
from torch.utils.data import Dataset

from ffcv.fields import RGBImageField
from ffcv.writer import DatasetWriter


# custom torch Image Dataset object
class ImageDataset(Dataset):
    def __init__(self, folder: str):
        """
        :param folder: path to images
        """
        # collect every image under `folder`, recursively, in a stable order
        self.samples = sorted(list(pathlib.Path(folder).rglob('*.png')) + list(pathlib.Path(folder).rglob('*.jpg')) +
                              list(pathlib.Path(folder).rglob('*.bmp')) + list(pathlib.Path(folder).rglob('*.JPEG')))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # path to string
        image_path = self.samples[idx].absolute().as_posix()
        image = Image.open(image_path).convert('RGB')
        return (image,)


# main part
final_path = ...  # final path for beton file
data_folder = ...  # path to images

# create dataset
dataset = ImageDataset(folder=data_folder)

# create writer [VERSION 1]
writer = DatasetWriter(final_path, {
    'image': RGBImageField(write_mode='jpg', max_resolution=256),
}, num_workers=8)

# create writer [VERSION 2]
# writer = DatasetWriter(final_path, {
#     'image': RGBImageField(write_mode='jpg'),
# }, num_workers=8)

writer.from_indexed_dataset(dataset)
The only difference between the two versions is the max_resolution parameter.
Both datasets are created correctly: version1.beton is approx. 20 GB, while version2.beton is approx. 80 GB.
At loading time:
from ffcv.loader import Loader, OrderOption
from ffcv.fields.rgb_image import CenterCropRGBImageDecoder
from ffcv.transforms import ToTensor, ToTorchImage

path = ...  # path for beton file versionx.beton
batch_size = ...  # tried different values
os_cache = ...  # tried both
image_size = ...  # crop size; not defined in the original snippet

loader = Loader(path,
                batch_size=batch_size,
                num_workers=8,
                order=OrderOption.RANDOM,
                # pipelines maps each field name to its list of operations
                pipelines={
                    'image': [
                        CenterCropRGBImageDecoder((image_size, image_size), ratio=1),
                        ToTensor(),
                        ToTorchImage()
                    ]
                },
                os_cache=os_cache)
With version 1, everything works fine. Depending on the batch size and os_cache params, training may be faster or slower, but everything seems OK.
With version 2 (80 GB), if I set the batch size as high as possible, training is very slow and may even freeze the machine completely.
By monitoring resources, I noticed high CPU RAM usage (up to 100% right before freezing). I have tried both os_cache = True and os_cache = False; the latter freezes the machine even before training starts. With os_cache = True, a couple of batches are usually loaded and the first steps of the epoch complete before the freeze.
I have reproduced the bug on two different machines, with different operating systems, GPUs, and other hardware, so I don't think this is machine-related.
Using order=OrderOption.RANDOM will attempt to load all the data in memory to do "perfect" shuffling. Using OrderOption.QUASI_RANDOM will keep a much smaller amount of the data in memory while still allowing for shuffling, although it doesn't work (for now) in distributed settings. Does using this other option for the order resolve the issue?
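One way to act on this suggestion is to choose the order from the .beton file size before building the loader. The sketch below is a heuristic, not part of FFCV: the helper names `pick_order_name` and `physical_ram_bytes`, the 50% headroom threshold, and the 64 GB machine are all assumptions made for illustration. It returns the option name as a string so that it runs without FFCV installed.

```python
import os


def pick_order_name(beton_bytes: int, ram_bytes: int, headroom: float = 0.5) -> str:
    """Return 'RANDOM' if the file fits in a fraction of RAM, else 'QUASI_RANDOM'.

    Hypothetical helper: the threshold is an assumption, not an FFCV rule.
    """
    if beton_bytes <= ram_bytes * headroom:
        return "RANDOM"
    return "QUASI_RANDOM"


def physical_ram_bytes() -> int:
    """Total physical RAM on Linux-like systems (os.sysconf is POSIX-only)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


GB = 10**9
# File sizes reported in this thread, checked against a hypothetical 64 GB machine.
print(pick_order_name(20 * GB, 64 * GB))  # version1.beton
print(pick_order_name(80 * GB, 64 * GB))  # version2.beton
```

With FFCV installed, the returned name can be turned into the actual option with `getattr(OrderOption, name)`, and `os.path.getsize(path)` together with `physical_ram_bytes()` can replace the hard-coded sizes.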
Hi @charlesjhill, thank you for your suggestion, and sorry for the late reply.
I have now run some tests in a new environment with the latest ffcv and torch versions. Surprisingly, I am not able to reproduce the behavior described above.
Here is the code to create the beton files (the only difference is the max_resolution argument):
import os
import pathlib

from PIL import Image
from torch.utils.data import Dataset

from ffcv.fields import RGBImageField
from ffcv.writer import DatasetWriter


# custom torch Image Dataset object
class ImageDataset(Dataset):
    def __init__(self, folder: str):
        """
        :param folder: path to images
        """
        # collect every image under `folder`, recursively, in a stable order
        self.samples = sorted(list(pathlib.Path(folder).rglob('*.png')) + list(pathlib.Path(folder).rglob('*.jpg')) +
                              list(pathlib.Path(folder).rglob('*.bmp')) + list(pathlib.Path(folder).rglob('*.JPEG')))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # path to string
        image_path = self.samples[idx].absolute().as_posix()
        image = Image.open(image_path).convert('RGB')
        return (image,)


# main part
# '~' is not expanded automatically, so expand it explicitly
final_path = os.path.expanduser('~/Documents/datasets/imagenet/ffcv/bugged_train.beton')  # final path for beton file
data_folder = os.path.expanduser('~/Documents/datasets/imagenet/train/')  # path to images

# create dataset
dataset = ImageDataset(folder=data_folder)

# create writer [VERSION 1]
# writer = DatasetWriter(final_path, {
#     'image': RGBImageField(write_mode='jpg', max_resolution=256),
# }, num_workers=8)

# create writer [VERSION 2]
writer = DatasetWriter(final_path, {
    'image': RGBImageField(write_mode='jpg'),
}, num_workers=8)

writer.from_indexed_dataset(dataset)
Then, I am loading images (testing the two versions):
import os
import time

import torch

from ffcv.fields.rgb_image import CenterCropRGBImageDecoder
from ffcv.loader import Loader, OrderOption
from ffcv.transforms import ToTensor, ToDevice, ToTorchImage

# '~' is not expanded automatically, so expand it explicitly
path = os.path.expanduser('~/Documents/datasets/imagenet/ffcv/bugged_train.beton')  # path for beton file versionx.beton
batch_size = 128  # tried different values
os_cache = False  # tried both
order = OrderOption.RANDOM  # try OrderOption.RANDOM or OrderOption.QUASI_RANDOM
image_size = 256

loader = Loader(path,
                batch_size=batch_size,
                num_workers=8,
                order=order,
                pipelines={
                    'image': [
                        CenterCropRGBImageDecoder((image_size, image_size), ratio=1.),
                        ToTensor(),
                        ToDevice(torch.device(0), non_blocking=True),
                        ToTorchImage(),
                    ]
                },
                os_cache=os_cache)

print(f'Testing with Batch Size = {batch_size}')
print(f'Testing with Order = {order}')

# time the first 16 batches only
start = time.time()
for i, batch in enumerate(loader):
    images = batch[0]
    print(f'{i}: {images.shape}')
    if i == 15:
        break
print(f'Duration: {time.time() - start}')
When running the version with max_resolution=256, this is the output:
Testing with Batch Size = 128
Testing with Order = OrderOption.RANDOM
0: torch.Size([128, 3, 256, 256])
1: torch.Size([128, 3, 256, 256])
2: torch.Size([128, 3, 256, 256])
3: torch.Size([128, 3, 256, 256])
4: torch.Size([128, 3, 256, 256])
5: torch.Size([128, 3, 256, 256])
6: torch.Size([128, 3, 256, 256])
7: torch.Size([128, 3, 256, 256])
8: torch.Size([128, 3, 256, 256])
9: torch.Size([128, 3, 256, 256])
10: torch.Size([128, 3, 256, 256])
11: torch.Size([128, 3, 256, 256])
12: torch.Size([128, 3, 256, 256])
13: torch.Size([128, 3, 256, 256])
14: torch.Size([128, 3, 256, 256])
15: torch.Size([128, 3, 256, 256])
Duration: 11.110697746276855
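For reference, the duration above can be converted into a rough throughput figure. This is a minimal sketch; the helper name `images_per_second` is introduced here for illustration. The timer starts before the first iteration, so the figure probably includes one-time loader setup and understates steady-state speed.

```python
def images_per_second(num_batches: int, batch_size: int, seconds: float) -> float:
    """Average number of images delivered per second over the timed window."""
    return num_batches * batch_size / seconds


# The loop breaks after i == 15, so 16 batches of 128 images were loaded.
rate = images_per_second(16, 128, 11.110697746276855)
print(f"{rate:.1f} images/s")
```

That works out to roughly 184 images per second for this short run.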
Previously, running the same code with the "bugged beton" would cause my machine to freeze. Now it raises a RuntimeError:
Testing with Batch Size = 128
Testing with Order = OrderOption.RANDOM
Traceback (most recent call last):
File "/.../ffcv_debug/run_loader.py", line 35, in <module>
for i, batch in enumerate(loader):
^^^^^^^^^^^^^^^^^
File "/.../python3.11/site-packages/ffcv/loader/loader.py", line 226, in __iter__
return EpochIterator(self, selected_order)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.../python3.11/site-packages/ffcv/loader/epoch_iterator.py", line 65, in __init__
self.memory_allocations = self.loader.graph.allocate_memory(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.../python3.11/site-packages/ffcv/pipeline/graph.py", line 370, in allocate_memory
allocated_buffer = tuple(
^^^^^^
File "/.../python3.11/site-packages/ffcv/pipeline/graph.py", line 371, in <genexpr>
allocate_query(q, batch_size, batches_ahead) for q in memory_allocation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.../python3.11/site-packages/ffcv/pipeline/allocation_query.py", line 35, in allocate_query
result = ch.empty(*final_shape,
^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [enforce fail at alloc_cpu.cpp:83] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 116988345600 bytes. Error code 12 (Cannot allocate memory)
Apparently, some behavior changed, and there is now an extra check that prevents allocating too much RAM. However, I no longer have access to the previous versions of the code, so I can't double-check what specifically changed between the two versions.
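To put the number in the error message into perspective, here is a small arithmetic sketch. It assumes (this is not confirmed anywhere in this thread) that the loader preallocates uint8 RGB decode buffers sized by the largest image height and width stored in the file, for several batches at once. The 5 batch slots and the 6530 x 9331 dimensions are inferred only because they reproduce the byte count exactly; `buffer_bytes` is a hypothetical helper.

```python
REQUESTED = 116_988_345_600  # bytes, taken from the RuntimeError message
BATCH_SIZE = 128


def buffer_bytes(batch_size: int, slots: int, height: int, width: int, channels: int = 3) -> int:
    """Bytes needed for `slots` batches of uint8 images of shape (height, width, channels)."""
    return batch_size * slots * height * width * channels


print(f"requested: {REQUESTED / 2**30:.1f} GiB")
print(f"per batch element: {REQUESTED // BATCH_SIZE / 1e6:.1f} MB")

# One factorization that reproduces the number exactly (inferred, not confirmed):
# 5 batch slots of 6530 x 9331 RGB buffers.
assert buffer_bytes(BATCH_SIZE, 5, 6530, 9331) == REQUESTED

# The same layout with both sides capped at 256 pixels needs far less memory.
print(f"capped at 256: {buffer_bytes(BATCH_SIZE, 5, 256, 256) / 2**20:.0f} MiB")
```

Under these assumptions, the request is about 109 GiB for the uncapped file versus about 120 MiB when images are capped at 256 pixels per side, which would explain why only the version written without max_resolution runs out of memory.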
Anyway, I guess the issue can be closed.