Possible out-of-memory issue of dataloader

Question

zhiqiangdon opened this issue 2 years ago

Hello,

I have read through your code but haven't run it yet. I have one question about the dataloader implementation. According to

https://github.com/dandelin/ViLT/blob/master/vilt/datasets/base_dataset.py#L43

You load all the arrow files into memory. The pre-training data amount to hundreds of gigabytes. Could this cause an out-of-memory issue, or does this implementation assume a machine with a large amount of memory?

Thanks,

Wonjae Kim · Answer 1 · Mon Nov 22 2021 22:12:02 GMT+0800 (China Standard Time)

Hi @zhiqiangdon,

Apache Arrow's read_all() function actually loads the data lazily, so there will be no OOM issue.
However, if you call the .to_pandas() method on the whole table, Arrow will load the dataset eagerly and you will run into the OOM issue.

Zhiqiang Tang · Answer 2 · Tue Nov 23 2021 03:42:56 GMT+0800 (China Standard Time)

Thanks @dandelin,

I see that you call .to_pandas() on the text column:
https://github.com/dandelin/ViLT/blob/master/vilt/datasets/base_dataset.py#L58
I guess this operation doesn't load the image data, right?

Wonjae Kim · Answer 3 · Tue Nov 23 2021 19:57:20 GMT+0800 (China Standard Time)

@zhiqiangdon

Yep, you are right. Arrow is a columnar format, so the data is loaded column by column.

Zhiqiang Tang · Answer 4 · Wed Nov 24 2021 09:13:15 GMT+0800 (China Standard Time)

Thanks @dandelin!