dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"


Possible out-of-memory issue of dataloader

zhiqiangdon opened this issue

Hello,

I have read through your code but haven't run it yet. I have one question about the dataloader implementation. According to

https://github.com/dandelin/ViLT/blob/master/vilt/datasets/base_dataset.py#L43

You load all the Arrow files into memory. The pre-training data takes up hundreds of gigabytes. Could this cause an out-of-memory issue, or does the implementation assume a machine with a large amount of memory?

Thanks,

Hi @zhiqiangdon,

Apache Arrow's read_all() function actually loads the data lazily, so there will be no OOM issue.
However, if you call the .to_pandas() method, Arrow will load the dataset eagerly and you will run into the OOM issue.

Thanks @dandelin,

I see that you call .to_pandas() on the text column:
https://github.com/dandelin/ViLT/blob/master/vilt/datasets/base_dataset.py#L58
I guess this operation doesn't load the image data, right?

@zhiqiangdon

Yep, you are right. Arrow uses a columnar layout, so the data is loaded column by column.