COCO-style DataLoader

Question

afiaka87 opened this issue 3 years ago

I would love to start training with this! I helped write a DataLoader for the "COCO" format, i.e., a folder of images and text files, where each text file holds line-separated captions. The data loader matches each image to its caption file by their shared, unique basename.

https://github.com/lucidrains/DALLE-pytorch/blob/main/dalle_pytorch/loader.py
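
For readers who have not seen the format, here is a minimal sketch of the basename matching described above. It is not the code from the linked file: the class name `CaptionedImageFolder`, the `IMAGE_SUFFIXES` set, and the `shuffle_captions` flag are assumptions made for illustration. It returns a raw caption string and an image path rather than tokenized text and an image tensor.

```python
import random
from pathlib import Path

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


class CaptionedImageFolder:
    """Pairs each image with the .txt caption file that shares its basename."""

    def __init__(self, folder, shuffle_captions=True):
        root = Path(folder)
        # Index every file by its stem (the filename without its extension).
        self.images = {
            p.stem: p for p in root.rglob("*") if p.suffix.lower() in IMAGE_SUFFIXES
        }
        self.texts = {p.stem: p for p in root.rglob("*.txt")}
        # Keep only basenames that have both an image and a caption file.
        self.keys = sorted(self.images.keys() & self.texts.keys())
        self.shuffle_captions = shuffle_captions

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, index):
        key = self.keys[index]
        raw = self.texts[key].read_text(encoding="utf-8")
        # One caption per non-empty line.
        captions = [line.strip() for line in raw.splitlines() if line.strip()]
        if not captions:
            raise ValueError(f"no captions found in {self.texts[key]}")
        caption = random.choice(captions) if self.shuffle_captions else captions[0]
        return caption, self.images[key]
```

Files without a partner (an image with no `.txt`, or the reverse) are skipped silently in this sketch; a real loader might prefer to log them.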

Would it be possible to port that data loader to this project? It is perhaps of interest to some folks I know with some spare compute. Also personally useful to me, because I have converted a good deal of my collected datasets to this format already.

Thanks!

Cade Gordon · Answer 1 · Tue May 18 2021 21:35:02 GMT+0800 (China Standard Time)

That would be amazing! Data is, quite obviously, a weak point at the moment. That code should be immediately compatible if you swap the order of images and text in the return statement.
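
If editing the return statement in place is not convenient, the same swap can be done from the outside. The wrapper below is only a sketch: `SwapPairOrder` is a hypothetical name, and it assumes the wrapped dataset yields `(text, image)` pairs in that order.

```python
class SwapPairOrder:
    """Wraps a (text, image) dataset so it yields (image, text) instead."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        text, image = self.dataset[index]
        return image, text


# Stand-in data: any indexable object that returns (text, image) pairs works.
pairs = [("a cat", "cat.jpg"), ("a dog", "dog.jpg")]
swapped = SwapPairOrder(pairs)
print(swapped[0])  # ('cat.jpg', 'a cat')
```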

Would you be interested in sending a PR?

Cade Gordon · Answer 2 · Fri May 21 2021 14:21:49 GMT+0800 (China Standard Time)

I've written up a DataModule based on the file you linked and made sure to attribute the original repo. Feel free to message me if you would like any updates to the citation!

Clay Mullis · Answer 3 · Sat May 22 2021 04:16:29 GMT+0800 (China Standard Time)

Wow, thanks! These links will need to expire in a few days, but here are many of the datasets I have in that format:

They are all resized to 256 px.

wget https://www.dropbox.com/s/p0qwhefid4p8q0u/blog_captions.tar.gz # OpenAI's `DALL-E` Blog Post - All 1.1 million generations scraped.
wget https://www.dropbox.com/s/a4jx0pe6oc1e5r7/openai_dalle_gen.tar.gz # 1.1 million image-text pairs

# 500k image-text pairs
wget https://www.dropbox.com/s/8rue7r9ppds3jyk/open_images_localized_annotations.tar.gz

# 100k image-text pairs
wget https://www.dropbox.com/s/pvqwvj8evd5z3so/virtual_genome.tar.gz

# 200k image-text pairs
wget https://www.dropbox.com/s/txuzmca8ugk9uoe/coco2017.tar.gz

# 70k image-text pairs
wget https://www.dropbox.com/s/2fo9gipvxys5ys0/food101.tar.gz

# this isn't _all_ of conceptual_12m, but a good chunk of it. something like 4 million image-text pairs?
wget https://www.dropbox.com/s/zgkknj9feh65py0/conceptual_captions_train_256.zip

@Zasder3 Let me know if you need anything else - I'm a bit busy at the moment but I'll be able to test the data loader tomorrow.

Clay Mullis · Answer 4 · Sun May 23 2021 09:06:08 GMT+0800 (China Standard Time)

@Zasder3 - Unfortunately I've had issues with Dropbox limiting my ability to download my own files if they get downloaded too often. Do you intend to download these yourself?

Cade Gordon · Answer 5 · Sun May 23 2021 10:49:40 GMT+0800 (China Standard Time)

@afiaka87 Oh sorry I didn't mean to mess with your Dropbox usage abilities! I'm currently downloading one of them but that will conclude in 30 minutes. If you are interested in having these remain public I could help work out a solution? Personally, I know archive.org is a good place for larger files for free.

Clay Mullis · Answer 6 · Sun May 23 2021 10:57:25 GMT+0800 (China Standard Time)

Yeah, that would be very useful to me! And download all of them if you think it'll speed up development! You didn't do anything wrong; I just wanted to make sure you get what you need before the links expire in a few days.

BTW - if you're looking for inspiration, https://github.com/CompVis/taming-transformers is one of the best pytorch-lightning codebases I've seen and ships with a bunch of DataLoaders for segmentation, pose, classes, etc.

Clay Mullis · Answer 7 · Sun May 23 2021 10:58:15 GMT+0800 (China Standard Time)

How do you go about submitting an upload to archive.org? I can get started on that myself if possible.

Clay Mullis · Answer 8 · Sun May 23 2021 11:02:30 GMT+0800 (China Standard Time)

Also - lucidrains has an implementation of CLIP as well, in the github.com/lucidrains/DALLE-pytorch repo.

I'm pretty sure most people don't even know it's there... I tried training with it once but received an error. Phil (lucidrains) is a super smart guy and definitely has more insight into various state-of-the-art ML architectures than I do, though. Might be worth looking into.

Cade Gordon · Answer 9 · Sun May 23 2021 13:13:17 GMT+0800 (China Standard Time)

@afiaka87 These are super helpful resources, thanks a ton. I now kind of realize why I didn't go too far with archive.org: they had super slow download speeds when I was using them with Colab. It still works, however. If you create an account and click on your profile in the upper right-hand corner, an option for uploading is available in the dropdown menu.