What is the suggested way to download images

Question

srg9000 opened this issue 2 years ago · comments

What is the suggested way to download the images? Are HTTP requests OK, or is there an image dump (like the one for Wikipedia text)? The data dump site mentions that mirror sites hold the dumps, but those are from 2012-2013.
The Kaggle competition has 5 TSV files, whereas the repo mentions 10. Are the ones on Kaggle just part of the full data, or is it restructured data? If it is restructured, there is a Google Cloud Storage URL which contains ~276 GB of data. Is that the complete dump?

Lena · Answer 1 · Thu May 05 2022 07:42:05 GMT+0800 (China Standard Time)

I have the same question as the one above and am looking forward to a response. :)

Krishna Srinivasan · Answer 2 · Thu May 05 2022 07:59:01 GMT+0800 (China Standard Time)

Greetings. Thanks for your interest. A large number of images (greater than 60%, I would guess) are available via the Kaggle competition we conducted recently.

Please kindly check this site for details:
https://www.kaggle.com/c/wikipedia-image-caption

Unfortunately, due to licensing or other such issues, we cannot provide these images directly. The Kaggle competition downloads above should help alleviate the concern. The rest of the images can be fetched from the Wikipedia site, but please do so in a reasonable and responsible way.

To answer the part about the difference: the WIT dataset linked from this site is the full dataset (37+ million rows of text data with image URLs and metadata). The Kaggle competition used a subset of this and makes it available via the Kaggle website. If you take all the rows and deduplicate them by image URL, you will end up with roughly 11M unique images. That is the total I alluded to above: of these, more than 60% are available via the Kaggle site itself.
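The deduplication by image URL described above could be sketched roughly as follows. This is only an illustration, not the maintainers' tooling: it assumes the TSV files have a header row with a column named `image_url`, and the function name `unique_image_urls` is made up for this example.

```python
import csv
from typing import Iterable, Iterator


def unique_image_urls(
    tsv_lines: Iterable[str], url_column: str = "image_url"
) -> Iterator[str]:
    """Yield each image URL once, in first-seen order.

    Assumption: the input is tab-separated with a header row that
    contains `url_column` (a guess at the WIT column name).
    """
    seen = set()
    # QUOTE_NONE: treat quote characters in captions as ordinary text.
    reader = csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        url = row.get(url_column)
        if url and url not in seen:
            seen.add(url)
            yield url
```

To run it over a downloaded file, something like `with gzip.open(path, "rt", encoding="utf-8", newline="") as f: urls = list(unique_image_urls(f))` should work, where `path` is whichever compressed TSV you fetched. If a row contains a very long text field, you may need to raise the limit with `csv.field_size_limit`.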

Hope that helps.

Regards,
Krishna.

Hangyu Zhou · Answer 3 · Thu Nov 09 2023 13:29:55 GMT+0800 (China Standard Time)

Hi, just a quick question: the WIT images hosted on the Kaggle competition all have a width of 300px. However, when I download the same images from their URLs and resize them to the same shape using PIL, they are still slightly different from the Kaggle WIT images. The Kaggle ones are sharper and have more texture. I also ran a cosine similarity test between them using the CLIP image encoder, and the similarity is around 96-97%. Could you provide more details on how the Kaggle WIT images are preprocessed, so that my manually downloaded images will be consistent with the Kaggle ones? Thanks.

dinhanhx · Answer 4 · Sat Nov 11 2023 19:20:27 GMT+0800 (China Standard Time)

It's okay to download the images with any tool, as long as you set a proper User-Agent that follows the Wikimedia policy: https://meta.wikimedia.org/wiki/User-Agent_policy
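For example, here is a minimal sketch of a sequential downloader that sends an explicit User-Agent header, using only the Python standard library. The User-Agent string, contact details, function names, and the one-second delay are all placeholders chosen for illustration, not values required by the policy; replace them with your own.

```python
import time
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

# Placeholder: identify your tool and give a way to contact you.
USER_AGENT = "wit-image-fetcher/0.1 (https://example.org/contact; you@example.org)"


def build_request(url: str, user_agent: str = USER_AGENT) -> urllib.request.Request:
    """Build a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download_images(urls, out_dir: str, delay_seconds: float = 1.0) -> None:
    """Fetch URLs one at a time, pausing between requests (assumed delay)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, url in enumerate(urls):
        # Keep the original file extension, name files by position.
        suffix = Path(urlparse(url).path).suffix
        with urllib.request.urlopen(build_request(url), timeout=30) as resp:
            (out / f"{i:08d}{suffix}").write_bytes(resp.read())
        time.sleep(delay_seconds)
```

Calling `download_images(urls, "wit_images")` would then fetch the files one by one. Keeping the requests sequential with a pause in between is one simple way to stay within the "reasonable and responsible" use requested earlier in this thread.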