google-research-datasets / wit

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

Home Page: https://github.com/google-research-datasets/wit

What is the suggested way to download images?

srg9000 opened this issue · comments

What is the suggested way to download the images? Are HTTP requests OK, or is there an image dump (like the one for Wikipedia text)? The data dump site mentions that mirror sites hold the dumps, but those are from 2012-2013.
The Kaggle competition has 5 TSV files, whereas the repo mentions 10. Are the ones on Kaggle just part of the full data, or is it restructured data? In that case, there is a Google Cloud Storage URL which contains ~276 GB of data. Is that the complete dump?

commented

Having the same question as mentioned above; looking forward to a response. :)

Greetings. Thanks for your interest. A large number of images (greater than 60%, I would guess) are available via the Kaggle competition we conducted recently.

Please kindly check this site for details:
https://www.kaggle.com/c/wikipedia-image-caption

Unfortunately, due to licensing or other such issues, we cannot directly provide these images. The Kaggle competition downloads above should help alleviate the concern. The rest of the images can be fetched from the Wikipedia site, but in a reasonable and responsible way.

To answer the part about the difference: the WIT dataset linked from this site is the full dataset (37+ million rows of text data with image URLs and metadata). The Kaggle competition used a subset of this and makes it available via the Kaggle website. If you take all the rows and deduplicate them by image URL, you will end up with roughly 11M unique images. (That is the total I alluded to above: of this, about 60+% is available via the Kaggle site itself.)

Hope that helps.

Regards,
Krishna.
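
The deduplication described in the reply above (take all rows, keep one entry per image URL) can be sketched roughly as follows. This is a minimal illustration rather than the maintainers' own script: it assumes the shards are gzipped TSV files with a header row and that the URL column is named `image_url`; the function name `unique_image_urls` is made up for this example.

```python
import csv
import gzip
from typing import Iterable, Set


def unique_image_urls(paths: Iterable[str], column: str = "image_url") -> Set[str]:
    """Collect the distinct values of `column` across gzipped TSV shards."""
    # Some text fields can be very long; raise the csv field limit so that
    # long rows do not abort the scan.
    csv.field_size_limit(2**31 - 1)
    seen: Set[str] = set()
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
            # Assumption: tab-separated with a header row naming each column.
            reader = csv.DictReader(handle, delimiter="\t")
            for row in reader:
                url = row.get(column)
                if url:
                    seen.add(url)
    return seen
```

Calling `len(unique_image_urls(shard_paths))` on the full set of shards should then give the unique-image count; holding every URL in a set is simple but memory-hungry at this scale, so a hashed or on-disk variant may be preferable.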

Hi, just a quick question: the WIT images hosted on the Kaggle competition are all 300px wide. However, when I download the same images from their URLs and resize them to the same shape using PIL, they are still slightly different from the Kaggle WIT images. The Kaggle ones are sharper and have more texture. I also ran a cosine similarity test between them using the CLIP image encoder, and the similarity is around 96-97%. Could you provide more details on how the Kaggle WIT images are preprocessed, so that my manually downloaded images will be consistent with the Kaggle ones? Thanks.

It's okay to download the images with any tool, as long as you set the User-Agent header carefully, per the Wikimedia User-Agent policy: https://meta.wikimedia.org/wiki/User-Agent_policy
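
As a rough sketch of what that could look like with only the Python standard library: the snippet below sends an explicit `User-Agent` header and pauses between requests. The client name, contact details, the one-second delay, and the helper names (`build_request`, `local_name`, `download_images`) are all placeholders chosen for this example, not values prescribed by Wikimedia or the WIT maintainers; check the policy page for the current requirements.

```python
import hashlib
import os
import time
import urllib.parse
import urllib.request

# Hypothetical client name and contact details; replace with your own.
USER_AGENT = "wit-image-fetcher/0.1 (https://example.org/wit-fetcher; contact@example.org)"


def build_request(url: str, user_agent: str = USER_AGENT) -> urllib.request.Request:
    """Create a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def local_name(url: str) -> str:
    """Derive a file name from a hash of the URL, keeping its extension."""
    extension = os.path.splitext(urllib.parse.urlparse(url).path)[1]
    return hashlib.sha1(url.encode("utf-8")).hexdigest() + extension


def download_images(urls, dest_dir: str, delay_seconds: float = 1.0) -> None:
    """Fetch URLs one at a time, pausing between requests to limit server load."""
    os.makedirs(dest_dir, exist_ok=True)
    for url in urls:
        target = os.path.join(dest_dir, local_name(url))
        if os.path.exists(target):
            continue  # already downloaded in an earlier run
        with urllib.request.urlopen(build_request(url), timeout=30) as response:
            data = response.read()
        with open(target, "wb") as out:
            out.write(data)
        time.sleep(delay_seconds)  # assumed polite delay; tune as appropriate
```

Sequential requests with a fixed delay are slow for millions of images, but they keep the example simple and err on the side of being gentle with the servers.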