lijiunderstand / YFCC15M_downloader

A subset of YFCC100M. Tools, checking scripts, and web drive links for downloading the dataset (uncompressed).


YFCC15M_downloader

A subset of YFCC100M. Tools, checking scripts, and web drive links for downloading the dataset.


We follow the dataset preparation process of DeCLIP.

  1. First, download DeCLIP's YFCC15M label file 'yfcc15m_clean_open_data.json' from Google Drive.

  2. Extract the URLs from the JSON file and split them into several URL list files for download using split_download_task.py.

  3. Crawl the images directly from the URLs using auto_download.bat (we use Wget here, so you may need to install it). The bat file is for Windows; on Linux you will need to rewrite it as a shell script. Or simply download from the links below!

    • You can stop the process and resume it later if something goes wrong. Wget will skip the already downloaded files and clean the log files.
    • Errors are recorded in the log files. Before restarting the download, it is recommended to run clean_err_file_from_logs.py to filter out and delete the failed files.
  4. Check the downloaded images using check_images.py.
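
For Linux users, steps 2 and 3 can be sketched in Python roughly as follows. This is a minimal sketch and not the code of split_download_task.py or auto_download.bat: the function names, the `urls_000.txt` naming scheme, and the `url` key are all hypothetical, and the real layout of the JSON label file may differ. Only standard Wget options are used (`-i`, `-P`, `-nc`, `-a`).

```python
import json
from pathlib import Path


def extract_urls(label_file, url_key="url"):
    """Collect image URLs from the label file.

    Assumption: the JSON is a list of records (or a dict whose values are
    records) that each store the image URL under `url_key`. Adjust this to
    the real layout of the label file.
    """
    with open(label_file, encoding="utf-8") as f:
        data = json.load(f)
    records = data.values() if isinstance(data, dict) else data
    return [rec[url_key] for rec in records if url_key in rec]


def split_into_lists(urls, n_parts, out_dir):
    """Write `urls` into `n_parts` text files, one URL per line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(n_parts):
        path = out_dir / f"urls_{i:03d}.txt"
        # Striding keeps the parts within one URL of each other in size.
        path.write_text("\n".join(urls[i::n_parts]) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


def wget_command(url_list, image_dir, log_file):
    """Build the Wget argument list for one URL list file."""
    return [
        "wget",
        "-i", str(url_list),   # read URLs from the list file
        "-P", str(image_dir),  # save downloads under this directory
        "-nc",                 # no-clobber: skip files that already exist
        "-a", str(log_file),   # append Wget messages to the log file
    ]
```

Each command returned by `wget_command` can be passed to `subprocess.run`, one per URL list. Because of `-nc`, rerunning the same command after an interruption skips images that already exist on disk.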


Dataset info:

  • The dataset should contain 15,388,848 images.
  • We managed to crawl 15,061,747 of them.
  • Total space occupied: 867.73G.
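
As a quick sanity check on the numbers above, the share of images that were successfully crawled can be computed directly from the two counts. The variable names are only illustrative.

```python
expected = 15_388_848  # images listed for the dataset
crawled = 15_061_747   # images actually downloaded

missing = expected - crawled
coverage = crawled / expected

print(f"missing: {missing:,}")      # missing: 327,101
print(f"coverage: {coverage:.2%}")  # coverage: 97.87%
```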

Web Drive links:


If a link fails, please leave a message in the issues.


