GeorgeBatch / TCGA-lung-download-GD

Scripts for downloading WSI-TCGA-LUNG from Google Drive using gdown


Download TCGA-lung files and Annotations from Google Drive

Source: the WSI-TCGA-Lung Google Drive folder provided by Bin Li in the "DSMIL: Multiple instance learning networks for tumor detection in Whole Slide Image" repository.

This repository relies on the gdown library, which can be used either from the command line or from within Python scripts/notebooks.

Important: gdown cannot download folders that contain more than 50 files. The first step is therefore to collect the links (ids) of the individual subfolders, each of which has fewer than 50 files, and then iterate over them.
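A minimal sketch of the iteration described above, not the code in download_scripts/. The function name download_subfolders and the (name, folder_id) pair format are hypothetical. The download callable is passed in so the loop can be tried without network access; in real use it would presumably be gdown.download_folder, which I assume accepts id=, output= and quiet= keywords and returns a list of paths (or None on failure).

```python
from pathlib import Path


def download_subfolders(subfolders, out_root, download_folder):
    """Download each (name, folder_id) pair into out_root/name.

    download_folder is injected by the caller (assumed to behave like
    gdown.download_folder). Returns a dict: subfolder name -> bool.
    """
    status = {}
    for name, folder_id in subfolders:
        target = Path(out_root) / name
        target.mkdir(parents=True, exist_ok=True)
        # Assumption: a falsy return value (None or empty list) means failure.
        files = download_folder(id=folder_id, output=str(target), quiet=True)
        status[name] = bool(files)
    return status


# Hypothetical usage (requires gdown and network access):
#   import gdown
#   download_subfolders([("subfolder_a", "<folder-id>")], "data",
#                       gdown.download_folder)
```

Because each subfolder is requested separately, every call stays under the 50-file limit.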

1. Create csv files with names and links

Credit: this part is adapted from the solution provided to this question on Stack Exchange.

Use Google Sheets functionality to get all the names and links into a Google Sheet, then download the sheets as separate csv files. This folder contains my Google Sheet. The script.txt file does not need to be there; it only records the code that was used.

Save the 3 sheets from the Google Sheets file as separate csv files in names_and_links.
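Once the csv files exist, the folder ids have to be pulled out of the links. The sketch below shows one way to do that. The column names "name" and "link" are assumptions (the real csv headers may differ), and the regular expression only handles links of the form .../folders/<id>.

```python
import csv
import re

# Matches the id segment of a Drive folder link such as
# https://drive.google.com/drive/folders/<id>?usp=sharing
_FOLDER_ID = re.compile(r"/folders/([A-Za-z0-9_-]+)")


def read_names_and_ids(csv_file, name_col="name", link_col="link"):
    """Return a list of (name, folder_id) pairs from an open csv file.

    name_col and link_col are hypothetical header names.
    """
    pairs = []
    for row in csv.DictReader(csv_file):
        match = _FOLDER_ID.search(row[link_col])
        if match:  # skip rows whose link is not a folder link
            pairs.append((row[name_col], match.group(1)))
    return pairs
```

Call it with an open file handle, for example `with open(path, newline="") as f: read_names_and_ids(f)`.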

2. Create files to record which of the files have already been downloaded and download all files

cd download_scripts/

# keep a record so that if the download is interrupted,
# there is a way to resume the download instead of
# starting from scratch
python create_record_files.py

# IDs and annotations
python download_ids.py

# corrupt files and annotations
python download_corrupt_WSIs.py

# good files and annotations
python download_good_WSIs.py
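The record-keeping idea behind these scripts can be sketched as follows. This is an illustration under stated assumptions, not the contents of create_record_files.py: the JSON record format and the names load_record, run_with_record and download_one are all hypothetical.

```python
import json
from pathlib import Path


def load_record(record_path):
    """Return the saved {name: done} mapping, or an empty one."""
    path = Path(record_path)
    return json.loads(path.read_text()) if path.exists() else {}


def run_with_record(names, record_path, download_one):
    """Download every name not yet marked done, saving after each one.

    download_one(name) is a caller-supplied callable that returns a
    truthy value on success.
    """
    record = load_record(record_path)
    for name in names:
        if record.get(name):
            continue  # already downloaded in an earlier run
        record[name] = bool(download_one(name))
        # Write after every item so an interruption loses at most one step.
        Path(record_path).write_text(json.dumps(record, indent=2))
    return record
```

Running the function a second time retries only the entries that are still marked as not downloaded.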

Potential problems

Google Drive might stop giving access to files when many files from the same Google Drive folder are requested programmatically (this is my guess at the cause). When this happens, the script does not stop. Instead, it gives warnings, which I was not able to catch with a try-except construct in Python. As a result, some downloaded folders may be missing files.

Access denied with the following error:

 	Too many users have viewed or downloaded this file recently. Please
	try accessing the file again later. If the file you are trying to
	access is particularly large or is shared with many people, it may
	take up to 24 hours to be able to view or download the file. If you
	still can't access a file after 24 hours, contact your domain
	administrator.

You may still be able to access the file from the browser:
%%%URLHERE

This is an open issue with the gdown library. The problem seems to be on Google's side.
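Since a failed request may pass silently, one library-independent safeguard is to compare what is on disk with what was expected. The sketch below assumes the expected file names are available from somewhere (for example, the csv files). The helper name missing_files is hypothetical, and treating zero-byte files as missing is a design choice of this sketch.

```python
from pathlib import Path


def missing_files(folder, expected_names):
    """Return the expected names that are absent (or empty) in folder."""
    present = {
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.stat().st_size > 0  # ignore empty stubs
    }
    return sorted(set(expected_names) - present)
```

An empty result means every expected file is present; anything else lists the files to request again later.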

TODO:

  1. Catch warnings and interrupt execution, or at least do not update the download status in the record files, so that an incomplete download is not marked complete.
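One possible approach to this TODO, sketched under assumptions: Python's warnings module can escalate warnings to exceptions, which try-except can then catch. If the library prints its message to stderr instead of going through the warnings module, the escalation does nothing, so the sketch also treats a falsy return value as failure. The wrapper name download_succeeded is hypothetical.

```python
import warnings


def download_succeeded(download_fn, *args, **kwargs):
    """Call download_fn and report success as a bool.

    Returns False if the call raises, emits a Python warning,
    or returns a falsy value such as None.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("error")  # escalate warnings to exceptions
        try:
            result = download_fn(*args, **kwargs)
        except Exception as exc:  # Warning subclasses Exception
            print(f"download failed: {exc!r}")
            return False
    return bool(result)
```

The record files would then be updated only when this wrapper returns True.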
