royerlab / cytoself

Self-supervised models for encoding protein localization patterns from microscopy images



Dataloader for full dataset

sofroniewn opened this issue · comments

Great work here @li-li-github! I'm Nick from the CZI SciTech team and am interested in retraining cytoself on the full dataset. I noticed that right now, for the DataManagerOpenCell, you don't have a convenience function for downloading the full dataset (like DataManagerOpenCell.download_sample_data) or for then creating a dataloader (like datamanager.const_dataloader).

Downloading the full data is very easy, but the format is then slightly different so I can't just use datamanager.const_dataloader out of the box.

I am thinking about expanding that method so it can handle the full data. Would you be interested in having that be a PR to this repo, or do you already have an alternative recommended way to deal with the full data?

Looking at this more, I guess it might not be a good idea to adapt that class, as it uses the PreloadedDataset and I'm not sure most machines will have enough RAM to load all the images. I could imagine a system that lazily loads each image before passing it to the GPU, but for that I might want to convert the images to a zarr. Is that the approach you took, or do you have a different one you'd recommend? cc @royerloic

Thank you @sofroniewn for the suggestion. I agree that it would be convenient to have a function to download the full dataset. Since the full dataset is so big, I don't know what a good way to programmatically download it would be (e.g., do I need to check the available space? How do I resume downloading after an interruption?).

I might want to convert the images to a zarr.

That would be a good idea. One thing to note is that the reconstruction loss in VQ-VAE needs to be divided by the variance of the whole dataset. I personally don't think it would influence training in practice even if you didn't divide the loss by the variance, because the variance is just a constant, but it's required by the theory. So, to be theoretically accurate, you would have to compute the variance one way or another. I hope zarr can help calculate the variance without occupying much memory.
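As a minimal sketch (not code from the repo; the store path and chunking are assumptions), the variance could be accumulated chunk by chunk from a zarr array so that only one chunk is ever resident in memory, and the resulting constant then used to scale the reconstruction term:

import numpy as np
import zarr

# Hypothetical zarr store holding all images, chunked along the first axis.
images = zarr.open('cytoself_images.zarr', mode='r')

# Accumulate the sum and sum of squares one chunk at a time.
n = 0
total = 0.0
total_sq = 0.0
step = images.chunks[0]
for start in range(0, images.shape[0], step):
    block = np.asarray(images[start:start + step], dtype=np.float64)
    n += block.size
    total += block.sum()
    total_sq += (block ** 2).sum()

mean = total / n
variance = total_sq / n - mean ** 2  # variance of the whole dataset

# The VQ-VAE reconstruction term is then divided by this constant,
# e.g. mse_loss(recon, x) / variance.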

Since the full dataset is so big, I don't know what a good way to programmatically download it would be (e.g., do I need to check the available space? How do I resume downloading after an interruption?).

You could just include an example script if you don't want to make a full API. I did this and it worked fine

import gdown

# Destination directory for the downloaded files (adjust to your machine).
BASE_PATH = '/home/ec2-user/cytoself-data/'

# Google Drive share links for the full-dataset image arrays and label tables.
data_links = {
   'Image_data00.npy' : 'https://drive.google.com/file/d/15_CHBPT-p5JG44acP6D2hKd8jAacZatp/view?usp=sharing',
   'Image_data01.npy' : 'https://drive.google.com/file/d/1m7Cj2OALiZTIiHpvb9zFPG_I3j1wRnzK/view?usp=sharing',
   'Image_data02.npy' : 'https://drive.google.com/file/d/17nknzqlcYO3n9bAe4FwGVPkU-mJAhQ4j/view?usp=sharing',
   'Image_data03.npy' : 'https://drive.google.com/file/d/1vEsddF68dyOda-hwI-ptAL4vShBGl98Y/view?usp=sharing',
   'Image_data04.npy' : 'https://drive.google.com/file/d/1aB7WaRuhobG_IDl0l_PPeSJAxCYy-Pye/view?usp=sharing',
   'Image_data05.npy' : 'https://drive.google.com/file/d/1qb0waKcLprDtuFAdCec3WegWkmd-U45A/view?usp=sharing',
   'Image_data06.npy' : 'https://drive.google.com/file/d/1y-1vlfZ4eNhvTvpuqTZVL8DvSwYX3CH_/view?usp=sharing',
   'Image_data07.npy' : 'https://drive.google.com/file/d/1ejcPdh-d5lB1OcZ6x8SJx61pEUioZvB2/view?usp=sharing',
   'Image_data08.npy' : 'https://drive.google.com/file/d/1DOicAkruNsU5F4DWLzO2QrV6xU4kuVxs/view?usp=sharing',
   'Image_data09.npy' : 'https://drive.google.com/file/d/1a5YyHeRSRdJStG3KnFe2vsNjrsit9zbf/view?usp=sharing',
   'Label_data00.csv' : 'https://drive.google.com/file/d/1CVwvXW2KhVBbTBixwRXIIiMhrlGDXz-4/view?usp=sharing',
   'Label_data01.csv' : 'https://drive.google.com/file/d/1mTYe5icvWXNfY5wEsuQUhSwgtefBJpjg/view?usp=sharing',
   'Label_data02.csv' : 'https://drive.google.com/file/d/1HckmktklyPo6qbakrwtERsCT34mRdn7l/view?usp=sharing',
   'Label_data03.csv' : 'https://drive.google.com/file/d/1GBxDmWcl_o49i4lGujA8EgIn5G4htkBr/view?usp=sharing',
   'Label_data04.csv' : 'https://drive.google.com/file/d/1G4FpJnlqB3ejmdw3SF2w3DFYt8Wnq0fT/view?usp=sharing',
   'Label_data05.csv' : 'https://drive.google.com/file/d/1Vo1J09qP2TAoXwltCF84socz2TPV92JU/view?usp=sharing',
   'Label_data06.csv' : 'https://drive.google.com/file/d/1d7gJjLTQhOw-e9KZJY9pr6KOCIN8NBvp/view?usp=sharing',
   'Label_data07.csv' : 'https://drive.google.com/file/d/1kr5EF0RA3ZwSXmoaBFwFDVnrokh2EaOE/view?usp=sharing',
   'Label_data08.csv' : 'https://drive.google.com/file/d/1mXyedmLezzty2LSSH3asw0LQeu-ie9mz/view?usp=sharing',
   'Label_data09.csv' : 'https://drive.google.com/file/d/1Vdv1cD75VhvC3FdKTen-5rqLJnWpHvmb/view?usp=sharing',
}


# Download each file; fuzzy=True lets gdown extract the file id from the share link.
for key, value in data_links.items():
    gdown.download(
        url=value,
        output=BASE_PATH + key,
        quiet=False,
        fuzzy=True,
        use_cookies=False,
    )
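To address the disk-space and resume questions above, the same loop could skip files that already exist and check free space before each download. A rough sketch (the free-space threshold is a placeholder, not the real dataset size):

import os
import shutil

import gdown

# Reuses BASE_PATH and data_links from the script above.
MIN_FREE_BYTES = 100 * 1024 ** 3  # placeholder threshold; adjust to the real file sizes

for key, value in data_links.items():
    target = os.path.join(BASE_PATH, key)

    # Skip files that are already present so the script can simply be re-run
    # after an interruption (a partially downloaded file should be deleted first).
    if os.path.exists(target):
        print(f'{key} already exists, skipping')
        continue

    # Stop early if the disk is running low on space.
    free = shutil.disk_usage(BASE_PATH).free
    if free < MIN_FREE_BYTES:
        raise RuntimeError(f'only {free / 1024 ** 3:.1f} GB free, stopping downloads')

    gdown.download(url=value, output=target, quiet=False, fuzzy=True, use_cookies=False)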

I might want to convert the images to a zarr.

That would be a good idea ... I hope zarr can help calculate the variance without occupying much memory.

I'm curious how you did it when you did the training? I don't think your full training code is in the repo, right? I guess if you just load the protein images you might be able to fit the full dataset in RAM on a 64GB machine. Alternatively, I could reformat the data to look like the sample data, where each image is its own npy file that can be lazily loaded, and then I can probably use the dataloader you have in the repo. Is that what you did? Any tips or code pointers on how to train on the whole dataset would be appreciated. Thanks!!

You could just include an example script if you don't want to make a full API. I did this and it worked fine

Oh, I see. Yeah, that's an easy solution. You are welcome to make a PR for that.

I'm curious how you did it when you did the training? I don't think your full training code is in the repo, right?

The way I present in the example script is actually pretty much what I did. It does require a lot of memory, mostly for preprocessing, including the variance calculation. If you compute the variance in advance, save the preprocessed data to disk, and disable all the preprocessing in the datamanager, you could probably train a model with much less RAM, well below 64GB I'd guess.

I could reformat the data to look like the sample data where each image is it's own npy file

Yes, I agree. That's a very reasonable option. The reason I didn't do that is that I thought it would generate millions of small files, which makes them difficult to copy/transfer/manage. I wonder if zarr can address this problem, namely saving all the data into a small number of files while still being able to access each image without loading the entire dataset.
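For what it's worth, that is roughly what zarr is designed for: a single store split into compressed chunks with random access per chunk. A rough sketch of converting the .npy files into one chunked zarr array (file names, shapes, and chunk size are assumptions, not part of the repo):

import numpy as np
import zarr

# Hypothetical file list; in practice these would be the Image_data##.npy files.
npy_files = [f'Image_data{i:02d}.npy' for i in range(10)]

store = None
for path in npy_files:
    # mmap_mode='r' keeps the .npy file on disk instead of loading it into RAM.
    block = np.load(path, mmap_mode='r')
    if store is None:
        # Chunk along the image axis (64 images per chunk here) so reading one
        # image only touches its chunk, while the file count stays manageable.
        store = zarr.open(
            'cytoself_images.zarr',
            mode='w',
            shape=(0,) + block.shape[1:],
            chunks=(64,) + block.shape[1:],
            dtype=block.dtype,
        )
    # Copy across in batches to keep memory usage bounded.
    for start in range(0, block.shape[0], 1024):
        store.append(np.asarray(block[start:start + 1024]))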

Is that what you did?

I created an npy file for each protein/channel, which makes it easy to include or exclude certain proteins/channels. This is exactly how the example script works.

I don't think the way I did it is the best way, because I was using server-grade machines and saving RAM wasn't my priority. My intention was more like "how do I save my energy by taking advantage of the compute resources." 😅
So, if you have a better idea please feel free to make a PR and make the repo easier to use for more people.

Oh, I see. Yeah, that's an easy solution. You are welcome to make a PR for that.

Ok great, I might do.

The way I present in the example script is actually pretty much what I did. It does require a lot of memory, mostly for preprocessing, including the variance calculation. If you compute the variance in advance, save the preprocessed data to disk, and disable all the preprocessing in the datamanager, you could probably train a model with much less RAM, well below 64GB I'd guess.

Good to know - that's helpful advice

So, if you have a better idea please feel free to make a PR and make the repo easier to use for more people.

Ok great - I'm currently converting the images to zarr and will then write my own dataloader. If it looks like it improves performance and makes the data easier to work with, I'll submit a PR once it's working.
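For reference, a lazily loading dataset along those lines might look roughly like this (the store layout and the label array are assumptions about how the conversion was done, not code from the repo):

import numpy as np
import torch
import zarr
from torch.utils.data import DataLoader, Dataset


class ZarrImageDataset(Dataset):
    """Reads one image at a time from a zarr store, so only the touched chunk is loaded."""

    def __init__(self, zarr_path, labels):
        self.images = zarr.open(zarr_path, mode='r')
        self.labels = labels  # e.g. integer-encoded labels aligned with the image axis

    def __len__(self):
        return self.images.shape[0]

    def __getitem__(self, idx):
        img = np.asarray(self.images[idx], dtype=np.float32)
        return torch.from_numpy(img), int(self.labels[idx])


# Hypothetical usage; the labels would come from the Label_data##.csv files.
# dataset = ZarrImageDataset('cytoself_images.zarr', labels)
# loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)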

Thanks for your input!!

Hi @li-li-github - I have made some progress trying to reproduce your results with the full dataset. I did end up converting the images to zarr and am using my own dataloader.

My question now is about the additional localization labels. When I read in the .csv file I can get the localizations according to the three grades loc_grade1, loc_grade2, and loc_grade3, which match the categories here on the OpenCell data page:

Each category is divided into three 'grades': grade 3 indicates a very prominent localization, grade 2 indicates unambiguous but less prominent localization, and grade 1 indicates weak or barely detectable localization

Some proteins have multiple annotations separated by ;.

I'd like to remake the UMAP from figure 2 of the cytoself paper

[UMAP figure from the cytoself paper, figure 2]

to confirm that all my code is working correctly, but I am wondering how you did the mapping from the localization information above to the 9 categories.

Specifically:

  • Did you only use the categories in loc_grade1, or did you use anything from the other grades?
  • For proteins with no loc_grade1, did you put them in other or just ignore them?
  • For proteins with multiple annotations, did you put them in other or do something else?
  • I get the following 17 unique ;-separated labels from loc_grade1. How did you combine them into the 9 labels you used above? (Some are obvious, some I'm not sure about.)
['cytoplasmic', 'nucleoplasm', 'nuclear_punctae', 'mitochondria',
       'vesicles', 'er', 'membrane', 'centrosome', 'golgi',
       'nuclear_membrane', 'big_aggregates', 'cell_contact',
       'cytoskeleton', 'nucleolus_gc', 'focal_adhesions', 'chromatin',
       'nucleolus_fc_dfc']

Thanks again for your time answering these questions - I'm having a lot of fun trying to reproduce your work. All the amazing code you've already provided in this repo is making it so much easier!!