jump-cellpainting / datasets

Images and other data from the JUMP Cell Painting Consortium

Source of truth for all images with perturbation labels

hanslovsky opened this issue · comments

I am currently preparing JUMP for our image processing pipeline. We are mostly interested in all images plus the perturbation label for each well. What is the source of truth for all wells in the dataset? I was able to find some sort of metadata file (Index.idx.xml, indexfile.txt, MeasurementData.mlf) in the images prefix for all plates except those from sources 7 and 8. I use those files to create my own metadata table and join it with metadata/well.csv.gz to get the well treatment labels.
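The join described above can be sketched with pandas. This is a minimal illustration, not the consortium's method: the key columns (`Metadata_Source`, `Metadata_Plate`, `Metadata_Well`), the `Metadata_Perturbation` and `FileName_OrigDNA` columns, and all values in the toy tables are assumptions and should be checked against the real files.

```python
import pandas as pd

# Assumed join keys; check them against the actual columns in well.csv.gz.
KEYS = ["Metadata_Source", "Metadata_Plate", "Metadata_Well"]


def attach_well_labels(images: pd.DataFrame, wells: pd.DataFrame) -> pd.DataFrame:
    """Left-join per-image rows onto per-well treatment labels."""
    return images.merge(
        wells,
        on=KEYS,
        how="left",
        validate="many_to_one",  # raise if a well appears twice in `wells`
        indicator=True,          # adds a `_merge` column for auditing
    )


# Toy data for illustration only (all names and values are hypothetical).
images = pd.DataFrame(
    {
        "Metadata_Source": ["source_x", "source_x", "source_x"],
        "Metadata_Plate": ["plate_1", "plate_1", "plate_2"],
        "Metadata_Well": ["A01", "A01", "B02"],
        "FileName_OrigDNA": ["a.tiff", "b.tiff", "c.tiff"],
    }
)
wells = pd.DataFrame(
    {
        "Metadata_Source": ["source_x"],
        "Metadata_Plate": ["plate_1"],
        "Metadata_Well": ["A01"],
        "Metadata_Perturbation": ["compound_1"],
    }
)

labeled = attach_well_labels(images, wells)

# Images whose well has no label row show up as "left_only".
unlabeled = labeled[labeled["_merge"] == "left_only"]
print(unlabeled[KEYS])
```

With the real data, `wells` could be loaded with `pd.read_csv("metadata/well.csv.gz")`; pandas infers gzip compression from the file extension.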

Now I found load_data_csv, which may actually be a better source for the metadata. It covers all plates except the following (I have not checked sources 7 and 8 yet):

[('source_3', 'C13451bW'),
 ('source_3', 'C13451dW'),
 ('source_3', 'C13495dW'),
 ('source_3', 'J12440d'),
 ('source_3', 'SP16P19c'),
 ('source_3', 'SP24P27c'),
 ('source_3', 'SP24P27d')]
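A list like the one above can be produced with a set difference between the plates named in the plate metadata and the plates for which a load_data file was actually found. The sketch below assumes the columns `Metadata_Source` and `Metadata_Plate`; the helper name `missing_plates` and the plate `plate_a` are hypothetical.

```python
from __future__ import annotations

import pandas as pd


def missing_plates(
    plates: pd.DataFrame, found: set[tuple[str, str]]
) -> list[tuple[str, str]]:
    """Return (source, plate) pairs listed in `plates` but absent from `found`."""
    expected = set(zip(plates["Metadata_Source"], plates["Metadata_Plate"]))
    return sorted(expected - found)


# Toy stand-in for metadata/plate.csv.gz (column names are assumptions).
plates = pd.DataFrame(
    {
        "Metadata_Source": ["source_3", "source_3"],
        "Metadata_Plate": ["J12440d", "plate_a"],
    }
)

# Pairs for which a load_data file was located, e.g. from a bucket listing.
found = {("source_3", "plate_a")}

missing = missing_plates(plates, found)
print(missing)
```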

The sample_notebook.ipynb uses load_data_with_illum.parquet. I ran the same analysis on the parquet files and found that the same plates are missing there as well.

Now I am thinking that I should use metadata/plate.csv.gz to identify all plates, then find the corresponding load_data_with_illum.parquet file for each plate, and download the data that way. Is this the preferred way to download/process the images?
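As a rough sketch of that plan: build one load_data path per plate from the plate metadata, then read each file. The path template, the `root` value, the `Metadata_Batch` column, and the helper name `load_data_paths` are all assumptions made for illustration; the actual bucket layout should be taken from the repository's documentation or sample notebook.

```python
import pandas as pd

# Hypothetical layout; replace with the real prefix structure.
TEMPLATE = (
    "{root}/{source}/workspace/load_data_csv/"
    "{batch}/{plate}/load_data_with_illum.parquet"
)


def load_data_paths(
    plates: pd.DataFrame, root: str, template: str = TEMPLATE
) -> pd.DataFrame:
    """Return one row per (source, batch, plate) with its load_data path."""
    cols = ["Metadata_Source", "Metadata_Batch", "Metadata_Plate"]
    out = plates[cols].drop_duplicates().reset_index(drop=True)
    out["load_data_path"] = [
        template.format(root=root, source=s, batch=b, plate=p)
        for s, b, p in zip(out[cols[0]], out[cols[1]], out[cols[2]])
    ]
    return out


# Toy stand-in for metadata/plate.csv.gz (all values are hypothetical).
plates = pd.DataFrame(
    {
        "Metadata_Source": ["source_x", "source_x", "source_x"],
        "Metadata_Batch": ["batch_1", "batch_1", "batch_1"],
        "Metadata_Plate": ["plate_1", "plate_1", "plate_2"],
    }
)

paths = load_data_paths(plates, root="s3://example-bucket/example-prefix")
for path in paths["load_data_path"]:
    print(path)
```

Each path could then be passed to `pd.read_parquet`, which needs a parquet engine such as pyarrow and, for `s3://` URLs, the s3fs package. Plates without a file at the expected location would have to be handled explicitly, for example by catching `FileNotFoundError`.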

Hi @hanslovsky, I believe you are on the right track. Tagging @shntnu who can confirm if this is the recommended approach.

Awesome, thank you! That makes things a lot easier on my side.

Hello, are the metadata files for the above-mentioned plates actually missing, or is there another source of metadata for those particular plates? Thank you! @niranjchandrasekaran

cp_26_all_phenix1/j12440d/
cp_28_all_phenix1/sp24p27c/
cp_25_all_phenix1/c13451bw/
cp_25_all_phenix1/c13451dw/
cp_28_all_phenix1/sp16p19c/
cp_25_all_phenix1/c13495dw/
cp_28_all_phenix1/sp24p27d/

Based on our internal notes, these plates were dropped because they failed QC. However, we retained the images in case we wanted to use them to develop QC approaches.

From https://github.com/jump-cellpainting/aws/issues/73#issuecomment-1063006775:

I would vote for removing these plates from the bucket entirely, just to avoid future confusion. If we want to keep the images to develop QC approaches, I would just delete the <plate>.csv.gz files from their profiles/<batch>/ directory. Then it should be clear that the augmented, normalized, etc. profiles are not missing.

I've added this to the FAQ issue #62 (comment)