jump-cellpainting / datasets

Images and other data from the JUMP Cell Painting Consortium

Provide information about missing files

hanslovsky opened this issue · comments

I am trying to download all images for source_11 that I can find in the respective load_data_with_illum.parquet files. I found that for these parquet files,

['cpg0016-jump/source_11/workspace/load_data_csv/Batch2/EC000038/load_data_with_illum.parquet',
 'cpg0016-jump/source_11/workspace/load_data_csv/Batch2/EC000066/load_data_with_illum.parquet',
 'cpg0016-jump/source_11/workspace/load_data_csv/Batch2/EC000070/load_data_with_illum.parquet']

there are 1216 fields/sites with at least one missing image, for a total of 6068 missing images. I attached the full list as a CSV: [source_11-404.csv](https://github.com/jump-cellpainting/datasets/files/12325106/source_11-404.csv). Locally the file is named source_11-404.txt; I had to change the extension from txt to csv to attach it to this comment. This is what the CSV looks like:

$ head source_11-404.txt
failed-paths
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f03p01-ch2sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f03p01-ch4sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f03p01-ch3sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f03p01-ch5sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f04p01-ch1sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f04p01-ch2sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f04p01-ch4sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f04p01-ch3sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f04p01-ch5sk1fk1fl1.tiff

For example, aws s3 ls on the first file in the snippet above exits with code 1, i.e. the key does not exist:

$ aws s3 --no-sign-request ls s3://cellpainting-gallery/cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f03p01-ch2sk1fk1fl1.tiff

$ echo $?
1

When I use the same key but change the channel from ch2 to ch1, that file exists:

$ aws s3 --no-sign-request ls s3://cellpainting-gallery/cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f03p01-ch1sk1fk1fl1.tiff
2022-12-21 21:35:43    2058750 r11c10f03p01-ch1sk1fk1fl1.tiff

$ echo $?
0
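
For checking many keys, the same test can be scripted. The following is a minimal sketch, assuming the aws CLI is installed and on the PATH; the helper names `ls_command` and `key_exists` are made up for illustration and are not part of any JUMP tooling.

```python
import subprocess

BUCKET = "cellpainting-gallery"  # bucket name taken from the commands above


def ls_command(key: str) -> list[str]:
    """Build the same unsigned `aws s3 ls` call used above for one key."""
    return ["aws", "s3", "--no-sign-request", "ls", f"s3://{BUCKET}/{key}"]


def key_exists(key: str) -> bool:
    """Return True if `aws s3 ls` exits with code 0 for this key.

    Caveat: `aws s3 ls` matches by prefix, so a key that is only a prefix
    of another object's key would also be reported as existing.
    """
    result = subprocess.run(ls_command(key), capture_output=True, text=True)
    return result.returncode == 0
```

Starting one CLI process per key is slow for thousands of files, so this is only meant to mirror the manual check, not to be an efficient bulk audit.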

I will double-check that I inferred the correct file names from the parquet files. The existence of ch1 in this example suggests that I inferred the correct names, at least for that field/site.
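
For reference, here is a rough sketch of how keys could be derived from one row of a load_data table. It assumes, without confirmation from this thread, that each channel has a `FileName_<channel>` column and a matching `PathName_<channel>` column, and that the path may carry an `s3://cellpainting-gallery/` prefix. The function name `image_keys` is hypothetical, and the row is passed as a plain dict so the sketch does not depend on any particular parquet reader.

```python
def image_keys(row: dict, bucket: str = "cellpainting-gallery") -> list[str]:
    """Return one bucket-relative key per FileName_*/PathName_* pair in a row.

    Assumption: columns follow the FileName_<channel> / PathName_<channel>
    naming pattern; adjust the prefixes if the real schema differs.
    """
    prefix = f"s3://{bucket}/"
    keys = []
    for column, file_name in row.items():
        if not column.startswith("FileName_"):
            continue
        channel = column[len("FileName_"):]
        path = row[f"PathName_{channel}"]
        # Drop an optional s3://<bucket>/ prefix so the result is a plain key.
        if path.startswith(prefix):
            path = path[len(prefix):]
        keys.append(f"{path.rstrip('/')}/{file_name}")
    return keys
```

Comparing the keys produced this way against a key that is known to exist, such as the ch1 file above, is one way to confirm the inference for a given field/site.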

To find the number of missing fields/sites, I removed the channel sub-string:

$ cat notebooks/data/source_11-404.txt | sed 's/-ch[0-9]sk1fk1fl1.tiff//' | sort | uniq -c | wc -l
1217

Subtract 1 for the CSV header line (failed-paths), which gives 1216.

Note: I originally stated that I found 1216 wells with at least one missing image, but that was incorrect. The 1216 are fields/sites with at least one missing image, not wells.
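
The sed pipeline above can also be written in Python, which makes it easy to see which channels are missing for each field/site. This is a sketch that assumes every path ends in the `-ch<N>sk1fk1fl1.tiff` pattern shown in the CSV excerpt; the function name `missing_channels_by_site` is made up for illustration.

```python
import re
from collections import defaultdict

# Channel suffix of the file names shown above, e.g. "-ch2sk1fk1fl1.tiff".
CHANNEL_SUFFIX = re.compile(r"-ch(\d+)sk1fk1fl1\.tiff$")


def missing_channels_by_site(lines):
    """Map each field/site prefix to the sorted channel numbers missing for it."""
    sites = defaultdict(set)
    for line in lines:
        line = line.strip()
        match = CHANNEL_SUFFIX.search(line)
        if match is None:
            continue  # skips the "failed-paths" header and blank lines
        sites[line[:match.start()]].add(int(match.group(1)))
    return {site: sorted(channels) for site, channels in sites.items()}
```

Usage, with the same local path as in the shell command:

```python
with open("notebooks/data/source_11-404.txt") as handle:
    by_site = missing_channels_by_site(handle)
print(len(by_site))  # number of fields/sites with at least one missing image
```

Because the header line does not match the pattern, no subtraction is needed, and the count should agree with the sed result if every path follows the assumed naming.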

Thank you @hanslovsky for the detailed report!

These images in source_11 are indeed missing (internal notes: https://github.com/jump-cellpainting/aws/issues/81#issuecomment-1266405250). I will keep this issue open so that we can think of ways to inform the users of the dataset that these files are missing.