jump-cellpainting / datasets

Images and other data from the JUMP Cell Painting Consortium

Missing metadata parts and a bug for source 11

Arkkienkeli opened this issue

Hello, I found some missing parts in the metadata of source 11:

  • Plates EC000033 and EC000034 (from Batch1) are missing both from plate.csv.gz and well.csv.gz, while there are corresponding image folders.
  • Bug: plate EC000157 in all metadata files in the bucket (csv and parquet) is called EC000157real (Metadata_Plate column). In plate.csv.gz and well.csv.gz it is called EC000157.

This was intentional

From our internal notes:


Based on the note below, we should exclude EC000033 and EC000034 from analysis (but leave the images in there, as we did in https://github.com/jump-cellpainting/aws/issues/73#issuecomment-1063006775)
"As for the EC barcodes (EC000033 and EC000034) these show as empty new plates, so I am sure they had either DMSO only plated offline or nothing but media"

https://github.com/jump-cellpainting/aws/issues/73#issuecomment-1063006775 said this:

I would vote for removing these plates from the bucket entirely, just to avoid future confusion. If we want to keep the images to develop QC approaches, I would just delete the .csv.gz files from their profiles// directory. Then it should be clear, that the augmented, normalized, etc. profiles are not missing.


So we kept those plates because we thought they may be useful for developing QC methods, but revisiting this now, it is probably confusing to leave them in there.

@Arkkienkeli You can ignore these plates; we will likely delete them
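The mismatch reported at the top of this thread (image folders with no matching rows in plate.csv.gz) can be checked with a short script. The following is a minimal sketch, not the consortium's tooling: the function name is hypothetical, it assumes plate.csv.gz has a Metadata_Plate column, and it assumes the image folder names have already been reduced to bare plate barcodes.

```python
import csv
import gzip


def plates_missing_from_metadata(plate_csv_gz, image_plate_folders):
    """Return plate names that have an image folder but no row in plate.csv.gz.

    Assumption: image_plate_folders holds bare barcodes such as "EC000033",
    with any suffix from the real folder names already stripped.
    """
    with gzip.open(plate_csv_gz, mode="rt", newline="") as handle:
        listed = {row["Metadata_Plate"] for row in csv.DictReader(handle)}
    # Set difference: folders present on disk that the metadata never mentions.
    return sorted(set(image_plate_folders) - listed)
```

Run against the Batch1 folder names, a check like this would be expected to flag EC000033 and EC000034, which were left out of the metadata on purpose.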

  • Bug: plate EC000157 in all metadata files in the bucket (csv and parquet) is called EC000157real (Metadata_Plate column). In plate.csv.gz and well.csv.gz it is called EC000157.

Thanks for flagging this

I'll drop in some notes for now

Within workspace/analysis/Batch3 there are EC000157 and EC000157real folders. Within the EC000157/analysis/ folder, the site folders have the same stem as each other (e.g. EC000157real-A01-1). Likewise, within workspace/backend/Batch3 we see EC000157 and EC000157real folders, as they would have been created from the analysis folder. (workspace/load_data_csv/Batch3 has EC000157real but not EC000157.)
It seems that we should clean this up so we are only using 1 (and I would vote for EC000157). I can do a spot check between the analysis folders to confirm they have the same contents and then clean everything up so we don't have to worry about real anywhere?
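One way to do the spot check described above is to compare the two folders file by file. This is a rough sketch under stated assumptions: both analysis folders have been copied to local disk, the function names are hypothetical, and files are small enough to hash in memory.

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            digests[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_trees(left, right):
    """Report files present on one side only and files whose contents differ."""
    a, b = file_digests(left), file_digests(right)
    return {
        "only_left": sorted(a.keys() - b.keys()),
        "only_right": sorted(b.keys() - a.keys()),
        "differing": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

If all three lists come back empty for the local copies of EC000157 and EC000157real, the folders hold identical files under identical relative paths and one of them can be dropped safely.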

  • plate EC000157 in all metadata files in the bucket (csv and parquet) is called EC000157real

We should fix this; can you please drop in the URL to these files? @Arkkienkeli

cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/load_data.csv.gz
cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/load_data_with_illum.csv.gz
cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/load_data_with_illum.parquet
cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/load_data_with_illum_and_cell_location.parquet

@Arkkienkeli Could I ask you to actually fix it at your end and upload? (DGX-1 is fine) Since you are already into it, that might be the most efficient

I'd imagine it is as simple as replacing EC000157real with EC000157 in the Metadata_Plate column for all 4 files
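For the two .csv.gz files, the replacement suggested here could look roughly like the sketch below. It is not the script that was actually used for the fix; the function name and its parameters are hypothetical. It changes only exact matches in the Metadata_Plate column and leaves every other column untouched, so any path columns that also contain "EC000157real" would need a separate decision. The two parquet files would need a parquet-capable library and are not covered.

```python
import csv
import gzip


def rename_plate(src, dst, old="EC000157real", new="EC000157",
                 column="Metadata_Plate"):
    """Copy a gzipped CSV, replacing exact matches of `old` in `column`.

    Returns the number of rows changed. Assumes the file has a header row
    that includes `column`.
    """
    changed = 0
    with gzip.open(src, "rt", newline="") as fin, \
            gzip.open(dst, "wt", newline="") as fout:
        reader = csv.DictReader(fin)
        # lineterminator="\n" avoids the csv module's default "\r\n" endings.
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames,
                                lineterminator="\n")
        writer.writeheader()
        for row in reader:
            if row[column] == old:
                row[column] = new
                changed += 1
            writer.writerow(row)
    return changed
```

Writing to a separate destination file keeps the original intact until the returned row count has been checked against the number of rows expected for the plate.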

Done, please see our Slack thread on that.

Done

shsingh@dgx-18-04:~$ aws s3 ls s3://cellpainting-gallery/cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/
2022-12-22 18:37:13      66435 load_data.csv.gz
2022-12-22 18:37:13      88920 load_data_with_illum.csv.gz
2022-12-22 18:37:13     139189 load_data_with_illum.parquet
2023-04-30 14:02:09   10318407 load_data_with_illum_and_cell_location.parquet
shsingh@dgx-18-04:~$ aws s3 sync /dgx1nas1/cellpainting-datasets/JUMP_cpg0016/fix_s11_EC000157/ s3://cellpainting-gallery/cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/
upload: ../../dgx1nas1/cellpainting-datasets/JUMP_cpg0016/fix_s11_EC000157/load_data.csv.gz to s3://cellpainting-gallery/cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/load_data.csv.gz
upload: ../../dgx1nas1/cellpainting-datasets/JUMP_cpg0016/fix_s11_EC000157/load_data_with_illum.csv.gz to s3://cellpainting-gallery/cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/load_data_with_illum.csv.gz
upload: ../../dgx1nas1/cellpainting-datasets/JUMP_cpg0016/fix_s11_EC000157/load_data_with_illum.parquet to s3://cellpainting-gallery/cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/load_data_with_illum.parquet
upload: ../../dgx1nas1/cellpainting-datasets/JUMP_cpg0016/fix_s11_EC000157/load_data_with_illum_and_cell_location.parquet to s3://cellpainting-gallery/cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/load_data_with_illum_and_cell_location.parquet
shsingh@dgx-18-04:~$ aws s3 ls s3://cellpainting-gallery/cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/
2024-03-06 13:57:59      59610 load_data.csv.gz
2024-03-06 13:57:59      73125 load_data_with_illum.csv.gz
2024-03-06 13:57:59     130810 load_data_with_illum.parquet
2024-03-06 13:57:59   10318379 load_data_with_illum_and_cell_location.parquet

  • Plates EC000033 and EC000034 (from Batch1) are missing both from plate.csv.gz and well.csv.gz, while there are corresponding image folders.

These plates have now been deleted

https://github.com/jump-cellpainting/aws/issues/81#issuecomment-1999443917

Rationale: #101 (comment)