Missing metadata parts and a bug for source 11
Arkkienkeli opened this issue
Hello, I found some missing parts in the metadata of source 11:
- Plates EC000033 and EC000034 (from Batch1) are missing from both plate.csv.gz and well.csv.gz, while there are corresponding image folders.
- Bug: plate EC000157 in all metadata files in the bucket (csv and parquet) is called EC000157real (Metadata_Plate column). In plate.csv.gz and well.csv.gz it is called EC000157.
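A check like the first bullet can be sketched as a set difference between the plate folders seen under the images prefix and the plates listed in plate.csv.gz. This is only a minimal sketch: it assumes plate.csv.gz has a Metadata_Plate column and that the list of image plate folders has already been collected by some other means (for example from an S3 listing). The function name is hypothetical.

```python
import csv
import gzip


def plates_missing_from_metadata(image_plates, plate_csv_gz, column="Metadata_Plate"):
    """Return plates that have an image folder but no row in plate.csv.gz.

    image_plates: iterable of plate names taken from the image folders.
    plate_csv_gz: path to a local copy of plate.csv.gz.
    column: assumed name of the plate column in that file.
    """
    with gzip.open(plate_csv_gz, "rt", newline="") as handle:
        listed = {row[column] for row in csv.DictReader(handle)}
    # Anything present as images but absent from the metadata table.
    return sorted(set(image_plates) - listed)
```

Passing the Batch1 image folder names and a local copy of plate.csv.gz would be expected to return EC000033 and EC000034 if they are the only plates without metadata rows.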
- Plates EC000033 and EC000034 (from Batch1) are missing from both plate.csv.gz and well.csv.gz, while there are corresponding image folders.
This was intentional
From our internal notes:
Based on the note below, we should exclude EC000033 and EC000034 from analysis (but leave the images in there, as we did in https://github.com/jump-cellpainting/aws/issues/73#issuecomment-1063006775)
"As for the EC barcodes (EC000033 and EC000034) these show as empty new plates, so I am sure they had either DMSO only plated offline or nothing but media"
https://github.com/jump-cellpainting/aws/issues/73#issuecomment-1063006775 said this:
I would vote for removing these plates from the bucket entirely, just to avoid future confusion. If we want to keep the images to develop QC approaches, I would just delete the .csv.gz files from their profiles// directory. Then it should be clear, that the augmented, normalized, etc. profiles are not missing.
So we kept those plates because we thought they may be useful for developing QC methods, but revisiting this now, it is probably confusing to leave them in there.
@Arkkienkeli You can ignore these plates; we will likely delete them
- Bug: plate EC000157 in all metadata files in the bucket (csv and parquet) is called EC000157real (Metadata_Plate column). In plate.csv.gz and well.csv.gz it is called EC000157.
Thanks for flagging this
I'll drop in some notes for now
Within workspace/analysis/Batch3 there are EC000157 and EC000157real folders. Within the EC000157/analysis/ folder the site folders have the same stem as each other (e.g. EC000157real-A01-1). Likewise, within workspace/backend/Batch3 we see EC000157 and EC000157real folders, as they would have been created from the analysis folder. (workspace/load_data_csv/Batch3 has EC000157real but not EC000157.)
It seems that we should clean this up so we are only using 1 (and I would vote for EC000157). I can do a spot check between the analysis folders to confirm they have the same contents and then clean everything up so we don't have to worry about real anywhere?
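One way to do such a spot check, assuming both analysis folders have first been copied to local disk, is to hash every file and compare the two trees by relative path. This is a sketch, not the procedure that was actually used. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def tree_digest(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            digests[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_trees(left, right):
    """Return (only_in_left, only_in_right, differing) lists of relative paths."""
    a, b = tree_digest(left), tree_digest(right)
    only_left = sorted(set(a) - set(b))
    only_right = sorted(set(b) - set(a))
    # Files present on both sides whose contents differ.
    differing = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return only_left, only_right, differing
```

Three empty lists would indicate that the two folders hold the same files with identical contents. read_bytes loads each file fully into memory, so very large files may call for chunked hashing instead.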
- plate EC000157 in all metadata files in the bucket (csv and parquet) is called EC000157real
We should fix this; can you please drop in the URL to these files? @Arkkienkeli
cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/load_data.csv.gz
cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/load_data_with_illum.csv.gz
cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/load_data_with_illum.parquet
cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/load_data_with_illum_and_cell_location.parquet
@Arkkienkeli Could I ask you to actually fix it at your end and upload? (DGX-1 is fine) Since you are already into it, that might be the most efficient
I'd imagine it is as simple as replacing EC000157real with EC000157 in the Metadata_Plate column for all 4 files
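For the two .csv.gz files, the replacement described above could look roughly like the sketch below. It rewrites only the Metadata_Plate column and leaves every other column as it is, including any path that happens to contain EC000157real. The function name is hypothetical. The two parquet files are not covered here and would need a Parquet-capable library. Note that re-serializing through the csv module may change quoting or line endings relative to the original file.

```python
import csv
import gzip


def rename_plate(src, dst, old="EC000157real", new="EC000157", column="Metadata_Plate"):
    """Copy a gzipped CSV from src to dst, replacing old with new in one column.

    Returns the number of rows whose plate value was changed.
    """
    changed = 0
    with gzip.open(src, "rt", newline="") as fin, gzip.open(dst, "wt", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames, lineterminator="\n")
        writer.writeheader()
        for row in reader:
            # Exact match on the plate column only; other columns are untouched.
            if row[column] == old:
                row[column] = new
                changed += 1
            writer.writerow(row)
    return changed
```

Writing to a separate destination file keeps the original available for comparison before anything is uploaded.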
Done, please see our Slack thread on that.
Done
shsingh@dgx-18-04:~$ aws s3 ls s3://cellpainting-gallery/cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/
2022-12-22 18:37:13 66435 load_data.csv.gz
2022-12-22 18:37:13 88920 load_data_with_illum.csv.gz
2022-12-22 18:37:13 139189 load_data_with_illum.parquet
2023-04-30 14:02:09 10318407 load_data_with_illum_and_cell_location.parquet
shsingh@dgx-18-04:~$ aws s3 sync /dgx1nas1/cellpainting-datasets/JUMP_cpg0016/fix_s11_EC000157/ s3://cellpainting-gallery/cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/
upload: ../../dgx1nas1/cellpainting-datasets/JUMP_cpg0016/fix_s11_EC000157/load_data.csv.gz to s3://cellpainting-gallery/cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/load_data.csv.gz
upload: ../../dgx1nas1/cellpainting-datasets/JUMP_cpg0016/fix_s11_EC000157/load_data_with_illum.csv.gz to s3://cellpainting-gallery/cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/load_data_with_illum.csv.gz
upload: ../../dgx1nas1/cellpainting-datasets/JUMP_cpg0016/fix_s11_EC000157/load_data_with_illum.parquet to s3://cellpainting-gallery/cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/load_data_with_illum.parquet
upload: ../../dgx1nas1/cellpainting-datasets/JUMP_cpg0016/fix_s11_EC000157/load_data_with_illum_and_cell_location.parquet to s3://cellpainting-gallery/cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/load_data_with_illum_and_cell_location.parquet
shsingh@dgx-18-04:~$ aws s3 ls s3://cellpainting-gallery/cpg0016-jump/source_11/workspace/load_data_csv/Batch3/EC000157/
2024-03-06 13:57:59 59610 load_data.csv.gz
2024-03-06 13:57:59 73125 load_data_with_illum.csv.gz
2024-03-06 13:57:59 130810 load_data_with_illum.parquet
2024-03-06 13:57:59 10318379 load_data_with_illum_and_cell_location.parquet
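To confirm the uploaded files no longer carry the old name, one option is to download a .csv.gz copy and count the distinct values in the Metadata_Plate column. This is a sketch under the assumption that the file is available locally. The function name is hypothetical, and the parquet files would need a separate check.

```python
import csv
import gzip
from collections import Counter


def plate_value_counts(path, column="Metadata_Plate"):
    """Count how many rows carry each plate name in a gzipped CSV."""
    with gzip.open(path, "rt", newline="") as handle:
        return Counter(row[column] for row in csv.DictReader(handle))
```

After the fix, the result for load_data.csv.gz would be expected to have EC000157 as its only key.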
- Plates EC000033 and EC000034 (from Batch1) are missing from both plate.csv.gz and well.csv.gz, while there are corresponding image folders.
These plates have now been deleted
https://github.com/jump-cellpainting/aws/issues/81#issuecomment-1999443917
Rationale: #101 (comment)