Some wells in load_data are missing (but are present in `wells.csv.gz`)
NinoDui opened this issue · comments
Hi there,
I happened to find the metadata for wells of source_10, batch 2021_08_12_U2OS_48_hr_run15, and plate Dest210803-160702 may be missed. May I get your help in double-checking it? Or feel free to correct me if I am not taking it at the right pace.
A quick demo of how to revise it:
S3_LOADDATA_FORMATTER = (
"s3://cellpainting-gallery/cpg0016-jump/"
"{Metadata_Source}/workspace/load_data_csv/"
"{Metadata_Batch}/{Metadata_Plate}/load_data_with_illum.parquet"
)
nan_index = {
'Metadata_Source': 'source_10',
"Metadata_Batch": '2021_08_12_U2OS_48_hr_run15',
"Metadata_Plate": 'Dest210803-160702'
}
s3_path = S3_LOADDATA_FORMATTER.format(**nan_index)
s3_nan_meta = pd.read_parquet(s3_path, storage_options={"anon": True})
wells_from_parquet = s3_nan_meta['Metadata_Well'].unique() # <----- Here wells are enumerated from A01 to C22
wells = pd.read_csv(os.path.join(DATA_ROOT, "metadata", "well.csv.gz"))
wells_plate_info = wells.loc[(wells['Metadata_Source'] == nan_index['Metadata_Source']) & (wells['Metadata_Plate'] == nan_index['Metadata_Plate']), :]
wells_from_plate = wells_plate_info['Metadata_Well'].unique() # <----- Here wells are enumerated from A01 to P24
It seems like the well info provided by wells.csv.gz
is larger in amount compared to those retrieved from load_data_with_illum.parquet
. Is that a corner case that I missed? Or is that being uploaded in progress?
Thanks for your time and effort.
Best wishes,
Nino
cc @shntnu
Hey, I was wondering if there's any update about it? Thanks.
Hi @NinoDui, I checked the
load_data_with_illum.parquet
andwell.csv.gz
files and found the same missing wells as you did.@shntnu it looks like
load_data.csv.gz
has all 384 wells, but bothload_data_with_illum.csv.gz
andload_data_with_illum.parquet
have only 70 wells.
Hey Niranj, thank you for all the support. Much appreciated it if we could figure it out. Thanks in advance :)
Update: This comment ended up being off-topic but I am retaining it here because it is useful information
I'll copy over some notes from an internal thread and then respond in the next comment
[...]
Relatedly, there are wells for which there are images, but they don't appear in the metadata files. I think we have 9 of these wells:
plate well
36189 BR00121426 H14
39644 BR00121427 H14
43100 BR00121428 H14
53468 BR00121436 H14
598321 BR00125181 C10
663975 BR00125638 C10
665352 BR00125638 I19
948012 BR00127147 F16
948066 BR00127147 F22
If you take that first one, for example, it is definitely the case that the files exist in the S3 buckets:
$ aws s3 ls s3://cellpainting-gallery/cpg0016-jump/source_4/images/2021_08_09_Batch11/images/BR00121426__2021-08-14T15_29_30-Measurement1/Images/ | grep r08c14 | head -5
2022-10-20 22:08:31 1853777 r08c14f01p01-ch1sk1fk1fl1.tiff
2022-10-20 22:05:06 1904709 r08c14f01p01-ch2sk1fk1fl1.tiff
2022-10-20 22:08:32 1712279 r08c14f01p01-ch3sk1fk1fl1.tiff
2022-10-20 22:08:32 1749803 r08c14f01p01-ch4sk1fk1fl1.tiff
2022-10-20 22:05:07 1816711 r08c14f01p01-ch5sk1fk1fl1.tiff
...but it does not exist in the metadata manifest:
In [23]: wells[(wells.Metadata_Plate == "BR00121426") & (wells.Metadata_Well == "H14")]
Out[23]:
Empty DataFrame
Columns: [Metadata_Source, Metadata_Plate, Metadata_Well, Metadata_JCP2022]
Index: []
On further probing, it turned out that the wells were blank.
As noted in #61 (comment), if a well is empty, the images will exist but no profiles will be created, and thus there will be no entry in wells.csv.gz
I'll add this issue to our FAQ #62
As noted in #61 (comment), if a well is empty, the images will exist but no profiles will be created, and thus there will be no entry in wells.csv.gz
IIUC, the problem seems to be the other way around. For this plate, there are wells and in well.csv.gz
and the aggregated profile parquet file also has all 384 wells. But the load_data_with_illum
files seem to be missing wells.
Sorry, I totally missed that.
it looks like
load_data.csv.gz
has all 384 wells, but bothload_data_with_illum.csv.gz
andload_data_with_illum.parquet
have only 70 wells.
load_data.csv.gz
seems to have the same number of rows as load_data_with_illum.csv.gz
. can you verify @niranjchandrasekaran ?
aws s3 cp s3://cellpainting-gallery/cpg0016-jump/source_10/workspace/load_data_csv/2021_08_12_U2OS_48_hr_run15/Dest210803-160702/load_data.csv.gz -|gunzip - |wc -l
# 417
aws s3 cp s3://cellpainting-gallery/cpg0016-jump/source_10/workspace/load_data_csv/2021_08_12_U2OS_48_hr_run15/Dest210803-160702/load_data_with_illum.csv.gz -|gunzip - |wc -l
# 417
That said, I confirmed that the images
folder does have all 384 wells
aws s3 ls s3://cellpainting-gallery/cpg0016-jump/source_10/images/2021_08_12_U2OS_48_hr_run15/images/Dest210803-160702/ |gzip > ~/Desktop/source_10_2021_08_12_U2OS_48_hr_run15_Dest210803-160702.txt.gz
gzcat ~/Desktop/source_10_2021_08_12_U2OS_48_hr_run15_Dest210803-160702.txt.gz|grep tif$|grep Dest21|tr -s " "|cut -d" " -f4|cut -d"_" -f2|sort|uniq -c|wc -l
# 384
gzcat ~/Desktop/source_10_2021_08_12_U2OS_48_hr_run15_Dest210803-160702.txt.gz|grep tif$|grep Dest21|tr -s " "|cut -d" " -f4|cut -d"_" -f2|sort|uniq -c|cut -d" " -f3|sort |uniq -c
# 384 48
load_data.csv.gz seems to have the same number of rows as load_data_with_illum.csv.gz. can you verify @niranjchandrasekaran ?
You are right. load_data.csv.gz
also has the same number of wells as the two illum
files.
@NinoDui Thank you so much for flagging this! Can you help us report how prevalent this issue is?
Here's how you'd do it
import pandas as pd
source_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13]
df_all = pd.DataFrame()
for source_id in source_ids:
df = load(
dataset="cpg0016-jump",
source=f"source_{source_id}}",
component="load_data_csv",
batch="2021_06_14_Batch6",
plate="BR00121429",
columns=["Metadata_Source", "Metadata_Plate", "Metadata_Well"],
)
# keep only distinct rows
df = df.drop_duplicates()
df_all = df_all.append(df)
Now report which wells are present in well.csv.gz
but absent in df_all
Warning: I haven't checked the code (that was autogenerated by Copilot :D)
I've just checked with my own comparison script and found 125 mismatches (at plate level) between the number of wells available on a plate and those provided by corresponding load_data_with_illum.parquet
.
Results
The result is attached.
- n_well_on_plate: number of wells calculated by (meta)
plate.csv.gz
andwells.csv.gz
- n_well_from_parquet: number of wells retrieved from
load_data_with_illum.parquet
Script
How the wells from parquet are counted:
S3_LOADDATA_FORMATTER = (
"s3://cellpainting-gallery/cpg0016-jump/"
"{Metadata_Source}/workspace/load_data_csv/"
"{Metadata_Batch}/{Metadata_Plate}/load_data_with_illum.parquet"
)
def fetch_well_from_parquet(row: pd.Series):
meta_path = S3_LOADDATA_FORMATTER.format(**row)
meta = pd.read_parquet(meta_path, storage_options=REMOTE_STORAGE_OPTION)
wells_from_parquet = meta['Metadata_Well'].unique()
return wells_from_parquet.shape[0]
How the number of wells from plate & well metas are calculated:
wells_info = wells.merge(plates, on=['Metadata_Source', 'Metadata_Plate'])
wells_info = wells_info[["Metadata_Source", "Metadata_Batch", "Metadata_Plate", "Metadata_Well"]]
well_count = wells_info.groupby(['Metadata_Source', 'Metadata_Batch', 'Metadata_Plate'])\
.agg(n_well_on_plate=('Metadata_Well', pd.Series.count))
@NinoDui Thank you so much for flagging this! Can you help us report how prevalent this issue is?
Here's how you'd do it
import pandas as pd source_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13] df_all = pd.DataFrame() for source_id in source_ids: df = load( dataset="cpg0016-jump", source=f"source_{source_id}}", component="load_data_csv", batch="2021_06_14_Batch6", plate="BR00121429", columns=["Metadata_Source", "Metadata_Plate", "Metadata_Well"], ) # keep only distinct rows df = df.drop_duplicates() df_all = df_all.append(df)Now report which wells are present in
well.csv.gz
but absent indf_all
Warning: I haven't checked the code (that was autogenerated by Copilot :D)
This is so helpful! Thank you very much.
Looking at your CSV file, I am relived to note that the only issue across the entire dataset is with that one plate you reported originally. For all others, there are more wells with images than with profiles, and this can happen, as mentioned in one of my previous comments above.
We will recreate the load data parquet for that plate.
This might take a while until we get to it. Will that block you?
Thanks again!
April 2024 Update: I took the CSV in #61 (comment) and turned it into a table below for easy searching.
weird
means more_wells_with_profiles_than_with_images
Metadata_Source | Metadata_Plate | n_well_on_plate | n_well_from_parquet | weird |
---|---|---|---|---|
source_1 | UL001659 | 1471 | 1472 | FALSE |
source_1 | UL001661 | 1355 | 1472 | FALSE |
source_1 | UL001649 | 1471 | 1472 | FALSE |
source_1 | UL000083 | 1419 | 1472 | FALSE |
source_1 | UL000093 | 1471 | 1472 | FALSE |
source_1 | UL001769 | 1471 | 1472 | FALSE |
source_1 | UL001771 | 1471 | 1472 | FALSE |
source_1 | UL000597 | 1309 | 1472 | FALSE |
source_1 | UL001717 | 1471 | 1472 | FALSE |
source_10 | Dest210608-152610 | 383 | 384 | FALSE |
source_10 | Dest210608-153057 | 383 | 384 | FALSE |
source_10 | Dest210614-164418 | 382 | 384 | FALSE |
source_10 | Dest210614-164906 | 383 | 384 | FALSE |
source_10 | Dest210615-151336 | 383 | 384 | FALSE |
source_10 | Dest210622-144809 | 383 | 384 | FALSE |
source_10 | Dest210726-155958 | 383 | 384 | FALSE |
source_10 | Dest210726-160957 | 383 | 384 | FALSE |
source_10 | Dest210726-162012 | 383 | 384 | FALSE |
source_10 | Dest210803-160702 | 384 | 70 | TRUE |
source_10 | Dest210809-134534 | 383 | 384 | FALSE |
source_10 | Dest210810-173723 | 383 | 384 | FALSE |
source_10 | Dest210823-173617 | 383 | 384 | FALSE |
source_11 | EC000004 | 383 | 384 | FALSE |
source_11 | EC000038 | 250 | 384 | FALSE |
source_11 | EC000045 | 189 | 384 | FALSE |
source_11 | EC000048 | 126 | 384 | FALSE |
source_11 | EC000068 | 382 | 383 | FALSE |
source_11 | LM37-70_1 | 383 | 384 | FALSE |
source_11 | EC000137 | 250 | 384 | FALSE |
source_11 | EC000115 | 378 | 381 | FALSE |
source_11 | EC000121 | 381 | 383 | FALSE |
source_11 | EC000094 | 360 | 384 | FALSE |
source_11 | LM71-102_1 | 383 | 384 | FALSE |
source_11 | LM71-102_2 | 382 | 384 | FALSE |
source_2 | 1053601763 | 383 | 384 | FALSE |
source_2 | 1053601770 | 383 | 384 | FALSE |
source_2 | 1053601831 | 383 | 384 | FALSE |
source_2 | 1053601909 | 383 | 384 | FALSE |
source_2 | 1053600674 | 383 | 384 | FALSE |
source_2 | 1053600728 | 383 | 384 | FALSE |
source_2 | 1053600803 | 383 | 384 | FALSE |
source_2 | 1053600810 | 383 | 384 | FALSE |
source_2 | 1053600834 | 383 | 384 | FALSE |
source_2 | 1053600858 | 383 | 384 | FALSE |
source_2 | 1053600872 | 383 | 384 | FALSE |
source_2 | 1053599503 | 378 | 384 | FALSE |
source_2 | 1053599602 | 383 | 384 | FALSE |
source_2 | 1053599657 | 383 | 384 | FALSE |
source_2 | 1086293911 | 383 | 384 | FALSE |
source_2 | 1086293997 | 383 | 384 | FALSE |
source_2 | 1086293423 | 383 | 384 | FALSE |
source_2 | 1086293447 | 383 | 384 | FALSE |
source_2 | 1086293485 | 382 | 384 | FALSE |
source_2 | 1086293027 | 383 | 384 | FALSE |
source_2 | 1086293034 | 383 | 384 | FALSE |
source_2 | 1086293133 | 383 | 384 | FALSE |
source_2 | 1086293164 | 383 | 384 | FALSE |
source_2 | 1086293232 | 383 | 384 | FALSE |
source_2 | 1086292723 | 381 | 384 | FALSE |
source_2 | 1086292761 | 383 | 384 | FALSE |
source_2 | 1086292884 | 383 | 384 | FALSE |
source_2 | 1086292976 | 383 | 384 | FALSE |
source_2 | 1086292389 | 383 | 384 | FALSE |
source_2 | 1086292396 | 383 | 384 | FALSE |
source_2 | 1086292440 | 382 | 384 | FALSE |
source_2 | 1086292495 | 382 | 384 | FALSE |
source_2 | 1086291931 | 383 | 384 | FALSE |
source_2 | 1086291948 | 382 | 384 | FALSE |
source_2 | 1086291962 | 382 | 384 | FALSE |
source_2 | 1086291979 | 382 | 384 | FALSE |
source_2 | 1086291986 | 382 | 384 | FALSE |
source_2 | 1086292013 | 381 | 384 | FALSE |
source_2 | 1086292037 | 383 | 384 | FALSE |
source_2 | 1086292044 | 382 | 384 | FALSE |
source_2 | 1086292051 | 382 | 384 | FALSE |
source_2 | 1086292075 | 380 | 384 | FALSE |
source_2 | 1086292082 | 383 | 384 | FALSE |
source_2 | 1086292099 | 380 | 384 | FALSE |
source_2 | 1086292105 | 382 | 384 | FALSE |
source_2 | 1086292136 | 382 | 384 | FALSE |
source_2 | 1086292143 | 382 | 384 | FALSE |
source_2 | 1086292150 | 383 | 384 | FALSE |
source_2 | 1086289792 | 383 | 384 | FALSE |
source_3 | BR5874c3 | 383 | 384 | FALSE |
source_3 | SP01P04b | 381 | 384 | FALSE |
source_3 | B040203d | 382 | 384 | FALSE |
source_3 | A13407bW | 383 | 384 | FALSE |
source_3 | P24P27dW | 319 | 384 | FALSE |
source_3 | DMSOC45 | 383 | 384 | FALSE |
source_3 | JCPQC036 | 383 | 384 | FALSE |
source_3 | BAY5872a | 383 | 384 | FALSE |
source_3 | BAY5874b | 382 | 384 | FALSE |
source_3 | BAY5874c | 383 | 384 | FALSE |
source_3 | BAY5875a | 383 | 384 | FALSE |
source_3 | BAY5875c | 383 | 384 | FALSE |
source_3 | BAY5876d | 383 | 384 | FALSE |
source_4 | BR00121436 | 383 | 384 | FALSE |
source_4 | BR00121428 | 383 | 384 | FALSE |
source_4 | BR00125181 | 383 | 384 | FALSE |
source_4 | BR00125638 | 382 | 384 | FALSE |
source_4 | BR00121427 | 383 | 384 | FALSE |
source_4 | BR00121426 | 383 | 384 | FALSE |
source_4 | BR00127147 | 382 | 384 | FALSE |
source_5 | ACPJUM032 | 382 | 384 | FALSE |
source_5 | APTJUM230 | 383 | 384 | FALSE |
source_5 | ATSJUM107 | 383 | 384 | FALSE |
source_6 | 1.1E+11 | 383 | 384 | FALSE |
source_6 | 1.1E+11 | 383 | 384 | FALSE |
source_6 | 1.1E+11 | 383 | 384 | FALSE |
source_6 | 1.1E+11 | 383 | 384 | FALSE |
source_6 | 1.1E+11 | 383 | 384 | FALSE |
source_7 | CP1-SC1-08 | 381 | 384 | FALSE |
source_7 | CP1-SC1-09 | 383 | 384 | FALSE |
source_8 | A1170444 | 383 | 384 | FALSE |
source_8 | A1170509 | 383 | 384 | FALSE |
source_8 | A1170531 | 383 | 384 | FALSE |
source_8 | A1170533 | 382 | 384 | FALSE |
source_8 | A1170538 | 383 | 384 | FALSE |
source_8 | A1170542 | 382 | 384 | FALSE |
source_8 | A1170543 | 383 | 384 | FALSE |
source_8 | A1170544 | 383 | 384 | FALSE |
source_8 | A1166130 | 383 | 384 | FALSE |
source_8 | A1166132 | 382 | 384 | FALSE |
source_8 | A1166134 | 383 | 384 | FALSE |
source_8 | A1166137 | 383 | 384 | FALSE |
Cool! Glad to know the issue led to limited effect. Thanks for all the checking and effort behind.
I am taking on DL Model experiments based on the images and I could exclude the issued one manually. That's not blocking for short.
Still hope to hear the issue be settled and benefit a larger group of researchers. You're actually doing something not only meaningful but cool.
Best regards,
NinoDui
This is so helpful! Thank you very much.
Looking at your CSV file, I am relived to note that the only issue across the entire dataset is with that one plate you reported originally. For all others, there are more wells with images than with profiles, and this can happen, as mentioned in one of my previous comments above.
We will recreate the load data parquet for that plate.
This might take a while until we get to it. Will that block you?
Thanks again!