cpg0012 - error in load_data.csv
tfindley15 opened this issue
Hi there,
Thank you for setting up these amazing datasets! We're having a great time going through them. After some double-checking, I believe there is an error in your load_data.csv files. The AGP and Mito channel filenames are identical. I believe I figured it out...the file names with '_w5' in them are the Mito channel and I can use plate/well/site to correctly assign filenames in the load_data.csv's. If I am correct in that assumption, I have a sqlite db with the correct filenames for the Mito channel if that would be helpful (however only for a subset of the data, ~80 compounds).
Additionally, I am having trouble identifying any brightfield images -- are they not included in this dataset?
Thank you again for your time and efforts!
Cheers, Reese
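A minimal sketch of the reassignment described in the post above: look up the `_w5` file that shares well and site with each row and write it into the Mito column. The file-name layout in `PATTERN`, the helper names `parse_name` and `fix_mito_column`, and the example file names are all assumptions for illustration; only the column names `FileName_OrigAGP` and `FileName_OrigMito` come from the LoadData CSVs discussed in this thread.

```python
import re

import pandas as pd

# Assumed file-name layout: <prefix>_<well>_s<site>_w<channel><suffix>.tif
# (hypothetical; verify against the real file names before relying on it).
PATTERN = re.compile(
    r"_(?P<well>[a-z]\d{2})_s(?P<site>\d+)_w(?P<channel>\d)", re.IGNORECASE
)


def parse_name(filename):
    """Return (well, site, channel) parsed from a file name."""
    match = PATTERN.search(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return match["well"].lower(), int(match["site"]), int(match["channel"])


def fix_mito_column(load_data, image_files, mito_channel=5):
    """Point FileName_OrigMito at the w5 file with the same well and site as the AGP file."""
    # Index the Mito-channel files for this plate by (well, site).
    mito_by_key = {}
    for name in image_files:
        well, site, channel = parse_name(name)
        if channel == mito_channel:
            mito_by_key[(well, site)] = name

    def lookup(agp_name):
        well, site, _ = parse_name(agp_name)
        # A KeyError here means no w5 file exists for that well and site.
        return mito_by_key[(well, site)]

    fixed = load_data.copy()
    fixed["FileName_OrigMito"] = fixed["FileName_OrigAGP"].map(lookup)
    return fixed
```

The function works on one plate at a time, so the plate part of the plate/well/site key is implied by which directory listing is passed as `image_files`.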
Hi Reese, thank you for your interest in the dataset and letting us know about the possible error. Since cpg0012 is our old dataset that was only reprocessed, it probably has some inconsistencies when compared to the other datasets that were generated as a part of the JUMP project.
Regarding your question about the load_data.csv files, I am tagging @ErinWeisbart who recently processed this dataset and may know more about the load_data.csv files.
> Additionally, I am having trouble identifying any brightfield images -- are they not included in this dataset?
That's correct. Only the five-channel fluorescent images are available for cpg0012.
Thanks for flagging this @tfindley15!
I suspect something went awry when creating the LoadData CSV files, and that in turn might have trickled all the way downstream:
library(readr)  # provides read_csv()
df <- read_csv("https://cellpainting-gallery.s3.amazonaws.com/cpg0012-wawer-bioactivecompoundprofiling/broad/workspace/profiles/CDRP/24277/24277.csv.gz", show_col_types = FALSE)
all(df$Cells_Intensity_IntegratedIntensity_AGP == df$Cells_Intensity_IntegratedIntensity_Mito)
# [1] TRUE
all(df$Cells_Granularity_10_AGP == df$Cells_Granularity_10_Mito)
# [1] TRUE
all(df$Cells_Granularity_10_AGP == df$Cells_Granularity_10_RNA)
# [1] FALSE
More once we dig further into this.
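The spot check above compares two feature pairs by hand. A rough Python sketch of the same idea applied to every AGP/Mito pair in a profile table follows; the helper name `identical_channel_features` is hypothetical, and pairing columns by their channel suffix is an assumption about the feature naming (it fits names such as `Cells_Granularity_10_AGP`, but has not been checked against every column in the profiles).

```python
import pandas as pd


def identical_channel_features(profiles, channel_a="AGP", channel_b="Mito"):
    """List channel_a feature columns whose channel_b counterpart holds the same values."""
    identical = []
    suffix = f"_{channel_a}"
    for col in profiles.columns:
        if not col.endswith(suffix):
            continue
        # Swap the trailing channel name to find the counterpart column.
        partner = col[: -len(channel_a)] + channel_b
        if partner not in profiles.columns:
            continue
        # Element-wise comparison; rows containing NaN compare as unequal.
        if (profiles[col] == profiles[partner]).all():
            identical.append(col)
    return identical
```

On a correctly processed plate the returned list would be expected to be empty or close to it, since two different stains should rarely produce identical measurements.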
Unfortunately, I was able to confirm that I made a mistake in .csv creation and both AGP and Mito were mapped to the same channel for all .csvs. Thanks for your patience while we get this fixed.
import os

import pandas as pd

# Collect the first row of each plate's LoadData CSVs;
# one row per plate is enough to see the channel mapping.
mapdf = pd.DataFrame()
mapillumdf = pd.DataFrame()
plates = os.listdir('load_data_csv/CDRP/')
for plate in plates:
    if os.path.exists(f'load_data_csv/CDRP/{plate}/load_data.csv'):
        loaddata = pd.read_csv(f'load_data_csv/CDRP/{plate}/load_data.csv')
        loaddataillum = pd.read_csv(f'load_data_csv/CDRP/{plate}/load_data_with_illum.csv')
        mapdf = pd.concat([mapdf, loaddata.iloc[[0]]])
        mapillumdf = pd.concat([mapillumdf, loaddataillum.iloc[[0]]])

# Keep only the FileName columns present in load_data.csv.
keepcols = [col for col in mapdf.columns if 'FileName' in col]
mapdf = mapdf[keepcols].copy()
mapillumdf = mapillumdf[keepcols].copy()

# Reduce each file name to its channel token (e.g. 'w1'):
# the first two characters after the last underscore.
# n is passed by keyword because pandas 2.x no longer accepts it positionally.
for col in keepcols:
    mapdf[col] = mapdf[col].str.rsplit('_', n=1, expand=True)[1]
    mapdf[col] = mapdf[col].str[:2]
    print('load_data', col, mapdf[col].unique())
    mapillumdf[col] = mapillumdf[col].str.rsplit('_', n=1, expand=True)[1]
    mapillumdf[col] = mapillumdf[col].str[:2]
    print('load_data_with_illum', col, mapillumdf[col].unique())
load_data FileName_OrigDNA ['w1']
load_data_with_illum FileName_OrigDNA ['w1']
load_data FileName_OrigER ['w2']
load_data_with_illum FileName_OrigER ['w2']
load_data FileName_OrigRNA ['w3']
load_data_with_illum FileName_OrigRNA ['w3']
load_data FileName_OrigAGP ['w4']
load_data_with_illum FileName_OrigAGP ['w4']
load_data FileName_OrigMito ['w4']
load_data_with_illum FileName_OrigMito ['w4']
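The output above shows both `FileName_OrigAGP` and `FileName_OrigMito` resolving to `w4`. A small sketch of how that condition could be flagged automatically, for example as a sanity check after generating LoadData CSVs, is given below. The helper names `channel_token` and `columns_sharing_channel` are hypothetical, and the check assumes the same file-name convention as the script above (channel token in the first two characters after the last underscore).

```python
import pandas as pd


def channel_token(filename):
    """Return the two-character channel token (e.g. 'w4') after the last underscore."""
    return filename.rsplit("_", 1)[1][:2]


def columns_sharing_channel(load_data):
    """Map each channel token to the FileName_Orig* columns using it, keeping only shared tokens."""
    by_token = {}
    for col in load_data.columns:
        # Only the raw-image columns carry a channel token; illumination files are skipped.
        if not col.startswith("FileName_Orig"):
            continue
        for token in load_data[col].map(channel_token).unique():
            by_token.setdefault(token, []).append(col)
    return {token: cols for token, cols in by_token.items() if len(cols) > 1}
```

An empty dictionary means every raw-image column points at its own channel; any entry names the columns that collide.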
Hi @tfindley15 -
Thanks again for catching this mistake. All of cpg0012 has now been corrected - I fixed the load_data.csvs and re-ran the analysis and generated new profiles with the help of @niranjchandrasekaran.