cpg0012 - error in load_data.csv

Question

tfindley15 opened this issue 2 years ago

Hi there,

Thank you for setting up these amazing datasets! We're having a great time going through them. After some double-checking, I believe there is an error in your load_data.csv files. The AGP and Mito channel filenames are identical. I believe I figured it out...the file names with '_w5' in them are the Mito channel and I can use plate/well/site to correctly assign filenames in the load_data.csv's. If I am correct in that assumption, I have a sqlite db with the correct filenames for the Mito channel if that would be helpful (however only for a subset of the data, ~80 compounds).

Additionally, I am having trouble identifying any brightfield images -- are they not included in this dataset?

Thank you again for your time and efforts!

Cheers, Reese

Niranj Chandrasekaran · Answer 1 · Sat Jan 21 2023 01:10:40 GMT+0800 (China Standard Time)

Hi Reese, thank you for your interest in the dataset and letting us know about the possible error. Since cpg0012 is our old dataset that was only reprocessed, it probably has some inconsistencies when compared to the other datasets that were generated as a part of the JUMP project.

Regarding your question about the load_data.csv files, I am tagging @ErinWeisbart who recently processed this dataset and may know more about the load_data.csv files.

> Additionally, I am having trouble identifying any brightfield images -- are they not included in this dataset?

That's correct. Only the five-channel fluorescent images are available for cpg0012.

Shantanu Singh · Answer 2 · Tue Jan 24 2023 04:21:48 GMT+0800 (China Standard Time)

Thanks for flagging this @tfindley15!

I suspect something went awry when creating the LoadData CSV files, and that in turn might have trickled all the way downstream into the profiles:

library(readr)

# Load the profiles for plate 24277
df <- read_csv("https://cellpainting-gallery.s3.amazonaws.com/cpg0012-wawer-bioactivecompoundprofiling/broad/workspace/profiles/CDRP/24277/24277.csv.gz", show_col_types = FALSE)

# AGP and Mito features are identical, which should not happen for distinct channels
all(df$Cells_Intensity_IntegratedIntensity_AGP == df$Cells_Intensity_IntegratedIntensity_Mito)
# [1] TRUE
all(df$Cells_Granularity_10_AGP == df$Cells_Granularity_10_Mito)
# [1] TRUE
# Control: AGP and RNA differ, as expected
all(df$Cells_Granularity_10_AGP == df$Cells_Granularity_10_RNA)
# [1] FALSE

More once we dig further into this.
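For anyone who wants to run the same kind of check across every feature rather than two hand-picked ones, here is a minimal Python sketch. The helper name `duplicated_channel_features` and the toy DataFrame are hypothetical, and the sketch assumes the profile columns end in `_<channel>` as in the column names above; it has not been run against the real profiles.

```python
import pandas as pd


def duplicated_channel_features(df, channel_a, channel_b):
    """Return columns ending in _<channel_a> whose values exactly match
    the corresponding _<channel_b> column (hypothetical helper)."""
    suffix_a = f"_{channel_a}"
    matches = []
    for col_a in df.columns:
        if not col_a.endswith(suffix_a):
            continue
        # Build the counterpart column name for the other channel
        col_b = col_a[: -len(suffix_a)] + f"_{channel_b}"
        if col_b in df.columns and df[col_a].equals(df[col_b]):
            matches.append(col_a)
    return matches


# Toy data standing in for a profile table (values are made up)
toy = pd.DataFrame({
    "Cells_Intensity_IntegratedIntensity_AGP": [1.0, 2.0, 3.0],
    "Cells_Intensity_IntegratedIntensity_Mito": [1.0, 2.0, 3.0],
    "Cells_Intensity_IntegratedIntensity_RNA": [4.0, 5.0, 6.0],
})

agp_vs_mito = duplicated_channel_features(toy, "AGP", "Mito")
agp_vs_rna = duplicated_channel_features(toy, "AGP", "RNA")
print(agp_vs_mito)
print(agp_vs_rna)
```

A non-empty result for two channels that should be independent points to the same kind of mapping problem seen here.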

Erin Weisbart · Answer 3 · Wed Jan 25 2023 04:54:35 GMT+0800 (China Standard Time)

Unfortunately, I was able to confirm that I made a mistake in .csv creation and both AGP and Mito were mapped to the same channel for all .csvs. Thanks for your patience while we get this fixed.

import os

import pandas as pd

# Collect the first row of each plate's LoadData CSVs
mapdf = pd.DataFrame()
mapillumdf = pd.DataFrame()
plates = os.listdir('load_data_csv/CDRP/')
for plate in plates:
    if os.path.exists(f'load_data_csv/CDRP/{plate}/load_data.csv'):
        loaddata = pd.read_csv(f'load_data_csv/CDRP/{plate}/load_data.csv')
        loaddataillum = pd.read_csv(f'load_data_csv/CDRP/{plate}/load_data_with_illum.csv')
        mapdf = pd.concat([mapdf, loaddata.iloc[[0]]])
        mapillumdf = pd.concat([mapillumdf, loaddataillum.iloc[[0]]])

# Keep only the filename columns that appear in load_data.csv
keepcols = [col for col in mapdf.columns if 'FileName' in col]
mapdf = mapdf[keepcols]
mapillumdf = mapillumdf[keepcols]

# Reduce each filename to its channel token: the first two characters
# after the last underscore (e.g. 'w1'). n must be passed by keyword
# in recent pandas versions.
for col in keepcols:
    mapdf[col] = mapdf[col].str.rsplit('_', n=1, expand=True)[1]
    mapdf[col] = mapdf[col].str[:2]
    print('load_data', col, mapdf[col].unique())
    mapillumdf[col] = mapillumdf[col].str.rsplit('_', n=1, expand=True)[1]
    mapillumdf[col] = mapillumdf[col].str[:2]
    print('load_data_with_illum', col, mapillumdf[col].unique())

load_data FileName_OrigDNA ['w1']
load_data_with_illum FileName_OrigDNA ['w1']
load_data FileName_OrigER ['w2']
load_data_with_illum FileName_OrigER ['w2']
load_data FileName_OrigRNA ['w3']
load_data_with_illum FileName_OrigRNA ['w3']
load_data FileName_OrigAGP ['w4']
load_data_with_illum FileName_OrigAGP ['w4']
load_data FileName_OrigMito ['w4']
load_data_with_illum FileName_OrigMito ['w4']
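
The output above shows `FileName_OrigAGP` and `FileName_OrigMito` both resolving to `w4`. A rough sketch of how that condition could be flagged automatically is below. The helper name `columns_sharing_a_channel` and the toy filenames are made up for illustration; the only assumption carried over from the script above is that the channel token is the first two characters after the last underscore.

```python
from collections import defaultdict

import pandas as pd


def columns_sharing_a_channel(load_data):
    """Map each channel token to the FileName_Orig* columns that use it,
    keeping only tokens used by more than one column (hypothetical helper)."""
    token_to_cols = defaultdict(list)
    for col in load_data.columns:
        if not col.startswith("FileName_Orig"):
            continue
        # Last underscore-delimited piece, then its first two characters
        tokens = load_data[col].str.rsplit("_", n=1).str[-1].str[:2].unique()
        for token in tokens:
            token_to_cols[token].append(col)
    return {t: cols for t, cols in token_to_cols.items() if len(cols) > 1}


# Toy rows with invented filenames that mimic the faulty mapping
toy_load_data = pd.DataFrame({
    "FileName_OrigDNA": ["plate_a01_s1_w1CCCC.tif", "plate_a01_s2_w1DDDD.tif"],
    "FileName_OrigAGP": ["plate_a01_s1_w4AAAA.tif", "plate_a01_s2_w4BBBB.tif"],
    "FileName_OrigMito": ["plate_a01_s1_w4AAAA.tif", "plate_a01_s2_w4BBBB.tif"],
})

shared = columns_sharing_a_channel(toy_load_data)
print(shared)
```

An empty dictionary would mean every channel column points at its own wavelength token, which is what a corrected load_data.csv should produce.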

Erin Weisbart · Answer 4 · Wed Feb 08 2023 01:44:48 GMT+0800 (China Standard Time)

Hi @tfindley15 -
Thanks again for catching this mistake. All of cpg0012 has now been corrected - I fixed the load_data.csvs and re-ran the analysis and generated new profiles with the help of @niranjchandrasekaran.

Teresa Findley · Answer 5 · Wed Feb 08 2023 08:25:20 GMT+0800 (China Standard Time)

No problem, thanks for all the updates!
