jump-cellpainting / datasets

Images and other data from the JUMP Cell Painting Consortium

cpg0012 - error in load_data.csv

tfindley15 opened this issue

Hi there,

Thank you for setting up these amazing datasets! We're having a great time going through them. After some double-checking, I believe there is an error in your load_data.csv files: the AGP and Mito channel filenames are identical. I think I've figured it out: the file names with '_w5' in them are the Mito channel, and I can use plate/well/site to assign the correct filenames in the load_data.csv files. If that assumption is correct, I have a sqlite db with the correct Mito filenames if that would be helpful (though only for a subset of the data, ~80 compounds).

Additionally, I am having trouble identifying any brightfield images -- are they not included in this dataset?

Thank you again for your time and efforts!

Cheers, Reese
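The reassignment described in the post above (treating the '_w5' files as the Mito channel and matching them to rows by plate/well/site) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the method used by the dataset maintainers: the function name `reassign_mito_filenames` is hypothetical, the column names `FileName_OrigAGP` and `FileName_OrigMito` are taken from the load_data.csv output later in this thread, and the sketch assumes that everything before the last underscore in a file name identifies the plate/well/site and is shared by all channels of the same field of view.

```python
import pandas as pd


def reassign_mito_filenames(load_data: pd.DataFrame, image_files: list[str]) -> pd.DataFrame:
    """Return a copy of load_data whose Mito filenames point at the '_w5' images.

    Assumption: the text before the last underscore of a file name is the same
    for every channel of one field of view, and the text after it starts with
    the channel token (w1..w5).
    """
    # Map each shared prefix to the w5 (assumed Mito) file found in the listing.
    w5_by_prefix = {}
    for name in image_files:
        prefix, _, tail = name.rpartition("_")
        if tail.startswith("w5"):
            w5_by_prefix[prefix] = name

    fixed = load_data.copy()
    # Derive the prefix of each row from its AGP filename, then look up the w5 file.
    prefixes = fixed["FileName_OrigAGP"].str.rsplit("_", n=1).str[0]
    fixed["FileName_OrigMito"] = prefixes.map(w5_by_prefix)
    return fixed
```

Rows whose prefix has no '_w5' file in the listing end up with a missing value in `FileName_OrigMito`, which makes gaps easy to spot before the CSV is written back out.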

Hi Reese, thank you for your interest in the dataset and letting us know about the possible error. Since cpg0012 is our old dataset that was only reprocessed, it probably has some inconsistencies when compared to the other datasets that were generated as a part of the JUMP project.

Regarding your question about the load_data.csv files, I am tagging @ErinWeisbart who recently processed this dataset and may know more about the load_data.csv files.

> Additionally, I am having trouble identifying any brightfield images -- are they not included in this dataset?

That's correct. Only the five-channel fluorescent images are available for cpg0012.

Thanks for flagging this @tfindley15!

I suspect something went awry when creating the LoadData CSV files, and that in turn might have trickled all the way downstream into the profiles:

library(readr)  # provides read_csv()
df <- read_csv("https://cellpainting-gallery.s3.amazonaws.com/cpg0012-wawer-bioactivecompoundprofiling/broad/workspace/profiles/CDRP/24277/24277.csv.gz", show_col_types = FALSE)

all(df$Cells_Intensity_IntegratedIntensity_AGP == df$Cells_Intensity_IntegratedIntensity_Mito)
# [1] TRUE
all(df$Cells_Granularity_10_AGP == df$Cells_Granularity_10_Mito)
# [1] TRUE
all(df$Cells_Granularity_10_AGP == df$Cells_Granularity_10_RNA)
# [1] FALSE
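The spot checks above compare two hand-picked features. A broader version of the same idea, sketched here in Python with pandas, would scan every feature column for one channel and report those whose counterpart in another channel holds identical values. The function name `identical_channel_features` is hypothetical, and the sketch assumes the channel name is the final underscore-separated part of the column name (as in `Cells_Granularity_10_AGP`); features that carry the channel name elsewhere in the column name are not covered.

```python
import pandas as pd


def identical_channel_features(profiles: pd.DataFrame,
                               channel_a: str = "AGP",
                               channel_b: str = "Mito") -> list[str]:
    """List channel_a feature columns whose channel_b counterpart is identical."""
    identical = []
    for col in profiles.columns:
        # Only consider columns that end in the channel name, e.g. "..._AGP".
        if not col.endswith(f"_{channel_a}"):
            continue
        partner = col[: -len(channel_a)] + channel_b
        # Series.equals compares values element-wise and treats NaNs in the
        # same positions as equal.
        if partner in profiles.columns and profiles[col].equals(profiles[partner]):
            identical.append(col)
    return identical
```

On a correctly processed plate one would expect this list to be empty or nearly so; a list covering most AGP features would point to the two channels having been measured from the same images.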

More once we dig further into this.

Unfortunately, I was able to confirm that I made a mistake in .csv creation and both AGP and Mito were mapped to the same channel for all .csvs. Thanks for your patience while we get this fixed.

import os

import pandas as pd

# Collect the first row of each plate's LoadData CSVs.
mapdf = pd.DataFrame()
mapillumdf = pd.DataFrame()
plates = os.listdir('load_data_csv/CDRP/')
for plate in plates:
    if os.path.exists(f'load_data_csv/CDRP/{plate}/load_data.csv'):
        loaddata = pd.read_csv(f'load_data_csv/CDRP/{plate}/load_data.csv')
        loaddataillum = pd.read_csv(f'load_data_csv/CDRP/{plate}/load_data_with_illum.csv')
        mapdf = pd.concat([mapdf, loaddata.iloc[[0]]])
        mapillumdf = pd.concat([mapillumdf, loaddataillum.iloc[[0]]])
# Keep only the filename columns (column names taken from load_data.csv).
keepcols = [col for col in mapdf.columns if 'FileName' in col]
mapdf = mapdf[keepcols].copy()
mapillumdf = mapillumdf[keepcols].copy()
for col in keepcols:
    # Reduce each filename to its channel token: the first two characters
    # after the last underscore (e.g. 'w1'). n must be passed by keyword
    # in pandas >= 2.0.
    mapdf[col] = mapdf[col].str.rsplit('_', n=1, expand=True)[1]
    mapdf[col] = mapdf[col].str[:2]
    print('load_data', col, mapdf[col].unique())
    mapillumdf[col] = mapillumdf[col].str.rsplit('_', n=1, expand=True)[1]
    mapillumdf[col] = mapillumdf[col].str[:2]
    print('load_data_with_illum', col, mapillumdf[col].unique())

load_data FileName_OrigDNA ['w1']
load_data_with_illum FileName_OrigDNA ['w1']
load_data FileName_OrigER ['w2']
load_data_with_illum FileName_OrigER ['w2']
load_data FileName_OrigRNA ['w3']
load_data_with_illum FileName_OrigRNA ['w3']
load_data FileName_OrigAGP ['w4']
load_data_with_illum FileName_OrigAGP ['w4']
load_data FileName_OrigMito ['w4']
load_data_with_illum FileName_OrigMito ['w4']
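In the output above, FileName_OrigAGP and FileName_OrigMito both resolve to 'w4'. A check of this kind could also be turned into a small helper that flags any channel token claimed by more than one filename column. The following is a minimal sketch rather than part of the actual pipeline: the function name `shared_channel_tokens` is hypothetical, and it assumes that the raw-image columns are named `FileName_Orig*` and that the channel token is the first two characters after the last underscore of each file name.

```python
import pandas as pd


def shared_channel_tokens(load_data: pd.DataFrame) -> dict[str, list[str]]:
    """Map each channel token to the FileName_Orig* columns using it, keeping only clashes."""
    columns_by_token: dict[str, list[str]] = {}
    for col in load_data.columns:
        if not col.startswith("FileName_Orig"):
            continue
        # Channel token: first two characters after the last underscore, e.g. 'w4'.
        tokens = load_data[col].str.rsplit("_", n=1).str[1].str[:2]
        for token in tokens.dropna().unique():
            columns_by_token.setdefault(token, []).append(col)
    # A token used by more than one column means two channels share the same images.
    return {tok: cols for tok, cols in columns_by_token.items() if len(cols) > 1}
```

An empty dictionary means every filename column points at its own channel; for the uncorrected files shown above, the result would be expected to contain a 'w4' entry listing both the AGP and Mito columns.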

Hi @tfindley15 -
Thanks again for catching this mistake. All of cpg0012 has now been corrected - I fixed the load_data.csvs and re-ran the analysis and generated new profiles with the help of @niranjchandrasekaran.