cumc / xqtl-protocol

Molecular QTL analysis protocol developed by ADSP Functional Genomics Consortium

Home Page: https://cumc.github.io/xqtl-protocol/

Bug in bulk_expression_QC.ipynb RLE candidate outlier output

grennfp opened this issue

I've been testing the bulk_expression_QC.ipynb notebook to run QC on a 559-sample RNA-seq dataset. All three steps (hierarchical clustering, D-statistic correlations, RLE) produce candidate outliers, but none of the candidates overlap across steps, so the final outlier count is zero.

I noticed that the samples on the right of the RLE plot (those with high IQRs) are not the samples printed to the log file. The samples logged for the RLE step are the last 5% of the samples in the input TPM matrix, not the actual RLE outliers.

I believe the issue lies in this line of code:

RLEFilterList <- unique(bymedian[((length(bymedian)-ExpPerSample*RLEFilterLength)+1):length(bymedian)]) #filtered

Replacing bymedian with levels(bymedian) seemed to fix the issue. Using this code gave me the correct RLE outlier samples:

RLEFilterList <- unique(levels(bymedian)[((length(levels(bymedian))-(RLEFilterLength))+1):(length(levels(bymedian))+1)])

The correct RLE outliers produced from this change overlapped with candidate outliers from the hierarchical clustering and D-statistic steps, unlike before the change when there were no overlaps.
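
To make the difference concrete, here is a minimal toy sketch of the two indexing strategies. It assumes bymedian is a factor built with reorder() over a long-format table (one row per gene per sample), so that its values keep the input row order while its levels are sorted by per-sample IQR. The issue does not show how the notebook constructs bymedian, so that construction, the table name rle_long, and all sample names and values below are made up for illustration.

```r
# Toy long-format RLE table; sample names and values are made up.
# Rows follow input-matrix column order: S1, S2, S3, S4.
rle_long <- data.frame(
  Sample = rep(c("S1", "S2", "S3", "S4"), each = 3),
  TPM    = c(-5, 0, 5,  -1, 0, 1,  -3, 0, 3,  -0.5, 0, 0.5)
)

# Assumption: bymedian comes from reorder(), so its levels are sorted by
# per-sample IQR while its values keep the original row order.
bymedian <- with(rle_long, reorder(Sample, TPM, IQR))

ExpPerSample    <- 3  # rows (genes) per sample in the toy table
RLEFilterLength <- 1  # number of samples to flag

# Indexing the factor itself selects the trailing rows, i.e. the last
# sample(s) in input order, regardless of IQR.
n <- length(bymedian)
by_position <- as.character(
  unique(bymedian[(n - ExpPerSample * RLEFilterLength + 1):n])
)

# Indexing the levels selects the sample(s) with the largest IQR.
by_level <- tail(levels(bymedian), RLEFilterLength)

print(by_position)  # "S4": last in input order, smallest IQR
print(by_level)     # "S1": largest IQR
```

The sketch uses tail() rather than an explicit index range. One thing to check in the replacement line above: an upper bound of length(levels(bymedian))+1 indexes one past the end of the levels vector, which in R returns NA, so the resulting list may carry a trailing NA entry.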

commented

hmm @grennfp I think this is worth a Zoom discussion ... maybe between you and @hsun3163 is good enough for starters, then Hao can fill me in. Could you guys arrange something offline for next week? You can also show this to us during the Monday WG meeting. Thanks for looking carefully at the diagnostic plot and catching the possible bug!