EDePasquale / DoubletDecon

A tool for removing doublets from single-cell RNA-seq data

DoubletDecon - Main_Doublet_Decon - no locations are finite error.

kmshort opened this issue

Hi,

I'm testing out DoubletDecon 1.1.4 on data that has been processed with Seurat's standard workflow, following the example provided on the DoubletDecon GitHub page. However, the results=Main_Doublet_Decon(...) call throws an error.

My data is a Mus musculus dataset straight from 10X CellRanger.
The routine I'm using is here:

library(Seurat)
library(DoubletDecon)

scDataset_dir <-'D:/temp/filtered_feature_bc_matrix'

scDataset.raw.data <- Read10X(data.dir = scDataset_dir)
scDataset <- CreateSeuratObject(counts = scDataset.raw.data, project = "scDataset", min.cells = 3, min.features = 100)

scDataset <- PercentageFeatureSet(scDataset, pattern = "^mt-", col.name = "percent.mt")
scDataset <- subset(scDataset, subset = nFeature_RNA > 150 & nFeature_RNA < 2500)
scDataset <- NormalizeData(scDataset, normalization.method = "LogNormalize", scale.factor = 10000)
scDataset <- FindVariableFeatures(scDataset, selection.method = "vst", nfeatures = 2000)
all.genes.scDataset <- rownames(scDataset)
scDataset <- ScaleData(scDataset, features = all.genes.scDataset, vars.to.regress = "percent.mt")
scDataset <- RunPCA(scDataset, features = VariableFeatures(object = scDataset))
scDataset <- FindNeighbors(scDataset, dims = 1:15)
scDataset <- FindClusters(scDataset, resolution = 0.5)
scDataset <- RunUMAP(scDataset, dims = 1:15)

This all runs fine, then

filename="scDataset_DoubleDecon_test"
newFiles=Improved_Seurat_Pre_Process(scDataset, num_genes=50, write_files=FALSE)

write.table(newFiles$newExpressionFile, paste0(scDataset_dir, filename, "_expression"), sep="\t")
write.table(newFiles$newFullExpressionFile, paste0(scDataset_dir, filename, "_fullExpression"), sep="\t")
write.table(newFiles$newGroupsFile, paste0(scDataset_dir, filename , "_groups"), sep="\t", col.names = F)

This all runs fine too, but then when I do (with species="hsa" changed to "mmu", for mouse; everything else left as in the example code):

results=Main_Doublet_Decon(rawDataFile=newFiles$newExpressionFile, 
                           groupsFile=newFiles$newGroupsFile, 
                           filename=filename, 
                           location=scDataset_dir,
                           fullDataFile=NULL, 
                           removeCC=FALSE, 
                           species="mmu", 
                           rhop=1.1, 
                           write=TRUE, 
                           PMF=TRUE, 
                           useFull=FALSE, 
                           heatmap=FALSE,
                           centroids=TRUE,
                           num_doubs=100, 
                           only50=FALSE,
                           min_uniq=4,
                           nCores=-1)

Gives me:

Loading packages...
Reading data...
WARNING: if using ICGS2 file input, please import 'rawDataFile' and 'groupsFile' as path/location instead of an R object.
Processing raw data...
Combining similar clusters...

Error in (function (side, at = NULL, labels = TRUE, tick = TRUE, line = NA,  : 
  no locations are finite

The output log shows:

filename: scDataset_DoubleDecon_test
location: D:/temp/filtered_feature_bc_matrix
removeCC: FALSE
species: mmu
rhop: 1.1
write: TRUE
PMF: TRUE
useFull: FALSE
heatmap: FALSE
centroids: TRUE
num_doubs: 100
only50: FALSE
min_uniq: 4
Reading data...
Processing raw data...
11581 samples after processing
219 genes after processing
Combining similar clusters...

I'm not entirely sure what is going on here, can anyone enlighten me?

Thanks.

Hi kmshort,

Thanks for reaching out to me about this error. It comes up most frequently when the value chosen for cluster merging (the rhop parameter) is inappropriate for the specific dataset. While 1.1 worked fine for one of the testing datasets I was using (hence its inclusion in the sample code), it is unlikely to work for many datasets. To select an appropriate rhop value I usually use the UI, but as you have pointed out in another thread, this only works on macOS. When I'm testing datasets for DD I will usually take a trial-and-error approach, such as trying rhop values of 0.5, 1, and 1.5. If any of these work, I tweak them up or down until the software generates a cluster merging heatmap that shows the merging I expect (based on my knowledge of the cell types in the data; more on that in a protocol that is currently in progress!). It would be best to kill the run after the heatmap pops up to save you time. Again, this is why the DoubletDeconUI for Windows would be tremendously helpful!
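The trial-and-error search over rhop described above could be scripted roughly as follows. This is only a sketch: `try_rhop_values` and `mock_run` are hypothetical names that are not part of DoubletDecon, and the mock runner fails above an arbitrary threshold purely so the example can be run without the package or any data.

```r
# Hypothetical helper (not part of DoubletDecon): call run_fn once per
# candidate rhop and record whether the run completed or raised an error.
try_rhop_values <- function(run_fn, rhops) {
  outcomes <- vapply(rhops, function(rhop) {
    tryCatch({
      run_fn(rhop)
      "ok"
    }, error = function(e) conditionMessage(e))
  }, character(1))
  data.frame(rhop = rhops, outcome = outcomes, stringsAsFactors = FALSE)
}

# Stand-in runner for illustration only. The 0.7 cutoff is invented;
# a real runner would call Main_Doublet_Decon instead.
mock_run <- function(rhop) {
  if (rhop > 0.7) stop("no locations are finite")
  invisible(TRUE)
}

rhop_report <- try_rhop_values(mock_run, c(0.5, 1, 1.5))
print(rhop_report)
```

With DoubletDecon installed, `run_fn` would instead be a one-argument wrapper around the `Main_Doublet_Decon` call from the original post, passing its argument through as `rhop`. A run that completes without error still needs its cluster merging heatmap checked by eye, so a table like this only narrows down which values are worth inspecting.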

As for the species, that likely isn't related to this error, as that parameter is only used when removeCC=TRUE.

Please let me know if you have any more questions or would like me to elaborate on anything I have mentioned above.

Best,
Erica

Hi Erica,

Thank you. I went down from 1.1, and 0.64 is the highest value that I have gotten to work.

So what should we be looking for in the heatmap? I get something like (image links)
original data
doublets
non doublets

Also, how can we view or re-generate the heatmaps within the "results" object after Main_Doublet_Decon has been run? (I can only view it inline in an rmarkdown script).

Further, how can we re-integrate the results into the starting Seurat object?
I'm trying to wrap my head around results$data_processed and groups_processed.

> results$data_processed[1:14,1:2]
                     row_clusters.flat AAACCCACAGTGTGGA
column_clusters-flat                NA                1
Cped1                                1                0
Meis1                                2                0
Meg3                                 2                0
Hmcn1                                2                1
Pbx1                                 2                0
Slc34a1                              3                0
Maf                                  3                0
Acox1                                3                0
Kcnj15                               3                0
Ndrg1                                3                0
Etnk1                                3                0
Malat1                               3               81
Gm26917                              3                2

The "row_clusters.flat" column is stored as double, and every cell column is stored as character.
results$data_processed[-1,-1] removes the "column_clusters-flat" row and the "row_clusters.flat" column, but I'm not entirely sure what I'm throwing away by doing that.
So I guess what I'm asking is: how can I massage results$data_processed (the new cleaned expression file) into a counts matrix for Seurat processing? Or am I all wrong here? I planned to add it as a counts matrix in another Assay object and reprocess the data, but maybe that is totally unnecessary. Please be kind, I'm new to this.
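The mechanical part of that question, dropping the annotation row and column and turning the character columns into numbers, could look roughly like the sketch below. It runs on a toy data frame shaped like the excerpt above rather than on a real `results` object, and the second barcode is made up. Judging only by their names, the dropped row and column appear to hold the cluster labels for cells and genes, so the sketch keeps them in separate vectors. Whether the remaining values are appropriate as a Seurat counts matrix is not settled in this thread.

```r
# Toy stand-in shaped like the results$data_processed excerpt above.
# Values for the first barcode are copied from the excerpt; the second
# barcode and its values are invented for illustration.
toy <- data.frame(
  row_clusters.flat = c(NA, 1, 2, 3),
  AAACCCACAGTGTGGA = c("1", "0", "0", "81"),
  AAACCCAGTCTTACAG = c("2", "1", "0", "40"),
  row.names = c("column_clusters-flat", "Cped1", "Meis1", "Malat1"),
  stringsAsFactors = FALSE
)

# Save the annotations before dropping them.
gene_clusters <- setNames(toy[-1, 1], rownames(toy)[-1])
cell_clusters <- setNames(as.integer(unlist(toy[1, -1])), colnames(toy)[-1])

# Drop the annotation row and column, then coerce characters to numbers.
expr_mat <- as.matrix(toy[-1, -1, drop = FALSE])
storage.mode(expr_mat) <- "double"

print(expr_mat)
print(cell_clusters)
```

The same indexing applied to `results$data_processed` should give a genes-by-cells numeric matrix, assuming the real object has the layout shown in the excerpt.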

best regards,
Kieran

Indeed, kmshort got to the point. A normal "script user" would start with Seurat (which is the Toyota of single-cell analysis) and then use a pipeline such as DoubletDecon to make sure that what he/she is seeing is correct and not an artifact of doublet cells. DoubletDecon does accept a Seurat object as input; however, the output is very confusing (at least to me). The heatmaps have color annotations for genes and samples (what even is the definition of a sample?) but no legend describing what those colors mean, e.g. you have no clue which original clusters the doublets are actually in (which is the whole point). I am sure this information is stored in one of the many output files, but again I have no clue where to start.

Hi Kieran and Andrei,

Please email me at careyea@mail.uc.edu or eakbaker@gmail.com (to bypass institutional blocks on attachments) to receive the STAR protocol manuscript. I think a lot of your questions will be answered there and your feedback will help me know where to add more information.

Thanks,
Erica