High doublet rate?

yesitsjess opened this issue 3 months ago

I'm getting 33.4% of my cells (barcodes) predicted to be doublets (27.5% with clusters=FALSE), whereas I've read that something in the region of 10% is more usual. Any suggestions on what might have caused this, or am I doing something wrong?

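library(DropletUtils)  # read10xCounts
library(scater)        # logNormCounts, runPCA, runUMAP, plotReducedDim
library(scDblFinder)   # fastcluster, scDblFinder

# read 10x cellranger count output (one directory per sample)
sce <- read10xCounts(paste0(data_dir, samps_dir, "/outs/filtered_feature_bc_matrix"), sample.names = samps_dir)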

# log normalise, perform PCA and generate UMAP
sce <- scater::logNormCounts(sce)
sce <- scater::runPCA(sce)
sce <- scater::runUMAP(sce)
plotReducedDim(sce, "UMAP")

# get clusters to run doublet finding function using cluster information
sce$cluster <- fastcluster(sce)

# identify suspected doublets
sce <- scDblFinder(sce, clusters="cluster")
# sce <- scDblFinder(sce, clusters=FALSE) # alternatively

table(sce$scDblFinder.class)

I've also tried quickly clustering myself (rather than using fastcluster) and still get 23.2% doublets called.

g <- scran::buildSNNGraph(sce)
cl <- igraph::cluster_fast_greedy(g)$membership
sce$cluster <- cl

My dataset is basically all the same cell type, so I would expect a low number of clusters. Will this affect things? Also, I haven't done any additional QC here; I'm just using the output from cellranger count (empty droplets filtered out). I was planning to import the doublet predictions from scDblFinder as a QC step in my main pipeline, because I'm using cellbender remove-background and wasn't sure whether that would render my counts incompatible with doublet detection.

scDblFinder v1.16.0

Pierre-Luc · Answer 1 · Wed Jul 17 2024 21:46:20 GMT+0800 (China Standard Time)

Hi,
What are samps_dir and ncol(sce)?

Jess · Answer 2 · Wed Jul 17 2024 22:04:27 GMT+0800 (China Standard Time)

samps_dir is a vector containing the sample directory names (as output by the cellranger count runs):
[1] "SITTA8" "SITTB8" "SITTC8" "SITTD7" "SITTD8" "SITTE7" "SITTE8" "SITTF7" "SITTF8" "SITTG7" "SITTG8" "SITTH8"

> ncol(sce)
[1] 75861

Pierre-Luc · Answer 3 · Wed Jul 17 2024 23:23:33 GMT+0800 (China Standard Time)

It's always a good idea to read the "Getting started" documentation:
https://plger.github.io/scDblFinder/articles/scDblFinder.html#multiple-samples

Pierre-Luc · Answer 4 · Wed Jul 17 2024 23:29:18 GMT+0800 (China Standard Time)

and https://plger.github.io/scDblFinder/articles/scDblFinder.html#im-getting-way-too-many-doublets-called---whats-going-on
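
A rough back-of-the-envelope illustration of why pooling samples can inflate the calls. It assumes the common 10x rule of thumb of roughly 1% doublets per 1000 cells captured (the exact default used by scDblFinder may differ between versions) and, purely for illustration, an equal split of cells across the 12 samples listed above; the helper `expected_rate` is hypothetical and not part of any package.

```r
# Figures taken from this thread
n_cells   <- 75861  # ncol(sce)
n_samples <- 12     # length(samps_dir)

# ASSUMPTION: ~1% doublets per 1000 cells captured (10x rule of thumb)
rate_per_1k <- 0.01

# Hypothetical helper: expected doublet rate for a capture of a given size
expected_rate <- function(cells_in_capture, per_1k = rate_per_1k) {
  per_1k * cells_in_capture / 1000
}

# All cells treated as one single capture
pooled <- expected_rate(n_cells)

# ASSUMPTION: cells split evenly across the samples (real sizes will vary)
per_sample <- expected_rate(n_cells / n_samples)

print(round(c(pooled = pooled, per_sample = per_sample), 3))
```

Under these assumptions, a single capture of that size would be expected to be mostly doublets, whereas each individual sample lands at a few percent, which is closer to the figure the question had in mind.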

Jess · Answer 5 · Thu Jul 18 2024 01:14:16 GMT+0800 (China Standard Time)

So run it sample by sample and not on the whole dataset. Thanks, I'll try it.
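
For reference, a minimal sketch of what running it sample by sample can look like, using the `samples` argument described in the linked "Multiple samples" section. The data here are simulated with `mockDoubletSCE()` and the sample labels are made up so the example is self-contained; with an object created by `read10xCounts()` as in the question, the sample names should instead be available in the `Sample` column of the colData.

```r
library(SingleCellExperiment)
library(scDblFinder)

set.seed(42)

# Simulated stand-in for real data: two cell populations plus some doublets
sce <- mockDoubletSCE(ncells = c(400, 600))

# HYPOTHETICAL sample labels, assigned at random for illustration only.
# With read10xCounts(), the per-cell sample name is stored in sce$Sample.
sce$Sample <- sample(c("sampleA", "sampleB"), ncol(sce), replace = TRUE)

# samples= tells scDblFinder to treat each sample as a separate capture
sce <- scDblFinder(sce, samples = "Sample")

# Doublet calls broken down by sample
calls <- table(sample = sce$Sample, class = sce$scDblFinder.class)
print(calls)
```

Applied to the object from the question, the call would presumably be `scDblFinder(sce, samples = "Sample")`, optionally with a `BPPARAM` argument to process samples in parallel.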