JonathanShor / DoubletDetection

Doublet detection in single-cell RNA-seq data.

Home Page: https://doubletdetection.readthedocs.io/en/stable/

Any systematic validation/benchmarking of this algorithm?

yueqiw opened this issue · comments

Hi,

This algorithm looks great. I'm wondering if it has been systematically validated on multiple real datasets, and whether any papers have been published using this approach? I'm interested in using it as part of my pipeline, but before doing that, I'd like to have an idea of how robust it is beyond the PBMC tutorial. Something like precision/recall or other metrics on benchmarking datasets would be helpful, as well as how many cells are needed for robust performance (200, 1,000, or 5,000 cells).

I just tried it on my data, and it gave me a ~10% doublet rate. That's higher than expected (~3% based on the 10x Genomics user guide), but a good portion of the detected doublets fall into a cluster that I was already suspecting to be doublets (no significant marker genes). This is great, although I'm not quite sure whether I should simply exclude all these cells for re-analysis ...

In addition, ~90% of the cells that are predicted to be doublets in my dataset (using the default p_val cutoff) simply have p=1.0 in almost all iterations, making it hard to further adjust thresholds to reduce the number of detected doublets.

No papers have been published yet, but as stated in the README, we have a bioRxiv submission in the works.

Is your data aggregated from multiple runs? If so, I suggest running on each individual run separately to avoid the high predicted doublet rate.

We have a PR (#99) that will fix the p=1.0 issue. Once it's merged, you could install off the dev-v2.3 branch.

Update: This was merged into master so please reinstall the package. You can use the new threshold plot to evaluate different solutions.
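To make the thresholding concrete, here is a rough sketch of how a per-iteration p-value cutoff and a voting fraction can combine into a final doublet call. The parameter names `p_thresh` and `voter_thresh` are illustrative assumptions here; check the `predict` signature of your installed version for the exact names and defaults.

```python
import numpy as np

def call_doublets(log_p_values, p_thresh=1e-7, voter_thresh=0.9):
    """Call a cell a doublet when a sufficient fraction of fit
    iterations found it significant.
    log_p_values: (n_cells, n_iters) array of log10 p-values,
    one column per iteration."""
    votes = log_p_values <= np.log10(p_thresh)   # significant per iteration
    voting_fraction = votes.mean(axis=1)         # fraction of "doublet" votes
    return voting_fraction >= voter_thresh

# Toy example: 3 cells over 4 iterations.
log_p = np.log10(np.array([
    [1e-9, 1e-8, 1e-10, 1e-9],   # consistently significant
    [0.5,  1.0,  0.8,   1.0],    # never significant
    [1e-9, 0.9,  1e-8,  0.7],    # significant in half the iterations
]))
print(call_doublets(log_p))  # [ True False False]
```

Loosening `voter_thresh` to 0.5 would also flag the third cell, which is why inspecting the threshold plot over a grid of cutoffs is useful before committing to one.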

What kind of data are you applying this to?

Yes, the data was aggregated from 4 runs and there are batch effects in the raw counts. I tried running on individual runs, and it does seem to help. I also reduced n_top_var_genes to 1500 and got fewer predicted doublets (now ~4.5%). What's the recommended range for n_top_var_genes?

My data is mixture of neural progenitors and neurons of different subtypes and developmental stages. I'm trying to get rid of across-type doublets such as neuron+progenitor doublets, or excitatory+inhibitory neurons. They could potentially affect downstream analysis such as developmental trajectory.

No papers have been published yet

Since the algorithm has been available for a while, I'm curious if other research groups are using this approach in their papers. If we include the doublet-filtered data as part of our publication, it would make the review process easier if (1) the doublet-filtering algorithm has been systematically benchmarked, or (2) other papers are also using this. Thanks!

What's the recommended range for n_top_var_genes?

Ideally you would use all genes; the point of this parameter is to reduce runtime without losing performance. That said, you could set n_top_var_genes to 0, which forces use of all genes, and feed in an already-truncated count matrix created with your method of choice. To get fewer doublets, you should use a stricter p_thresh and/or voting_thresh.
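A minimal sketch of pre-truncating the count matrix yourself before passing it in with n_top_var_genes=0. Ranking by plain per-gene variance is a simple stand-in here; a real pipeline might rank by normalized dispersion instead.

```python
import numpy as np

def truncate_to_top_var_genes(raw_counts, n_top=1500):
    """Keep only the n_top most variable genes (columns).
    Plain per-gene variance is used as a simple stand-in for
    fancier highly-variable-gene selection."""
    gene_var = raw_counts.var(axis=0)
    keep = np.argsort(gene_var)[::-1][:n_top]   # indices of top-variance genes
    return raw_counts[:, np.sort(keep)]         # preserve original gene order

rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(100, 5000))     # 100 cells x 5000 genes
truncated = truncate_to_top_var_genes(counts, n_top=1500)
print(truncated.shape)  # (100, 1500)
```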

They could potentially affect downstream analysis such as developmental trajectory.

I would be cautious when running on data that is more continuous as it's possible some of the synthetic cross-type doublets resemble particular cell states. You should inspect various threshold combinations using the new threshold plot.

Finally, the benchmarking/validation will come with the bioRxiv submission (soon).

I would be cautious when running on data that is more continuous as it's possible some of the synthetic cross-type doublets resemble particular cell states.

Yes. This is very tricky. For neighboring cell types/states, their intermediate states may be quite similar to a linear combination of the two cell states. For cell states that are further apart along the developmental trajectory, I guess the earlier cell state would go through a non-linear process (with some genes only activated during the intermediate state) to become the later one, so the synthetic doublets should only resemble true doublets.

Is it possible to adjust the clustering parameters so that we can use more coarse clustering to make synthetic doublets?

The two cells used to make a synthetic doublet are chosen without replacement from the count matrix; there is no clustering step before the synthetic doublets are made. You can, however, make the clustering of the augmented dataset coarser using phenograph_parameters={'k': K} with K greater than the default of 30. This controls how many neighbors are used in PhenoGraph's graph construction prior to modularity optimization. If you want a better understanding of PhenoGraph, I suggest looking at the supplement of its publication.
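The parent-sampling step described above can be sketched roughly as follows. Assumed details (summing parent counts, drawing each pair without replacement) are illustrative; the package's exact sampling may differ.

```python
import numpy as np

def make_synthetic_doublets(raw_counts, n_doublets, rng=None):
    """Create synthetic doublets by combining the counts of two
    parent cells drawn at random (without replacement within each
    pair) from the count matrix. No clustering is involved here."""
    rng = np.random.default_rng(rng)
    n_cells = raw_counts.shape[0]
    parents = np.array([rng.choice(n_cells, size=2, replace=False)
                        for _ in range(n_doublets)])
    # Sum the two parents' expression profiles to form each doublet.
    return raw_counts[parents[:, 0]] + raw_counts[parents[:, 1]]

counts = np.arange(12).reshape(4, 3)          # 4 cells x 3 genes
doublets = make_synthetic_doublets(counts, n_doublets=2, rng=0)
augmented = np.vstack([counts, doublets])     # pool real + synthetic
print(augmented.shape)  # (6, 3)
```

The pooled `augmented` matrix is what then gets clustered; the coarseness of that clustering is what the PhenoGraph `k` parameter tunes.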

You could also run the method as is and ignore predicted doublets in populations you believe to be intermediate states.

Thanks! I misunderstood how the doublet parents are chosen (I thought they were chosen based on the clustering). Just to make sure I understand the algorithm correctly: synthetic doublets and original cells are pooled and clustered together, each cluster is assigned a p-value based on the over-representation of synthetic doublets in that cluster, and then all the cells inside that cluster get the same p-value. Is that correct?

You could also run the method as is and ignore predicted doublets in populations you believe to be intermediate states.

Yes, I think it's a good idea to combine prior knowledge with the classifier result.

You have it right.

If you have cells you suspect are intermediate-state cells in your original dataset, you could try removing those and running the method on the remainder. This might allow the clustering to still capture true doublets, provided they were not all inadvertently swept up with your suspected intermediaries.
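The scheme confirmed above (pool synthetic doublets with real cells, cluster, score each cluster for over-representation of synthetics, and give every cell the p-value of its cluster) can be sketched with a hypergeometric tail test. This is a simplified stand-in for illustration, not necessarily the package's exact statistic.

```python
from math import comb

def cluster_doublet_pvalue(n_synth_in_cluster, cluster_size,
                           n_synth_total, n_total):
    """Hypergeometric tail probability of drawing at least
    n_synth_in_cluster synthetic doublets into a cluster of
    cluster_size cells, out of n_total pooled cells of which
    n_synth_total are synthetic. Every real cell in the cluster
    would inherit this cluster-level p-value."""
    denom = comb(n_total, cluster_size)
    p = 0.0
    for k in range(n_synth_in_cluster, min(cluster_size, n_synth_total) + 1):
        p += (comb(n_synth_total, k)
              * comb(n_total - n_synth_total, cluster_size - k)) / denom
    return p

# 1000 pooled cells, 250 synthetic overall (expected ~12.5 per 50-cell cluster).
p_enriched = cluster_doublet_pvalue(40, 50, 250, 1000)  # 40/50 synthetic
p_typical = cluster_doublet_pvalue(12, 50, 250, 1000)   # about as expected
print(p_enriched < 1e-10, p_typical > 0.05)  # True True
```

A cluster far above the expected synthetic fraction gets a vanishingly small p-value, so all its member cells are flagged; a cluster near the expected fraction does not.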

I would like to mention this preprint which has our method as a top performer.

https://www.biorxiv.org/content/biorxiv/early/2019/02/28/564021.full.pdf

Here is another work that benchmarks different methods for detecting doublets in scRNA-seq data:

Benchmarking Computational Doublet-Detection Methods for Single-Cell RNA Sequencing Data

It seems DoubletDetection comes in second best after DoubletFinder, although I don't see significant differences at a quick glance.