JonathanShor / DoubletDetection

Doublet detection in single-cell RNA-seq data.

Home Page: https://doubletdetection.readthedocs.io/en/stable/


scanpy normalization

adamgayoso opened this issue · comments

Thanks to @yueqiw for trying this. We could add this as an alternative normalization procedure. What do you think @JonathanShor @ambrosejcarr?

import scanpy as sc  # scanpy.api is deprecated; import the package directly
from scipy.sparse import issparse

def scanpy_normalizer(count_data):
    adata = sc.AnnData(X=count_data)
    # Record total counts per cell so they can be regressed out later.
    if issparse(adata.X):
        adata.obs['n_counts'] = adata.X.sum(axis=1).A1
    else:
        adata.obs['n_counts'] = adata.X.sum(axis=1)
    sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)
    # Keep only highly variable genes (thresholds follow the Seurat-style defaults).
    filter_result = sc.pp.filter_genes_dispersion(
        adata.X, min_mean=0.02, max_mean=3, min_disp=0.8)
    adata = adata[:, filter_result.gene_subset]
    sc.pp.log1p(adata)
    sc.pp.regress_out(adata, ['n_counts'], n_jobs=8)
    sc.pp.scale(adata, max_value=10)
    return adata.X
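For readers without scanpy installed, here is a rough NumPy-only sketch of the core steps (per-cell count normalization, log1p, and per-gene scaling; the dispersion filtering and regress_out steps are omitted). The function name and the small epsilon guard are mine, not scanpy's:

```python
import numpy as np
from scipy.sparse import issparse, csr_matrix

def simple_normalizer(count_data, counts_per_cell_after=1e4, max_value=10):
    """NumPy approximation of the scanpy steps above: normalize each
    cell to a fixed total, log-transform, then z-score each gene and
    clip extreme values at max_value."""
    X = count_data.toarray() if issparse(count_data) else np.asarray(count_data, dtype=float)
    n_counts = X.sum(axis=1, keepdims=True)              # total counts per cell
    X = X / n_counts * counts_per_cell_after             # like sc.pp.normalize_per_cell
    X = np.log1p(X)                                      # like sc.pp.log1p
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)   # like sc.pp.scale (per gene)
    return np.clip(X, -max_value, max_value)

counts = csr_matrix(np.array([[1, 0, 3], [2, 2, 0]], dtype=float))
out = simple_normalizer(counts)
```

With two cells, each gene's z-scored column has mean 0 and values of magnitude 1, which makes the effect of the per-gene scaling easy to inspect.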

Check whether gene filtering could be done once on raw_counts, with the gene indices stored for later use.
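A minimal sketch of that idea, using a simple variance/mean dispersion as an illustrative stand-in for scanpy's dispersion filter (the function name and threshold are mine): compute the gene mask once on the raw counts, store it, and reuse it for every augmented matrix.

```python
import numpy as np

def dispersion_gene_mask(raw_counts, min_disp=0.8):
    """Select variable genes once on raw counts (dispersion =
    variance / mean) and return a boolean mask that can be cached
    and applied to each augmented count matrix later."""
    mean = raw_counts.mean(axis=0)
    var = raw_counts.var(axis=0)
    disp = np.divide(var, mean, out=np.zeros_like(var), where=mean > 0)
    return disp >= min_disp

raw = np.array([[1., 0., 5.], [1., 0., 0.], [1., 1., 9.]])
mask = dispersion_gene_mask(raw)     # computed once on raw counts
filtered = raw[:, mask]              # reuse `mask` on augmented counts too
```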

I suppose this would go in plot.py with normalize_counts?

We should break both out into a utils.py, or even a normalizers.py.

@yueqiw would you be interested in creating a PR with your code?

I can comment if you summarize at a high level what this buys us.

@ambrosejcarr This provides users an easier way to use scanpy preprocessing. That said, since we already allow a custom normalizer function, there may be no need to explicitly add this to our code. Do you think it's worth adding given its popularity?

Do you think it's worth adding given its popularity?

If you already have the code, I don't think it would hurt.

I'd be interested but probably won't have time in the next week or so... I can provide my thoughts on using this normalization method.

Parameters and robustness. Since there are quite a few parameters involved (in the filter_genes_dispersion and regress_out functions), we need to decide which keyword arguments to expose, and users would need to choose parameters for their dataset. I chose these parameters based on my actual analysis in Seurat, so users should be aware that they need to go through the Seurat or Scanpy pipeline to see which parameters make sense. I believe the method is quite robust and should work in most cases, but I've only worked with a few datasets.
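One way to keep those parameters configurable without hard-coding them is to expose them as keyword arguments and let users bind dataset-specific values with functools.partial. The normalizer below is illustrative (not the scanpy pipeline), but the binding pattern applies equally to scanpy_normalizer:

```python
import numpy as np
from functools import partial

def normalizer(counts, counts_per_cell_after=1e4, max_value=10):
    """Illustrative normalizer with tunable keyword arguments,
    mirroring the kinds of parameters discussed above."""
    X = counts / counts.sum(axis=1, keepdims=True) * counts_per_cell_after
    return np.clip(np.log1p(X), None, max_value)

# Bind dataset-specific parameters once, then pass the resulting
# callable wherever a normalizer function is expected.
gentle = partial(normalizer, counts_per_cell_after=1e3)
out = gentle(np.ones((2, 4)))
```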

Flexibility. I really like the way custom normalizer functions can be easily plugged in. Alternatively, the scanpy normalization could be described in the tutorial as an example of a custom normalizer function.
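To illustrate that plug-in pattern (the Classifier here is a toy stand-in, not the actual DoubletDetection classifier): any callable that maps a count matrix to a normalized matrix can be swapped in at construction time.

```python
import numpy as np

def default_normalizer(counts):
    # Library-size normalize, then log-transform.
    return np.log1p(counts / counts.sum(axis=1, keepdims=True) * 1e4)

class Classifier:
    """Toy stand-in showing the normalizer plug-in pattern."""
    def __init__(self, normalizer=default_normalizer):
        self.normalizer = normalizer

    def fit(self, raw_counts):
        # The classifier only ever sees counts through self.normalizer,
        # so users can substitute e.g. scanpy_normalizer here.
        self.norm_counts_ = self.normalizer(raw_counts)
        return self

clf = Classifier(normalizer=lambda c: np.log1p(c))  # custom normalizer
clf.fit(np.ones((2, 3)))
```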

I agree with @yueqiw about showing how it can be used in a tutorial. I would alter the code slightly so that filter_result.gene_subset is calculated only on the raw, non-augmented counts, for an additional speed-up.

Closing this due to inactivity.