JonathanShor / DoubletDetection

Doublet detection in single-cell RNA-seq data.

Home Page: https://doubletdetection.readthedocs.io/en/stable/


scanpy normalization

adamgayoso opened this issue · comments

Thanks to @yueqiw for trying this. We could add this as an alternative normalization procedure. What do you think @JonathanShor @ambrosejcarr?

import scanpy as sc  # scanpy.api is deprecated; import the package directly
from scipy.sparse import issparse

def scanpy_normalizer(count_data):
    adata = sc.AnnData(X=count_data)
    # Record total counts per cell so they can be regressed out later.
    if issparse(adata.X):
        adata.obs['n_counts'] = adata.X.sum(axis=1).A1
    else:
        adata.obs['n_counts'] = adata.X.sum(axis=1)
    sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)
    # Keep only highly variable genes (thresholds follow the Seurat-style defaults).
    filter_result = sc.pp.filter_genes_dispersion(
        adata.X, min_mean=0.02, max_mean=3, min_disp=0.8)
    adata = adata[:, filter_result.gene_subset]
    sc.pp.log1p(adata)
    sc.pp.regress_out(adata, ['n_counts'], n_jobs=8)
    sc.pp.scale(adata, max_value=10)
    return adata.X
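For readers without scanpy installed, here is a rough NumPy-only sketch of the core steps (per-cell count normalization, log1p, and per-gene scaling; the dispersion filtering and regress_out steps are omitted). The function name and the small epsilon guard are mine, not scanpy's:

```python
import numpy as np
from scipy.sparse import issparse, csr_matrix

def simple_normalizer(count_data, counts_per_cell_after=1e4, max_value=10):
    """NumPy approximation of the scanpy steps above: normalize each
    cell to a fixed total, log-transform, then z-score each gene and
    clip extreme values at max_value."""
    X = count_data.toarray() if issparse(count_data) else np.asarray(count_data, dtype=float)
    n_counts = X.sum(axis=1, keepdims=True)              # total counts per cell
    X = X / n_counts * counts_per_cell_after             # like sc.pp.normalize_per_cell
    X = np.log1p(X)                                      # like sc.pp.log1p
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)   # like sc.pp.scale (per gene)
    return np.clip(X, -max_value, max_value)

counts = csr_matrix(np.array([[1, 0, 3], [2, 2, 0]], dtype=float))
out = simple_normalizer(counts)
```

With two cells, each gene's z-scored column has mean 0 and values of magnitude 1, which makes the effect of the per-gene scaling easy to inspect.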

Check whether gene filtering could be done once on raw_counts, with the gene indices stored for later use.
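A minimal sketch of that idea, using a simple variance/mean dispersion as an illustrative stand-in for scanpy's dispersion filter (the function name and threshold are mine): compute the gene mask once on the raw counts, store it, and reuse it for every augmented matrix.

```python
import numpy as np

def dispersion_gene_mask(raw_counts, min_disp=0.8):
    """Select variable genes once on raw counts (dispersion =
    variance / mean) and return a boolean mask that can be cached
    and applied to each augmented count matrix later."""
    mean = raw_counts.mean(axis=0)
    var = raw_counts.var(axis=0)
    disp = np.divide(var, mean, out=np.zeros_like(var), where=mean > 0)
    return disp >= min_disp

raw = np.array([[1., 0., 5.], [1., 0., 0.], [1., 1., 9.]])
mask = dispersion_gene_mask(raw)     # computed once on raw counts
filtered = raw[:, mask]              # reuse `mask` on augmented counts too
```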

I suppose this would go in plot.py with normalize_counts?

We should break both out into a utils.py, or even a normalizers.py.

@yueqiw would you be interested in creating a PR with your code?

I can comment if you summarize at a high level what this buys us.

@ambrosejcarr This provides users an easier way to use scanpy preprocessing. That said, since we already allow a custom normalizer function, there may be no need to explicitly add this to our code. Do you think it's worth adding given its popularity?

Do you think it's worth adding given its popularity?

If you already have the code, I don't think it would hurt.

I'd be interested but probably won't have time in the next week or so... I can provide my thoughts on using this normalization method.

Parameters and robustness. Since there are quite a few parameters involved (in the filter_genes_dispersion and regress_out functions), we need to decide which keyword arguments to expose, and users would need to choose parameters for their dataset. I chose these parameters based on my actual analysis in Seurat, so users should be aware that they need to go through the Seurat or Scanpy pipeline to see which parameters make sense. I believe the method is quite robust and should work in most cases, but I've only worked with a few datasets.
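One way to keep those parameters configurable without hard-coding them is to expose them as keyword arguments and let users bind dataset-specific values with functools.partial. The normalizer below is illustrative (not the scanpy pipeline), but the binding pattern applies equally to scanpy_normalizer:

```python
import numpy as np
from functools import partial

def normalizer(counts, counts_per_cell_after=1e4, max_value=10):
    """Illustrative normalizer with tunable keyword arguments,
    mirroring the kinds of parameters discussed above."""
    X = counts / counts.sum(axis=1, keepdims=True) * counts_per_cell_after
    return np.clip(np.log1p(X), None, max_value)

# Bind dataset-specific parameters once, then pass the resulting
# callable wherever a normalizer function is expected.
gentle = partial(normalizer, counts_per_cell_after=1e3)
out = gentle(np.ones((2, 4)))
```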

Flexibility. I really like the way custom normalizer functions can be easily plugged in. Alternatively, the scanpy normalization could be described in the tutorial as an example of a custom normalizer function.
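To illustrate that plug-in pattern (the Classifier here is a toy stand-in, not the actual DoubletDetection classifier): any callable that maps a count matrix to a normalized matrix can be swapped in at construction time.

```python
import numpy as np

def default_normalizer(counts):
    # Library-size normalize, then log-transform.
    return np.log1p(counts / counts.sum(axis=1, keepdims=True) * 1e4)

class Classifier:
    """Toy stand-in showing the normalizer plug-in pattern."""
    def __init__(self, normalizer=default_normalizer):
        self.normalizer = normalizer

    def fit(self, raw_counts):
        # The classifier only ever sees counts through self.normalizer,
        # so users can substitute e.g. scanpy_normalizer here.
        self.norm_counts_ = self.normalizer(raw_counts)
        return self

clf = Classifier(normalizer=lambda c: np.log1p(c))  # custom normalizer
clf.fit(np.ones((2, 3)))
```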

I agree with @yueqiw about showing how it can be used in a tutorial. I would alter the code slightly so that filter_result.gene_subset is calculated only on the raw, non-augmented counts, for an additional speed-up.

Closing this due to inactivity.