Teichlab / bbknn

Batch balanced KNN

scanorama bbknn

wangjiawen2013 opened this issue

Dear,
Scanorama handles the mutual nearest neighbors-based matching, batch correction, and panorama assembly. I have not found an assembly function in pancreas-4-Scanorama.ipynb. Which function in bbknn (or scanpy) corresponds to scanorama's assembly function?

I have no idea what you're asking. BBKNN/scanpy don't assemble panoramas. BBKNN's output is a batch-balanced graph (which can be used for UMAP, clustering and so on); it does not currently correct the expression space in any way.
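To make "batch-balanced graph" concrete, here is a minimal pure-Python sketch of the idea: each cell picks its k nearest neighbours from every batch separately, so no single batch dominates its neighbourhood. This is an illustration only, not BBKNN's actual implementation; the function name, the toy coordinates and the choice to exclude the cell itself are all assumptions made for the example.

```python
import math

def batch_balanced_neighbors(points, batches, k=2):
    """For each point, collect its k nearest neighbours from every batch separately.

    Hypothetical helper for illustration; not part of the bbknn package.
    """
    batch_ids = sorted(set(batches))
    neighbors = []
    for i, p in enumerate(points):
        picked = []
        for b in batch_ids:
            # Candidates from batch b, excluding the point itself (a simplification).
            candidates = [j for j, bj in enumerate(batches) if bj == b and j != i]
            candidates.sort(key=lambda j: math.dist(p, points[j]))
            picked.extend(candidates[:k])
        neighbors.append(picked)
    return neighbors

# Toy data: two batches that sit far apart in expression space.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0), (5.2, 5.1)]
batches = ["a", "a", "a", "b", "b", "b"]
graph = batch_balanced_neighbors(points, batches, k=2)
```

Every cell ends up with neighbours in both batches even though the batches are well separated, which is why the result is a graph rather than corrected expression values.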

Assuming you're actually asking about how scanorama works, scanorama's data correction is performed by this little timed chunk of code that needs to be written out to bin/4panc.py as per notebook instructions:

t1 = time.time()
datasets, genes = correct(datasets, genes_list)
datasets = [ normalize(ds, axis=1) for ds in datasets ]
t2 = time.time()

This creates a corrected expression space, which is dumped out by save_datasets(datasets, genes, data_names). This is subsequently imported into the notebook and processed in a manner consistent with the other analyses.
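For readers unsure what the normalize(ds, axis=1) call does to the corrected matrices, the sketch below shows row-wise L2 scaling in plain Python. It assumes normalize here is sklearn.preprocessing.normalize with its default L2 norm, which this thread does not confirm; the helper name and the toy matrix are made up for illustration.

```python
import math

def l2_normalize_rows(matrix):
    """Scale each row (cell) to unit Euclidean length; all-zero rows are left unchanged.

    Hypothetical stand-in for normalize(ds, axis=1), assuming an L2 norm.
    """
    normalized = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm for x in row] if norm > 0 else list(row))
    return normalized

# Toy "corrected" matrix: two cells (rows) by three genes (columns).
corrected = [[3.0, 4.0, 0.0], [1.0, 2.0, 2.0]]
rescaled = l2_normalize_rows(corrected)
```

Under that assumption, every cell's expression vector has length 1 afterwards, so cells are compared by direction rather than by overall magnitude.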

According to scanorama's documentation (scanorama.py), the datasets have already been normalized once "datasets, genes = correct(datasets, genes_list)" has been executed. The normalize() function is called inside correct(), so why is it executed again here? Is there a particular purpose?

In pancreas-4-Scanorama.ipynb, the corrected datasets have not been processed with sc.pp.log1p() and sc.pp.normalize_per_cell(), and highly variable genes have not been identified with sc.pp.filter_genes_dispersion(). All of these are necessary in a routine scanpy pipeline. Can these steps be skipped when working with scanorama-corrected datasets?

Notice how scanorama outputs a filtered gene space and altered expression. At the time I cloned scanorama (31.07), a number of scripts in bin/ started like this:

if __name__ == '__main__':
	datasets, genes_list, n_cells = load_names(data_names)
	datasets, genes = correct(datasets, genes_list)
	datasets = [ normalize(ds, axis=1) for ds in datasets ]
	datasets_dimred = dimensionality_reduce(datasets)

I am quite busy at the moment and cannot promise any further assistance in a timely manner.