Teichlab / bbknn

Batch balanced KNN

edge weights

wangjiawen2013 opened this issue · comments

Dear,
I am unfamiliar with graph theory. Why do you convert the neighbour distance collections to exponentially related connectivities? How do you assign weights to the edges? Does BBKNN construct the connectivity graph with the Jaccard index (which is used in Seurat and Scanpy for louvain clustering)?

Hello,

When you run the scanpy workflow, you call scanpy.api.pp.neighbors() before you can run your clustering. What happens under the hood is that each cell has its top neighbours identified, and those neighbours, along with the actual distance captured by the metric of choice, are fed into connectivity conversion. The distances are placed on the X axis, the nearest neighbour is set to 1 on the Y axis (when using default parameters), and the rest of the Y axis forms an exponential distribution (with the rate of decline depending on another parameter). This Y axis is reported back as connectivities, with an extra processing step finding mutual neighbours and replacing their two connectivity values C1 and C2 with C1+C2-C1*C2.
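The distance-to-connectivity conversion can be sketched roughly like this (a simplified illustration, not scanpy's actual implementation - the real code fits the decay scale per cell, and `sigma` here is an assumed stand-in for that parameter):

```python
import numpy as np

def distances_to_connectivities(dists, sigma=1.0):
    """Convert one cell's neighbour distances to connectivities.

    Simplified sketch of the UMAP-style conversion: the nearest
    neighbour is anchored at connectivity 1, and farther neighbours
    decay exponentially. `sigma` (an assumed name) controls the rate
    of decline; the real algorithm fits it per cell.
    """
    dists = np.asarray(dists, dtype=float)
    rho = dists.min()  # distance to the nearest neighbour
    return np.exp(-(dists - rho) / sigma)

# Example: three neighbours at increasing distances - the nearest
# gets connectivity 1, the rest fall off exponentially
conn = distances_to_connectivities([0.5, 1.0, 2.0])
```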

In BBKNN, we only alter the first step of the process - instead of identifying the top neighbours in the entire cell pool, we split the dataset into batches and find each cell's neighbours in every batch. We then pass this information into the same connectivity computing function that scanpy.api.pp.neighbors() uses. The resulting graph is created in the same way as in the scanpy workflow; we just alter the neighbour list to account for the batches in the data.
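That batch-wise neighbour search can be illustrated with a minimal brute-force sketch (an assumed simplification for clarity, not the package's actual code, which uses proper kNN indices):

```python
import numpy as np

def batch_balanced_neighbours(X, batches, k=2):
    """Find each cell's k nearest neighbours within every batch separately.

    Simplified sketch of BBKNN's first step: instead of one global kNN
    search over all cells, each batch is queried on its own, so every
    cell receives k neighbours from each batch (including its own,
    where it is its own nearest neighbour at distance 0).
    """
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    idx_parts, dist_parts = [], []
    for b in np.unique(batches):
        members = np.where(batches == b)[0]
        # pairwise distances from every cell to this batch's cells
        d = np.linalg.norm(X[:, None, :] - X[members][None, :, :], axis=2)
        order = np.argsort(d, axis=1)[:, :k]
        idx_parts.append(members[order])  # map back to global indices
        dist_parts.append(np.take_along_axis(d, order, axis=1))
    # every cell ends up with k neighbours from each of the batches
    return np.hstack(idx_parts), np.hstack(dist_parts)
```

The stacked index and distance arrays then play the same role as the output of a single global kNN search, and are handed to the connectivity conversion unchanged.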

Does C1 equal C2? Because the distances between the mutual neighbours are the same.

Also, in the BBKNN bioRxiv paper, the dataset was down-sampled to guarantee more even cell population sizes. So is it still robust when the sample/subpopulation sizes differ a lot? How about using the raw dataset (where the sizes are uneven)? Is it necessary to set a different k value (top k neighbours) according to the number of cells per batch?

The distance between the neighbours is the same, but this can result in different values from the exponential distribution in the procedure I described, since the distribution is scaled per cell. As such, C1 would not equal C2, but replacing both of those values with C1+C2-C1*C2 "fixes" that. I'd like to reiterate that this is in no way BBKNN specific - this is what the UMAP neighbour graph construction algorithm does, and that algorithm is used by scanpy.api.pp.neighbors().

Check Supplementary Figure 4 - you can see the complete mouse atlas collection analysed with BBKNN in b and c. I think the downsampling was largely presentation driven, as you can see the HSCs overpower the manifold in sheer size. It has nothing to do with robustness to differing dataset sizes - the pancreas data was quite uneven in terms of cell count. I did not perform this part of the analysis; you'll need to contact Jong-Eun Park if you wish to discuss this further.