Teichlab / bbknn

Batch balanced KNN

Understanding `neighbors_within_batch` parameter?

chris-rands opened this issue

Thanks for the nice tool! I'm trying to conceptually understand the `neighbors_within_batch` parameter. I read the docstring, but I'm still not clear on exactly what it means. Is it 'k' when `approx=True`? Setting this value higher leads to a more spread out UMAP (i.e. less correction), which may be preferable for some datasets? Is there a reason for the default value of 3?

bbknn/bbknn/__init__.py

Lines 216 to 218 in 7e736d4

neighbors_within_batch : ``int``, optional (default: 3)
    How many top neighbours to report for each batch; total number of neighbours
    will be this number times the number of batches.

Thanks for the kind words, sorry for the slightly delayed reply - I need to start regularly checking the email tied to my GitHub again.

BBKNN performs a KNN search for each batch individually, and then merges the resulting neighbour lists together. This parameter is the k for that search, for each batch. The default of 3 stems from the fact that when computing the KNN within the batch a particular cell is from, the returned neighbours will include the cell itself, regardless of the neighbour identification algorithm. That leaves two actual neighbours from the cell's own batch, and going any lower than that feels excessive. The value can be adjusted if desired, but is kept low as it tends to lead to better correction (as you noticed) while also improving run time.
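To make the per-batch search and merge concrete, here is a minimal NumPy sketch. It is not BBKNN's actual implementation: the function name `batch_balanced_knn` is made up for illustration, it uses brute-force Euclidean distances rather than the neighbour search the package itself uses, and it assumes every batch has at least `neighbors_within_batch` cells.

```python
import numpy as np

def batch_balanced_knn(X, batches, neighbors_within_batch=3):
    """Hypothetical helper: for every cell, find its k nearest cells in each
    batch separately, then concatenate the per-batch neighbour lists."""
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    merged = []
    for label in np.unique(batches):
        # Global indices of the cells belonging to this batch
        members = np.flatnonzero(batches == label)
        # Euclidean distance from every cell to every cell of this batch
        diff = X[:, None, :] - X[None, members, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        # k closest batch members per cell, mapped back to global indices
        nearest = np.argsort(dist, axis=1, kind="stable")[:, :neighbors_within_batch]
        merged.append(members[nearest])
    # One row per cell, k columns per batch
    return np.hstack(merged)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))              # 30 cells, 5 features (toy data)
batches = np.repeat(["a", "b", "c"], 10)  # 3 batches of 10 cells each
knn = batch_balanced_knn(X, batches, neighbors_within_batch=3)
print(knn.shape)  # (30, 9): 3 neighbours x 3 batches
```

Each row holds three neighbours from every batch, and for the cell's own batch one of those three is the cell itself, which is why the default leaves only two genuine same-batch neighbours.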