title	author	date
kBET short introduction	Maren Büttner	9/18/2017

kBET - k-nearest neighbour batch effect test

The R package provides a test for batch effects in high-dimensional single-cell RNA sequencing data. It evaluates the accordance of replicates based on Pearson's $\chi^2$ test. First, the algorithm creates k-nearest neighbour matrix and choses 10% of the samples to check the batch label distribution in its neighbourhood. If the local batch label distribution is sufficiently similar to the global batch label distribution, the $\chi^2$-test does not reject the null hypothesis (that is "all batches are well-mixed"). The neighbourhood size k is fixed for all tests. Next, the test returns a binary result for each of the tested samples. Finally, the result of kBET is the average test rejection rate. The lower the test result, the less bias is introduced by the batch effect. kBET is very sensitive to any kind of bias. If kBET returns an average rejection rate of 1 for your batch-corrected data, you may also consider to compute the average silhouette width and PCA-based batch-effect measures to explore the degree of the batch effect. Learn more about kBET and batch effect correction in our bioRxiv pre-print.

Installation

Installation should take less than 5 min.

Via Github and devtools

If you want to install the package directly from Github, I recommend to use the devtools package.

library(devtools)
install_github('theislab/kBET')

Manually

Please download the package as zip archive and install it via

install.packages('kBET.zip', repos = NULL, type = 'source')

Usage of the kBET function:

#data: a matrix (rows: samples, columns: features (genes))
#batch: vector or factor with batch label of each cell 
batch.estimate <- kBET(data, batch)

kBET creates (if plot = TRUE) a boxplot of the kBET rejection rates (for neighbourhoods and randomly chosen subsets of size k) and kBET returns a list with several parts:

summary: summarizes the test results (with 95% confidence interval)
results: the p-values of all tested samples
average.pval: an average over all p-values of the tested samples
stats: the results for each of n_repeat runs - they can be used to reproduce the boxplot that is returned by kBET
params: the parameters used in kBET
outsider: samples without mutual nearest neighbour, their batch labels and a p-value whether their batch label composition varies from the global batch label frequencies

For a single-cell RNAseq dataset with less than 1,000 samples, the estimated run time is less than 2 minutes.

Plot kBET's rejection rate

By default (plot = TRUE), kBET returns a boxplot of observed and expected rejection rates for a data set. You might want to turn off the display of these plots and create them elsewhere. kBET returns all information that is needed in the stats part of the results.

library(ggplot2)
batch.estimate <- kBET(data, batch, plot=FALSE)
plot.data <- data.frame(class=rep(c('observed', 'expected'), 
                                  each=length(batch.estimate$stats$kBET.observed)), 
                        data =  c(batch.estimate$stats$kBET.observed,
                                  batch.estimate$stats$kBET.expected))
g <- ggplot(plot.data, aes(class, data)) + geom_boxplot() + 
     labs(x='Test', y='Rejection rate',title='kBET test results') +
     theme_bw() +  
     scale_y_continuous(limits=c(0,1))

Variations:

The standard implementation of kBET performs a k-nearest neighbour search (if knn = NULL) with a pre-defined neighbourhood size k0, computes an optimal neighbourhood size (heuristics = TRUE) and finally 10% of the samples is randomly chosen to compute the test statistics itself (repeatedly by default to derive a confidence interval, n_repeat = 100). For repeated runs of kBET, we recommend to run the k-nearest neighbour search separately:

require('FNN')
# data: a matrix (rows: samples, columns: features (genes))
k0=floor(mean(table(batch))) #neighbourhood size: mean batch size 
knn <- get.knn(data, k=k0, algorithm = 'cover_tree')

#now run kBET with pre-defined nearest neighbours.
batch.estimate <- kBET(data, batch, k = k0, knn = knn)

Compute a silhouette width and PCA-based measure:

#data: a matrix (rows: samples, columns: features (genes))
#batch: vector or factor with batch label of each cell 
pca.data <- prcomp(data, center=TRUE) #compute PCA representation of the data
batch.silhouette <- batch_sil(pca.data, batch)
batch.pca <- pcRegression(pca.data, batch)