HaploNet is a framework for inferring haplotype and population structure using neural networks in an unsupervised approach for phased haplotypes of whole-genome sequencing (WGS) data. We utilize a variational autoencoder (VAE) framework to learn mappings to and from a low-dimensional latent space in which we will perform indirect clustering of haplotypes with a Gaussian mixture prior (Gaussian Mixture Variational Autoencoder).
Please cite our paper in Genome Research: Haplotype and Population Structure Inference using Neural Networks in Whole-Genome Sequencing Data.
The HaploNet framework relies on the following Python packages that you can install through conda (recommended) or pip:
- pytorch
- numpy
- cython
- scipy
- scikit-allel
Follow the link to find more information on how to install PyTorch for your setup (GPU/CPU). You can create an environment through conda easily as follows:
conda env create -f environment.yml
git clone https://github.com/Rosemeis/HaploNet.git
cd HaploNet
python setup.py build_ext --inplace
pip3 install -e .
You can now run HaploNet with the haplonet
command.
For phased haplotype data for a single chromosome as a VCF file of M SNPs and N individuals, we can generate a M x 2N haplotype matrix using scikit-allel.
haplonet convert --vcf chr1.vcf.gz --out chr1
# Saves int8 NumPy matrix in binary format (chr1.npy)
HaploNet can now be trained directly on the generated haplotype matrix as follows (using default parameters and on GPU):
haplonet train --geno chr1.npy --cuda --out haplonet
# You can also use the VCF directly as input
haplonet train --vcf chr1.vcf.gz --cuda --out haplonet
HaploNet outputs the neural network log-likelihoods by default which are used to infer global population structure (PCA and admixture). With the '--latent' argument, the parameters of the learnt latent spaces of the GMVAE can be saved as well. See all available options in HaploNet with the following command:
haplonet -h
haplonet train -h # training haplonet
haplonet admix -h # estimate ancestry
haplonet pca -h # perform pca
haplonet convert -h # convert VCF file to NumPy binary format
All the following analyses assume that HaploNet has been run for all chromosomes such that the filepaths of the log-likelihoods for each chromosomes are in one file. The argument "--like" can be used if you only have one chromosome or merged file.
The EM algorithm in HaploNet can be run with K=2 and 64 threads (CPU based).
haplonet admix --filelist chr.loglike.list --K 2 --threads 64 --seed 0 --out haplonet.admixture.k2
And the admixture proportions can as an example be plotted in R as follows:
q <- read.table("haplonet.admixture.k2.q")
barplot(t(q), space=0, border=NA, col=c("dodgerblue3", "firebrick2"), xlab="Individuals", ylab="Proportions", main="HaploNet - Admixture")
Estimate eigenvectors directly using SVD (recommended for big datasets):
haplonet pca --filelist chr.loglike.list --threads 64 --out haplonet
e <- as.matrix(read.table("haplonet.eigenvecs"))
plot(e[,1:2], main="HaploNet - PCA", xlab="PC1", ylab="PC2")
Compute the covariance matrix followed by eigendecomposition in R:
haplonet pca --filelist chr.loglike.list --cov --threads 64 --out haplonet
C <- as.matrix(read.table("haplonet.cov"))
e <- eigen(C)
plot(e$vectors[,1:2], main="HaploNet - PCA", xlab="PC1", ylab="PC2")