This guide covers managing VCF files with BCFTools: compressing, indexing, listing chromosomes, counting variants, filtering, and comparing or merging multiple VCF files. The commands are written for use in an HPC (High-Performance Computing) environment.
First, create and activate a Conda environment for installing and running BCFTools.
# Create a Conda environment
conda create --name myenv
# Activate the Conda environment
conda activate myenv
# Install BCFTools and HTSlib (HTSlib provides the bgzip and tabix commands used below)
conda install -c conda-forge -c bioconda bcftools htslib
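Before moving on, it can be worth confirming that the commands used in the following steps are on the PATH of the activated environment. A minimal sketch, assuming a Bash-compatible shell; `require_tools` is a hypothetical helper name, not part of BCFTools or Conda:

```shell
# require_tools is a hypothetical helper: it reports every command that
# cannot be found on PATH and returns non-zero if any are missing.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: check the three commands this guide relies on.
# require_tools bcftools bgzip tabix && echo "all tools found"
```

Running the check at the top of a batch job script makes a missing tool fail early rather than partway through the analysis.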
To efficiently manage large VCF files, compress and index them using bgzip and tabix.
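# Compress the .vcf file using bgzip (replaces filename.vcf with filename.vcf.gz)
bgzip filename.vcf
# Create a tabix index (filename.vcf.gz.tbi) for the bgzip-compressed VCF
tabix -p vcf filename.vcf.gz
# Alternatively, index with BCFTools (writes filename.vcf.gz.csi by default); one index is enough
bcftools index filename.vcf.gz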
Retrieve and save the list of unique chromosomes present in the VCF file.
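# Print the chromosome of every record (one line per variant)
bcftools query -f '%CHROM\n' filename.vcf.gz
# Count the unique chromosomes (uniq is sufficient because an indexed VCF keeps each chromosome's records together)
bcftools query -f '%CHROM\n' filename.vcf.gz | uniq | wc -l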
# Save the list of unique chromosomes in a text file
bcftools query -f '%CHROM\n' filename.vcf.gz | uniq > chromosomes.txt
# Display the contents of the text file
cat chromosomes.txt
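Collapsing the CHROM column with `uniq` relies on all records for a chromosome being adjacent. For input where that is not guaranteed, a sketch of an order-independent alternative follows; `unique_chroms` is a hypothetical helper name that reads one chromosome name per line and keeps the first occurrence of each:

```shell
# unique_chroms is a hypothetical helper: reads chromosome names (one per
# line) on stdin and prints each name once, in order of first appearance,
# even when records for a chromosome are not adjacent.
unique_chroms() {
    awk '!seen[$0]++'
}

# Example (assumes filename.vcf.gz from the steps above):
# bcftools query -f '%CHROM\n' filename.vcf.gz | unique_chroms > chromosomes.txt
```

Unlike `sort -u`, this keeps the chromosomes in the order they occur in the file.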
Create a shell script to count all variants/mutations per chromosome and save it as chromosome_count.sh.
# Create a .sh file
touch chromosome_count.sh
# Edit the .sh file using nano
nano chromosome_count.sh
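# Add the following content to the file
#!/bin/bash
# Count records per chromosome; requires chromosomes.txt and an indexed filename.vcf.gz
while read -r chrom; do
    # -r restricts output to one chromosome; grep -v -c counts the non-header lines
    count=$(bcftools view -r "$chrom" filename.vcf.gz | grep -v -c '^#')
    echo "$chrom:$count"
done < chromosomes.txt
# Save the file, then run the script
bash chromosome_count.sh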
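The loop runs `bcftools view` once per chromosome. For a quick tally, the same counts can be obtained in a single pass by counting how often each CHROM value occurs. A minimal sketch, where `count_per_chrom` is a hypothetical helper that reads one chromosome name per line and prints `chrom:count` in order of first appearance:

```shell
# count_per_chrom is a hypothetical helper: reads chromosome names (one per
# line) on stdin and prints "chrom:count" for each distinct name.
count_per_chrom() {
    awk '
        !($0 in n) { order[++k] = $0 }   # remember first-seen order
        { n[$0]++ }                      # tally records per chromosome
        END { for (i = 1; i <= k; i++) print order[i] ":" n[order[i]] }
    '
}

# Example (assumes filename.vcf.gz from the steps above):
# bcftools query -f '%CHROM\n' filename.vcf.gz | count_per_chrom
```

This variant needs neither chromosomes.txt nor an index, since it reads the file once from start to end.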
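Use BCFTools to count the variants shared by multiple VCF files. Without an output directory (-p), bcftools isec prints one line per matching site, so wc -l gives the number of shared sites.
# Count sites present in all three VCF files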
bcftools isec -n=3 filename1.snps.vcf.gz filename2.snps.vcf.gz filename3.snps.vcf.gz | wc -l
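A single count hides how sites are distributed across the files. When `bcftools isec` prints a site list, each line ends with a presence string such as `110` (one digit per input file). The sketch below tallies those strings; `isec_summary` is a hypothetical helper, and the five-column layout (CHROM, POS, REF, ALT, presence string) is an assumption worth checking against the output of your BCFTools version:

```shell
# isec_summary is a hypothetical helper: reads the site list printed by
# `bcftools isec` (assumed columns: CHROM, POS, REF, ALT, presence string)
# and prints how many sites carry each presence string.
isec_summary() {
    awk -F'\t' '{ n[$5]++ } END { for (m in n) print m "\t" n[m] }' | LC_ALL=C sort -r
}

# Example: -n+1 reports sites present in at least one of the three files.
# bcftools isec -n+1 filename1.snps.vcf.gz filename2.snps.vcf.gz filename3.snps.vcf.gz | isec_summary
```

A line such as `111` followed by a count then corresponds to the sites shared by all three files.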
Retain only the variants that have a filter status of "PASS".
# Keep only variants whose FILTER column is PASS
bcftools view -f PASS input.vcf > output.vcf
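To see how many records a PASS filter keeps or drops, it can help to tally the FILTER column first. A sketch assuming uncompressed VCF text on stdin; `filter_summary` is a hypothetical helper, and it relies on FILTER being the seventh tab-separated column of a VCF record:

```shell
# filter_summary is a hypothetical helper: reads VCF text on stdin and
# prints "count<TAB>FILTER value", most frequent value first.
filter_summary() {
    awk -F'\t' '
        /^#/ { next }          # skip header lines
        { n[$7]++ }            # tally each FILTER value
        END { for (f in n) print n[f] "\t" f }
    ' | LC_ALL=C sort -k1,1nr -k2,2
}

# Example (input.vcf as in the step above):
# filter_summary < input.vcf
```

The count reported for `PASS` should match the number of records in output.vcf.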
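Find the records shared by both input VCF files, matching by position regardless of alleles (-c all).
# Write the records of the first file that are also present in the second (-w1 makes the output a VCF rather than a site list)
bcftools isec -n=2 -c all -w1 -O v -o normal_tumor_common.vcf normal_sample.vcf.gz tumor_sample.vcf.gz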
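Merge multiple VCF files, each holding different samples, into one multi-sample VCF.
# Merge VCF files (--merge all allows multiallelic records for all variant types)
bcftools merge --merge all -O v -o normal_tumor_merge.vcf normal_sample.vcf.gz tumor_sample.vcf.gz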
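`bcftools merge` expects sample names to be unique across the input files and stops with an error on a clash unless `--force-samples` is given. A small sketch for checking this beforehand; `duplicate_samples` is a hypothetical helper that reads sample names, one per line, and prints those that occur more than once:

```shell
# duplicate_samples is a hypothetical helper: reads sample names (one per
# line) on stdin and prints every name that occurs more than once.
duplicate_samples() {
    sort | uniq -d
}

# Example: list the sample names of both inputs and report any clash.
# { bcftools query -l normal_sample.vcf.gz; bcftools query -l tumor_sample.vcf.gz; } | duplicate_samples
```

Empty output means the sample names do not collide and the merge can proceed as written.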
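Compare VCF files and find the variants present only in the first file. Swap the file order to get the variants unique to the other sample.
# Write variants found in the normal sample but not in the tumor sample (-C is the complement; -w1 makes the output a VCF)
bcftools isec -C -w1 -O v -o normal_tumor_unique.vcf normal_sample.vcf.gz tumor_sample.vcf.gz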
This guide should help you manage VCF files effectively using BCFTools in an HPC environment. For more advanced usage and options, refer to the BCFTools documentation.