acostauribe / fastq-to-vcf


From FASTQ to VCF

Juliana Acosta-Uribe. Based on pipelines developed by Khalid Mahmood for Melbourne Bioinformatics, University of Melbourne (2021); Daniel Edward Deatherage for the University of Texas (2022); Mohammed Khalfan for New York University; Derek Caetano-Anolles for the Broad Institute (2023); and the Griffith Lab.

Data:
Illumina HiSeq paired-end reads in FASTQ format from exomes.

Tools:
FastQC, MultiQC, Trimmomatic, BWA-MEM, Picard, GATK4, Bcftools and jigv

Reference data:
GATK hg38 bundle of reference files, downloaded from ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/

Tutorial contents

Section 1: Perform quality control on FASTQ files

1. Quality control with FastQC and MultiQC

Each .fastq file will be analyzed with FastQC, and then MultiQC will aggregate all the FastQC results into a single .html report.

SAMPLES="1 2 3"

for i in $SAMPLES
do 
fastqc ${i}_1.fastq.gz ${i}_2.fastq.gz -o .
done 
multiqc . --filename original_fastq

Check the original_fastq.html report that is generated (the --filename option above sets the report name)

2. Trim and remove adapters with Trimmomatic

for i in $SAMPLES
do 
java -jar trimmomatic-0.39.jar PE ./fastq/${i}_1.fastq.gz ./fastq/${i}_2.fastq.gz \
${i}_1.paired-trimmed.fastq.gz ${i}_1.unpaired-trimmed.fastq.gz \
${i}_2.paired-trimmed.fastq.gz ${i}_2.unpaired-trimmed.fastq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
done

TruSeq3-PE.fa lists the adapter sequences that were used in your sequencing run. The TruSeq3-PE.fa file must be in the directory where you run this command.
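
You can quickly list the adapter sequences the file contains (a small check, assuming the standard Trimmomatic adapter file):

grep ">" TruSeq3-PE.fa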

After running Trimmomatic you will have four new files per sample: ${i}_1.paired-trimmed.fastq.gz, ${i}_1.unpaired-trimmed.fastq.gz, ${i}_2.paired-trimmed.fastq.gz and ${i}_2.unpaired-trimmed.fastq.gz
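
To quantify how many read pairs survived trimming, you can count the records in the paired output; a small sketch:

for i in $SAMPLES
do
# each FASTQ record spans 4 lines, so lines/4 = reads
n=$(zcat ${i}_1.paired-trimmed.fastq.gz | wc -l)
echo "${i}: $((n / 4)) surviving read pairs"
done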

3. Check Fastq files again

for i in $SAMPLES
do 
fastqc ${i}_1.paired-trimmed.fastq.gz ${i}_1.unpaired-trimmed.fastq.gz ${i}_2.paired-trimmed.fastq.gz ${i}_2.unpaired-trimmed.fastq.gz -o .
done 

multiqc . --filename pair_trimmed

Section 2: Map raw reads to the reference genome

1. Preparation and data import

Prepare your FASTQ files and trim adapters if necessary (Section 1). Then download the reference data.
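
A minimal sketch for fetching the files used later in this tutorial (assuming wget is installed and the anonymous FTP mirror above is still reachable; the file names are taken from the commands below):

mkdir -p hg38_bundle && cd hg38_bundle
for f in Homo_sapiens_assembly38.fasta.gz \
         dbsnp_146.hg38.vcf.gz \
         Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
do
wget "ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/${f}"
done
cd ..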

2. Align reads to the genome with BWA-MEM

A. Create the BWA index files

bwa index hg38_bundle/Homo_sapiens_assembly38.fasta.gz

This will generate five additional files: <reference.fasta.gz.amb>, <reference.fasta.gz.ann>, <reference.fasta.gz.bwt>, <reference.fasta.gz.pac> and <reference.fasta.gz.sa>.
Note: If the reference is greater than 2 GB, you need to specify a different algorithm when building the BWA index, as follows: bwa index -a bwtsw <reference.fasta>

B. Align your FASTQ files

for i in $SAMPLES
do 
bwa mem -M -t 20 -R "@RG\tID:GNA_${i}\tSM:${i}\tPL:ILLUMINA" \
./hg38_bundle/Homo_sapiens_assembly38.fasta.gz \
./fastq/paired_trimmed/${i}_1.paired-trimmed.fastq.gz \
./fastq/paired_trimmed/${i}_2.paired-trimmed.fastq.gz | \
samtools view -b -h -o ${i}.bam 
done

There are two parts to this command. The first part uses BWA to perform the alignment; the second part takes the output from BWA and uses Samtools to convert it to the BAM format.

BWA flags:
mem Used when the query sequences are longer than 70 bp
-M Mark split reads as secondary; required for GATK variant calling
-t Number of threads (multi-threading mode)
-R <readgroup_info> Provide the read group as a string. The read group information is key for downstream GATK functionality; GATK will not work without a read group tag.

Samtools flags:
-b, --bam Output in the BAM format
-h, --with-header Include the header in the output
-o FILE, --output FILE Output file name

At the end of this step you should have UNSORTED .bam files

You can check that the .bam files have been properly annotated with the read groups:

samtools view -H 1.bam | grep "^@RG"
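
If you need to build the read-group string yourself, the flowcell and lane can usually be recovered from the read names. A minimal sketch, assuming standard Illumina read names of the form @instrument:run:flowcell:lane:... (sample 1 used as an example):

header=$(zcat ./fastq/paired_trimmed/1_1.paired-trimmed.fastq.gz | head -n 1)
flowcell=$(echo "$header" | cut -d: -f3)
lane=$(echo "$header" | cut -d: -f4)
# pass this string to bwa mem -R
echo "@RG\tID:${flowcell}.${lane}\tSM:1\tPL:ILLUMINA"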

Section 3: Prepare analysis-ready reads

1. Sort SAM/BAM

The alignment file <name>.bam is not sorted. Before proceeding, we should sort the BAM file using Picard.

for i in $SAMPLES
do
java -jar picard.jar SortSam \
I=${i}.bam  \
O=${i}.sorted.bam  \
CREATE_INDEX=True \
SORT_ORDER=coordinate
done   

The above command will create a coordinate-sorted BAM file and an index <name>.bai file.

!!! Alignment statistics Now that we have a sorted BAM file, we can generate some useful statistics using the samtools flagstat command. More details are available here.

samtools flagstat ${i}.sorted.bam
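
To run it for every sample and keep the output, wrap it in the same loop (the ${i}.flagstat.txt names are just an example):

for i in $SAMPLES
do
samtools flagstat ${i}.sorted.bam > ${i}.flagstat.txt
done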

2. Mark duplicate reads

The aim of this step is to locate and tag duplicate reads in the BAM file.
These duplicate reads are not informative and cannot be considered as evidence for or against a putative variant. For example, duplicates can arise during sample preparation e.g. library construction using PCR. Without this step, you risk having over-representation in your sequence of areas preferentially amplified during PCR. Duplicate reads can also result from a single amplification cluster, incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are referred to as optical duplicates.
For more details go to MarkDuplicates.

for i in $SAMPLES
do
java -jar picard.jar MarkDuplicates \
I=${i}.sorted.bam   \
O=${i}.sorted.dup.bam  \
M=marked_dup_metrics_${i}.txt
done

Note that this step does not remove the duplicate reads, but rather flags them as such in the read’s SAM record. Downstream GATK tools will ignore reads flagged as duplicates by default.
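
To check how many reads were flagged, you can count records carrying the duplicate flag (0x400) with samtools; a quick sketch:

for i in $SAMPLES
do
# -c counts records; -f 0x400 keeps only reads flagged as PCR/optical duplicates
samtools view -c -f 0x400 ${i}.sorted.dup.bam
done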

3. Base quality recalibration

The last pre-processing step for mapped reads is base quality score recalibration (BQSR). Sequencing machines make systematic errors when estimating the accuracy of each base call; these errors can have various sources, ranging from technical machine errors to variability in the sequencing chemistry. BQSR is a two-step process that applies machine learning to model these errors empirically and adjust the base quality scores accordingly. This gives more accurate base qualities, which in turn improves the accuracy of the variant calls. More details here.

Get GATK running:

git clone https://github.com/broadinstitute/gatk.git
cd gatk/
./gradlew 

If you want to build a local GATK jar you can run ./gradlew localJar. More info
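
Once the build finishes, a quick sanity check of the wrapper (run from the gatk/ directory, or add it to your PATH):

./gatk --list      # prints the available tools
./gatk HaplotypeCaller --help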

Step 1 - Build the model

for i in $SAMPLES
do
gatk --java-options "-Xmx7g" BaseRecalibrator \
-I ${i}.sorted.dup.bam \
-R reference/Homo_sapiens_assembly38.fasta \
--known-sites reference/dbsnp_146.hg38.vcf.gz \
--known-sites reference/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
-O ${i}.recal_data.table
done

Step 2 - Apply the model to adjust the base quality scores

for i in $SAMPLES 
do
gatk ApplyBQSR \
-I ${i}.sorted.dup.bam \
-R reference/Homo_sapiens_assembly38.fasta \
--bqsr-recal-file ${i}.recal_data.table \
-O ${i}.sorted.dup.bqsr.bam 
done

We now have a pre-processed BAM file sample.sorted.dup.bqsr.bam ready for variant calling.

But before we proceed, let's take a detour and generate some summary statistics and QC for the alignment data.

Get BQSR statistics

for i in $SAMPLES
do
gatk BaseRecalibrator \
-I ${i}.sorted.dup.bqsr.bam \
-R reference/Homo_sapiens_assembly38.fasta \
--known-sites reference/dbsnp_146.hg38.vcf.gz \
--known-sites reference/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
-O ${i}.post-bqsr.recal_data.table

gatk AnalyzeCovariates \
-before ${i}.recal_data.table \
-after ${i}.post-bqsr.recal_data.table \
-plots ${i}.AnalyzeCovariates.pdf

done

This will generate a .pdf file. More info here

Get BAM statistics and QC:
The commands below use Picard to generate QC metrics, followed by MultiQC to aggregate the data into an HTML report.

for i in $SAMPLES
do
java -jar picard.jar CollectMultipleMetrics \
R=reference/Homo_sapiens_assembly38.fasta \
I=${i}.sorted.dup.bqsr.bam \
O=${i}.sorted.dup.bqsr.CollectMultipleMetrics
done
multiqc . --filename post-bqsr


Section 4: Variant calling

The next step in the GATK best practices workflow is to proceed with the variant calling.

There are a couple of workflows to call variants using GATK4. Here we will follow the Genomic Variant Call Format (GVCF) workflow, which is better suited for scalable variant calling, i.e. it allows incremental addition of samples for joint genotyping.

1. Apply HaplotypeCaller

HaplotypeCaller is the focal tool within GATK4 for calling germline SNVs and small indels simultaneously, using local de-novo assembly of haplotype regions.

!!! Algorithm Briefly, the HaplotypeCaller works by: 1. Identifying active regions, i.e. regions with evidence of variation. 2. Reassembling the active regions. 3. Realigning active-region reads against the assembled regions to identify alleles. More details about the HaplotypeCaller can be found here.

EXOME_TARGETS=file # your exome capture targets (kept here as a placeholder)
# For exome data, add "-L $EXOME_TARGETS \" as an extra line in the command below
for i in $SAMPLES
do
gatk HaplotypeCaller \
-I ${i}.sorted.dup.bqsr.bam \
-R reference/Homo_sapiens_assembly38.fasta \
-ERC GVCF \
-O ${i}.g.vcf.gz
done

The output of this step is a GVCF file, whose format is similar to a VCF file. The key difference is that the GVCF contains a record for every sequenced genomic coordinate. The --emit-ref-confidence or -ERC parameter selects a method for summarizing the confidence that a genomic site is homozygous-reference. The -ERC GVCF option is the more efficient choice and is recommended for large cohorts, making the workflow more scalable.
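
To see what the extra records look like, peek at the first few non-header lines of one GVCF; note the <NON_REF> allele and the END= annotations that summarize homozygous-reference blocks:

zcat 1.g.vcf.gz | grep -v "^#" | head -n 5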

2. Apply CombineGVCFs

The CombineGVCFs tool merges multiple single-sample GVCF files into a single multi-sample GVCF file.

Create a text file listing all the GVCFs you want to combine:

ls *.g.vcf.gz > gvcfs.list

Merge the GVCFs into a single file:

gatk CombineGVCFs \
-R reference/Homo_sapiens_assembly38.fasta \
-V gvcfs.list \
-O merged_gvcf.g.vcf.gz
# For exome data, add "-L $EXOME_TARGETS \" before the -O line

Now that we have a merged GVCF file, we are ready to perform genotyping.

3. Apply GenotypeGVCFs

gatk GenotypeGVCFs \
-R reference/Homo_sapiens_assembly38.fasta \
-V merged_gvcf.g.vcf.gz \
-O cohort.vcf.gz
# For exome data, add "-L $EXOME_TARGETS \" before the -O line
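
A quick way to count the variants in the final callset, using bcftools from the tool list above (plain zcat/grep works too):

bcftools stats cohort.vcf.gz | grep "number of records:"
# or, without bcftools:
zcat cohort.vcf.gz | grep -vc "^#"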

??? Information An alternative to CombineGVCFs is GenomicsDBImport, which is more efficient for large sample numbers and stores the content in a GenomicsDB data store. CombineGVCFs can be slow and inefficient for more than a few samples; a possible workaround is to split the task into interval regions such as chromosomes.
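
For reference, a minimal GenomicsDBImport sketch for a single chromosome (the workspace path and the chr20 interval are placeholders; GenotypeGVCFs can then read the workspace via -V gendb://genomicsdb_chr20):

gatk GenomicsDBImport \
-V 1.g.vcf.gz \
-V 2.g.vcf.gz \
-V 3.g.vcf.gz \
--genomicsdb-workspace-path genomicsdb_chr20 \
-L chr20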

Section 5: Filter and prepare analysis-ready variants

1. Variant Quality Score Recalibration

The raw VCF file from the previous step (cohort.vcf.gz) contains 10467 variants. Not all of these are real; the aim of this step is therefore to filter out artifacts and false-positive variants. The GATK Best Practices recommend filtering germline variant callsets with VQSR. A second filtering strategy, "hard filtering", is useful when the data cannot support VQSR or when an analysis requires manual filtering. More detail here

The Variant Quality Score Recalibration (VQSR) strategy is a two-step process: (1) the first step builds a model that describes how variant metrics or quality measures co-vary with the known variants in the training set; (2) the second step ranks each variant according to the target sensitivity cutoff and applies a filter expression.

Split into SNPs and INDELs

for i in SNP INDEL
do
gatk SelectVariants \
-R reference/Homo_sapiens_assembly38.fasta \
-V cohort.vcf.gz \
--select-type-to-include ${i} \
-O cohort.${i}.vcf.gz
done

You should end with two new files cohort.SNP.vcf.gz and cohort.INDEL.vcf.gz

Step 1: Build SNP model

gatk VariantRecalibrator \
-R reference/Homo_sapiens_assembly38.fasta \
-V cohort.SNP.vcf.gz \
-O cohort.SNP.recal \
-tranche 100.0 \
-tranche 99.9 \
-tranche 99.0 \
-tranche 90.0 \
--tranches-file cohort.SNP.tranches \
--trust-all-polymorphic \
--mode SNP \
--max-gaussians 4 \
--resource hapmap,known=false,training=true,truth=true,prior=15.0:reference/hapmap_3.3.hg38.vcf.gz \
--resource omni,known=false,training=true,truth=true,prior=12.0:reference/1000G_omni2.5.hg38.vcf.gz \
--resource 1000G,known=false,training=true,truth=false,prior=10.0:reference/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
--resource dbsnp,known=true,training=false,truth=false,prior=2.0:reference/dbsnp_138.hg38.vcf.gz \
-an QD \
-an MQ \
-an MQRankSum \
-an ReadPosRankSum \
-an FS \
-an SOR \
--rscript-file recalibrate_SNP_plots.R

Note: These parameters are for exome data.

Step 2: Apply recalibration to SNPs

gatk ApplyVQSR \
-R reference/Homo_sapiens_assembly38.fasta \
-V cohort.SNP.vcf.gz \
-O cohort.SNP.vqsr.vcf.gz \
--mode SNP \
--truth-sensitivity-filter-level 99.0 \
--tranches-file cohort.SNP.tranches \
--recal-file cohort.SNP.recal

Step 3: Build Indel recalibration model

gatk VariantRecalibrator \
-R reference/Homo_sapiens_assembly38.fasta \
-V cohort.INDEL.vcf.gz \
-O cohort.INDEL.recal \
-tranche 100.0 \
-tranche 99.9 \
-tranche 99.0 \
-tranche 90.0 \
--tranches-file cohort.INDEL.tranches \
--mode INDEL \
--max-gaussians 4 \
--resource mills,known=false,training=true,truth=true,prior=12.0:reference/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--resource dbsnp,known=true,training=false,truth=false,prior=2.0:reference/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
-an QD \
-an FS \
-an SOR \
-an MQRankSum \
-an ReadPosRankSum \
--rscript-file recalibrate_INDEL_plots.R

Step 4: Apply recalibration to INDELs

gatk ApplyVQSR \
-R reference/Homo_sapiens_assembly38.fasta \
-V cohort.INDEL.vcf.gz \
-O cohort.INDEL.vqsr.vcf.gz \
--mode INDEL \
--truth-sensitivity-filter-level 99.0 \
--tranches-file cohort.INDEL.tranches \
--recal-file cohort.INDEL.recal
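
ApplyVQSR does not remove variants; it annotates the FILTER column. A quick sketch for counting how many variants passed in each file, using bcftools from the tool list above:

for i in SNP INDEL
do
# -H drops the header; -f PASS keeps only PASS records
echo "${i}: $(bcftools view -H -f PASS cohort.${i}.vqsr.vcf.gz | wc -l) PASS variants"
done

If a single callset is needed downstream, the two recalibrated files can be merged back together, e.g. with Picard MergeVcfs.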
