YeastVC

Personal yeast variant calling, as described in Johnson et al., 2021, eLife.
Additional information on steps 3, 5, and 6 can be found on DataCarpentry.
Additional information on steps 3, 4, and 6 can be found in the GATK best practices workflow.
Using Spark-enabled GATK tools on a local machine is still under development; see the GATK best practices workflow.

Steps

1. Trimming reads
   - Create a Bash script that iterates through the sample basenames (file names without the R1 or R2 suffix) and runs NGmerge on each pair:
     NGmerge -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -o sample_merged.fastq.gz
2. MPI-based parallelized BWA-MEM alignment to the W303 genome (FASTQ -> SAM)
   - The output is a tab-delimited text file with one record per read, describing that read and its alignment to the genome.
3. gatk MarkDuplicatesSpark to mark duplicates and sort
   - MarkDuplicatesSpark uses Apache Spark to parallelize the work and make better use of all available resources.
4. samtools view (SAM -> BAM)
   - The output is a compressed binary version of the SAM file. It is smaller and can be indexed, which enables efficient random access to the data in the file.
5. gatk base quality score recalibration (see the GATK best practices workflow)
6. gatk ApplyBQSR (see the GATK best practices workflow)
7. gatk HaplotypeCaller (see the GATK best practices workflow)
8. VCFtools to merge the VCFs
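
The read-trimming step above calls for a Bash script that loops over sample basenames. A minimal sketch of such a loop might look like the following. It assumes the files are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz and sit in one directory; the function name trim_all and the DRY_RUN variable are hypothetical helpers for illustration, not part of NGmerge. The NGmerge command itself is copied unchanged from the step description.

```shell
#!/usr/bin/env bash
# Sketch: run NGmerge once per read pair found in a directory.
# Assumption: files are named <sample>_R1.fastq.gz / <sample>_R2.fastq.gz.
# trim_all and DRY_RUN are illustrative names, not part of NGmerge.

trim_all() {
  fastq_dir=$1
  for r1 in "$fastq_dir"/*_R1.fastq.gz; do
    # If nothing matched, the glob stays literal; skip it.
    [ -e "$r1" ] || continue

    # Basename = path with the _R1.fastq.gz suffix removed.
    base=${r1%_R1.fastq.gz}
    r2=${base}_R2.fastq.gz
    merged=${base}_merged.fastq.gz

    # Skip samples whose mate file is missing.
    if [ ! -e "$r2" ]; then
      echo "skipping $base: no R2 file" >&2
      continue
    fi

    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Dry run (the default here): print the command instead of running it.
      echo "NGmerge -1 $r1 -2 $r2 -o $merged"
    else
      NGmerge -1 "$r1" -2 "$r2" -o "$merged"
    fi
  done
}
```

With the function loaded, `trim_all ./fastq` only prints the commands it would run, which makes it easy to check the file pairing first; `DRY_RUN=0 trim_all ./fastq` actually invokes NGmerge. The directory name ./fastq is a placeholder.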