- Personal yeast variant calling as described in Johnson et al., 2021, eLife
- Additional info on steps 3, 5, and 6 is found on Data Carpentry
- Additional info on steps 3, 4, and 6 is found in the GATK best practices workflow
- Using Spark-enabled GATK tools on a local machine is under development (GATK best practices workflow)
- Demultiplex reads (already done by Illumina)
  - Sorts sequenced reads into a separate file for each sample in a sequencing run
- Trimming reads
  - Create a Bash script that iterates through the sample basenames (file names without the R1 or R2 suffix) and runs NGmerge on each pair:
NGmerge -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -o sample_merged.fastq.gz
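The loop over basenames could be written along the following lines. This is a minimal sketch, not the script used in the original workflow: it assumes the paired files sit in one directory and are named `<basename>_R1.fastq.gz` and `<basename>_R2.fastq.gz`, and the function name `run_ngmerge_all` and the `DRY_RUN` switch are illustrative. The NGmerge invocation itself is the one shown above, unchanged.

```shell
#!/usr/bin/env bash
# Sketch: run NGmerge once per sample basename.
# Assumption (hypothetical naming): <basename>_R1.fastq.gz / <basename>_R2.fastq.gz

run_ngmerge_all() {
    local fastq_dir="${1:-.}"   # directory holding the FASTQ files
    local r1 r2 base

    for r1 in "$fastq_dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue        # the glob matched nothing
        base="${r1%_R1.fastq.gz}"       # strip the R1 suffix to get the basename
        r2="${base}_R2.fastq.gz"

        if [ ! -e "$r2" ]; then
            echo "skipping ${base}: no R2 file found" >&2
            continue
        fi

        # With DRY_RUN=1 the command is printed instead of executed.
        if [ "${DRY_RUN:-0}" = "1" ]; then
            printf '%s\n' "NGmerge -1 $r1 -2 $r2 -o ${base}_merged.fastq.gz"
        else
            NGmerge -1 "$r1" -2 "$r2" -o "${base}_merged.fastq.gz"
        fi
    done
}
```

Running `DRY_RUN=1 run_ngmerge_all path/to/fastq` first prints the commands, so the pairing can be checked before any files are written.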
- MPI-based parallelized BWA-MEM alignment to the W303 genome (FASTQ->SAM)
  - Output is a tab-delimited text file with information on each individual read and its alignment to the genome
- gatk MarkDuplicatesSpark to mark duplicates and sort
  - MarkDuplicatesSpark uses Apache Spark to parallelize the process and take better advantage of all available resources
- samtools view (SAM->BAM)
  - Output is a compressed binary version of the SAM file. This reduces file size and allows indexing, which enables efficient random access to the data in the file.
- gatk BaseRecalibrator for base quality score recalibration (GATK best practices workflow)
- gatk ApplyBQSR (GATK best practices workflow)
- gatk HaplotypeCaller (GATK best practices workflow)
- VCFtools (vcf-merge) to merge VCFs
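The per-sample steps from alignment through merging can be sketched as a pair of Bash functions. This is an illustration under several assumptions, not the workflow's actual script:

- Plain `bwa mem` stands in for the MPI-based aligner, which these notes do not name.
- The SAM file is converted to BAM before duplicate marking, which may differ from the order used in practice.
- The known-sites VCF for BaseRecalibrator, all file names, the function names, and the `DRY_RUN` switch are hypothetical placeholders.
- The reference FASTA is assumed to already have a BWA index, a `.fai` index, and a `.dict` sequence dictionary.

```shell
#!/usr/bin/env bash
# Sketch of the per-sample steps; every file name below is a placeholder.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

process_sample() {
    local sample="$1"        # sample basename (prefix of the merged FASTQ)
    local ref="$2"           # reference FASTA, already indexed
    local known="$3"         # known-sites VCF for BQSR (hypothetical input)
    local threads="${4:-4}"  # threads for bwa and local Spark
    # GATK tools need a read group; bwa expands the literal \t sequences.
    local rg="@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA"

    # Align (FASTQ -> SAM), convert (SAM -> BAM), mark duplicates and sort,
    # recalibrate base qualities, then call variants.
    run bwa mem -t "$threads" -R "$rg" -o "${sample}.sam" "$ref" "${sample}_merged.fastq.gz" &&
    run samtools view -b -o "${sample}.bam" "${sample}.sam" &&
    run gatk MarkDuplicatesSpark -I "${sample}.bam" -O "${sample}.markdup.bam" \
        --spark-master "local[${threads}]" &&
    run gatk BaseRecalibrator -R "$ref" -I "${sample}.markdup.bam" \
        --known-sites "$known" -O "${sample}.recal.table" &&
    run gatk ApplyBQSR -R "$ref" -I "${sample}.markdup.bam" \
        --bqsr-recal-file "${sample}.recal.table" -O "${sample}.recal.bam" &&
    run gatk HaplotypeCaller -R "$ref" -I "${sample}.recal.bam" -O "${sample}.vcf.gz"
}

# Merge per-sample VCFs (bgzip-compressed and tabix-indexed) into one file.
merge_vcfs() {
    local out="$1"
    shift
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "vcf-merge $* | bgzip -c > $out"
    else
        vcf-merge "$@" | bgzip -c > "$out"
    fi
}
```

For example, `DRY_RUN=1 process_sample sampleA W303.fasta known_sites.vcf 8` prints the six commands for one sample without running them, and `merge_vcfs all_samples.vcf.gz sampleA.vcf.gz sampleB.vcf.gz` combines the per-sample calls.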