PattersonWeedLab / IWGC_annotation_pipeline

Current annotation pipeline of the International Weed Genomics Consortium

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Current annotation pipeline of the International Weed Genomics Consortium (IWGC)

Authors: Dr. Nathan D. Hall (Developer), Nicholas A. Johnson (Documentation), and Dr. Eric L. Patterson (P.I.)
GitHub Repo: IWGC_annotation_pipeline

Introduction

About

This repository was established to document the genome annotation methods used by the International Weed Genomics Consortium and in the article "Subtelomeric 5-enolpyruvylshikimate-3-phosphate synthase copy number variation confers glyphosate resistance in Eleusine indica" and to also provide a publicly available pipeline or framework for computational genome annotation. The current state of the pipeline works well for the Patterson Lab but will require a substantial time investment for tweaking of small things like directory paths of files or programs and subsequent troubleshooting during the initial setup. It should also be noted that for functional annotation the MetaCyc database requires a usage license (conditionally free for academic use) that will be required for you to obtain or you will have to omit these parts to use the pipeline. We hope at the very least this documentation will provide researchers with a reference for designing their own genome annotation pipeline. We may attempt to release a more distributable version in the future, but there are no set plans currently.

Summary of Structural Annotation

1. Handling Repeats

  • Repeat regions are annotated using RepeatModeler (for program versions see Dependencies).

  • Annotated repeat regions are masked with RepeatMasker to reduce the computational burden of further analysis.

  • bedtools is used to soft mask the genome using the output of RepeatMasker.

2. Map Isoseq Reads to Masked Genome

minimap2 alignment

  • IsoSeq reads are mapped to the repeat-masked genome using minimap2.

SAMtools SAM to BAM

  • SAMtools is used to convert the minimap2 alignment SAM file into a BAM file.

3. Collapse Isoforms

Cupcake:

  • Isoforms are collapsed with Cupcake.

4. Maker

  • Genome, collapsed cDNA from Cupcake, repeat libraries from RepeatModeler, and a protein FASTA from a close relative species are fed into MAKER.

5. Merge and Cleanup

  • Genes that produce proteins under 27 amino acids long are removed from further annotation with only the longest proteins from each gene and unique untranslated regions (UTRs) used for functional annotation.

Summary of Functional Annotation

1. Isoform Selection

  • Isoforms are selected using AGAT which selects the longest isoforms with longest intact open read frame.

  • Isoforms are then extracted from GFF3 format using using gffread

  • These predicted proteins are used for all downstream analysis.

2. Location Prediction

  • MultiLoc2 Workstation Edition (ML2) is used predict protein location within the cell. Briefly, ML2 uses a trained machine learning classifer to predict protein location. For more on this program see the excellent work by Blum et al. (2009).

3. InterPro Analysis

  • InterPro is run locally using iprscan5 version 5.47-82.0.

  • It uses several programs to predict InterPro Accessions which can be broken down into the following categories.

    1. Sites
      • Conserved Site
      • Active Site
      • Binding Site
      • PTM
    2. Repeats
    3. Domains
    4. Families
    5. Homologous SuperFamilies
  • For more information see release notes here.

  • IPRSCAN Results provide MetaCyc accessions but not descriptions. To obtain descriptions a custom python script was used to extract Pathway IDs and link them to their descriptions. MetaCyc is a proprietary database that is made conditionally available to academic researchers free of charge.

4. Homology Search

*This search is for the best hit in each database using MMSeqs2. Which is an extremley fast method of searching that can be up to 36 times faster than BLAST (Steinegger and Söding 2017).

UniRef50

  • UniRef50 is a clustered database that is regularly updated. See work by Suzek et al. (2015).
  • We use UniRef_50 version 2021_03 and downloaded representative plants from the same release to add descriptions.
    Cluster information and UniRef IDs are reported

NCBI

All available proteins from fully sequenced plants were downloaded May 2021 and put into a MMeqs2 database using NCBI Datasets and MMSeqs2 tool

5. Herbicide Interacting Genes

Herbicide interacting genes were identified from the literature and mapped to KEGG Ortholog. UniRef50 IDs were converted to KEGG Orthologs and cross matched with herbicide interacting genes. Matches were flagged and Annotated with HRAC and WSSA classifiers.

Output

The pipeline runs all analysis seperately, but a series of python scripts is used to collect, filter and and format the data. Results are reported in JSON format which is ideal for NOSQL databases and queries. Given the number of Null values or missing values this format is ideal. The results are gappy because not every analysis returns a result for each isoform. This is expected. Finally, GFF3 is produced that contains Ontology Terms, and Notes.

fullsize poster

Small Poster for Reference

thumbnail


Installation

Dependencies

The below program version numbers are the exact versions used by the current IWGC_annotation_pipeline. Different versions may also be compatible but no other versions have been verified to be compatible with this pipeline by the PattersonWeedLab.

Structural Annotation Dependencies:

Functional Annotation Dependencies:


Usage

Genomes are first structurally annotated in terms of repeat regions and gene models. Structural annotation is then used to functionally annotate gene models.

Structural Annotation Usage

1. Handle Repeats

RepeatModeler:

First build a database for RepeatModeler.

BuildDatabase -name Species_Name genome.fasta

Next use repeat database from last step to model repeats. -pa 1 uses 4 threads, -pa 25 uses 100 threads. Adjust accordingly.

RepeatModeler -database Species_Name -pa 25 -LTRStruct

RepeatMasker:

Output from RepeatModeler used as input for RepeatMasker.

RepeatMasker -gff -a -pa 20 -u -lib RepeatModeler_output-families.fa genome.fasta

May need to load module h5py if the above fails.

bedtools:

Output from RepeatMasker used to soft mask repeat regions in genome with bedtools.

bedtools maskfasta -fi genome.fasta -bed RepeatMasker_output.gff -soft -fo genome.softmasked.fasta

2, Map Isoseq Reads to Masked Genome

minimap2 alignment

minimap2 -a -x splice -H -t 100 -O6,24 -B4 genome.softmasked.fasta isoseq.fastq -o output.sam

SAMtools

Prepare algnment for CupCake with samtools.

samtools view -b -T original_genome.fasta minimap2_alignment.sam > minimap2_alignment.bam

3. Collapse Isoforms with Cupcake

CupCake:

Collapse isoforms with CupCake.

python path/to/collapse_isoforms_by_sam.py --input ISOseq.fq --fq -b minimap2.sorted.bam -o Species_Name

4. MAKER (parallel run)

This is our workaround to parallelize MAKER.

Create a list of all FASTA headers to loop through

grep '>' genome.fasta | sed 's/>//g' > chrs.list

maker_split_and_run.sh

For intital setup, ensure all paths to MAKER required executables are accurate on your system in the below section.
#-----Location of Executables Used by MAKER/EVALUATOR

Edit maker_split_and_run.sh for your input files as shown below.

#!/bin/bash 
##==============================================
## files that get modified
genome="Species_Name.softmasked.fasta"
cupcake="Species_Name.collapsed.gff"
##==============================================
## files that don't get modified
repeat_lib="Species_Name-families.fa"
proteins="Related_Species_Name.proteins.fa"

Run maker_split_and_run.sh to separate the genome into different directories by chromosome/scaffold/contig/number of headers in fasta and setup a MAKER.ctl for each.

bash maker_split_and_run.sh

maker_run.sh

Edit the path /path/to/Species_Name/Maker in maker_run.sh shown below to your MAKER working directory.

#!/bin/bash
module load gffread
module load MakerP

while read i
do

cd /path/to/Species_Name/Maker/${i}

nohup maker maker_opts.ctl maker_bopts.ctl maker_exe.ctl &

done < chrs.list

Use maker_run.sh to submit MAKER runs in each directory setup by maker_split_and_run.sh.

bash maker_run.sh

5. Merge and Cleanup

GFF merge

Edit Species_Name before running.

global_gff="Species_Name.gff"; printf "##gff-version 3\n" >${global_gff} ;tail -n +2 */*maker.output/*_datastore/*/*/*/*gff | awk -F"\t" 'NF==9 && ($3=="gene" || $3 =="CDS" || $3 =="mRNA" || $3 =="exon" || $3 == "five_prime_UTR" || $3 == "three_prime_UTR" || $3 =="tRNA" )' >>${global_gff}

Make first protein database to guide filtering:

gffread -S -y Species_Name.prots -g Species_Name.genome.fa Species_Name.gff 
samtools faidx Species_Name.prots

Pick smallest protein.

sort -r -k2 -n Species_Name.prots.fai | cut -f -2 | grep est2genome
sort -k2 -n Species_Name.prots.fai | cut -f -2 | grep protein2genome | awk '$2 < 27 {print $1}' | sed 's/-mRNA-[0-9]*//' | sort | uniq > exclude_lt27.list`

Filter out the smallest proteins:

Needs pip install gff3.

printf "##gff-version 3\n" > Species_Name.ge27.gff ; python /path/to/gff_filter.py -e exclude_lt27.list -g Species_Name.gff >> Species_Name.ge27.gff

Make CDSs and UTRs unique:

python /path/to/gffPrepare/validate_gff.py --gff Species_Name.ge27.gff > Species_Name.ge27_uniq.gff

Rename:

Used to rename gff to the Patterson Lab/IWGC common naming convention. Eleusine indica = EleIn. Eleusine indica glyphosate-resistant = EleInR. Species_name = SpeNa.

python /path/to/renameGff.py -g Species_Name.ge27_uniq.gff -t SpeNa > SpeNa.v2.gff

Sort GFF:

/path/to/gff3sort/gff3sort.pl --precise --chr_order natural SpeNa.v2.gff > SpeNa.v2.sorted.gff`

Functional Annotation Usage

Unlike with structural annotation, the functional annotation pipeline is completely contained within a few custom scripts. Config variables and functions are imported from pipeline_config.sh. ml2_one_by_one.sh is referenced by Functional_Annotation_v4.sh.

Functional annotation script:

Once configured, the below command is the only one to run. Get your species' ID from NCBI.

/path/to/Functional_Annotation_v4.sh -G Species_Name.genome.fa -g SpeNa.v2.sorted.gff -s SpeNa -t 102 -i NCBI_Species_Name_ID -n Scientific_Name -c Common_Name

About

Current annotation pipeline of the International Weed Genomics Consortium

License:BSD 2-Clause "Simplified" License


Languages

Language:Shell 73.9%Language:Python 26.1%