PlantDr430 / TransposableELMT

Wrapper script to identify TEs and mask genome

TransposableELMT

Wrapper script for TE identification and genome masking

Summary

This script follows some of the main procedures set forth in Coghlan, A., Tsai, I.J., Berriman, M. 2018. Creation of a comprehensive repeat library for a newly sequenced parasitic worm genome. Protocol Exchange. DOI: 10.1038/protex.2018.054

This is a simple wrapper script that runs several repeat-finding programs: RepeatModeler, TransposonPSI, LTR_Finder, and LTRharvest. LTRharvest is coupled with LTRdigest and an hmmsearch against Pfam domains associated with LTRs to limit false-positive identifications. The constructed libraries are run through RepeatClassifier to classify the LTRs. USEARCH is then run on the concatenated library to remove redundant LTRs at an 80% identity cutoff (adjustable with -id). The non-redundant library is then used with RepeatMasker to soft-mask the assembly.
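For readers unfamiliar with the term, soft masking means that bases inside annotated repeats are written in lowercase while the rest of the sequence stays uppercase. The snippet below is only a minimal illustration of that idea, not how the script or RepeatMasker implements it; the function name soft_mask and the interval format (0-based, half-open) are assumptions made for this example.

```python
def soft_mask(sequence, repeat_intervals):
    """Return `sequence` with bases inside the given intervals lowercased.

    `repeat_intervals` is an iterable of 0-based, half-open (start, end)
    pairs. This format is an assumption made for the illustration.
    """
    masked = list(sequence.upper())
    for start, end in repeat_intervals:
        # Clamp to the sequence bounds so out-of-range intervals are harmless.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = masked[i].lower()
    return "".join(masked)


if __name__ == "__main__":
    # Two hypothetical repeat hits on a toy sequence.
    print(soft_mask("ACGTACGTACGT", [(2, 5), (8, 10)]))  # ACgtaCGTacGT
```

Because only the case changes, downstream tools that ignore case still see the full sequence, while tools that honor soft masking can skip the lowercase regions.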

Currently, all programs are run with their default settings, and few of those settings can be changed through flags. Additional options may be added in future versions if there is a need.

It is recommended to provide additional curated libraries, such as those from RepBase. Select an appropriate taxonomic level, download the file in FASTA format, and pass it with the -rb flag on the command line.

Dependencies

Basic programs

  1. Python 3
  2. bedtools
  3. samtools
  4. perl
  5. HMMER

TE programs

  1. RepeatModeler + RepeatClassifier + BuildDatabase
  2. RepeatMasker
  3. LTR_Finder
  4. GenomeTools
  5. TransposonPSI

Additional

  1. USEARCH
  2. cnv_ltrfinder2gff.pl

Dependencies should be callable from the command line. If they are not, add the parent directory of each executable to $PATH. If all else fails, the paths to the executables can be passed to the script through flags.
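A quick way to see which dependencies are already reachable through $PATH is to look each one up with shutil.which. This is a minimal sketch: the executable names in ASSUMED_EXECUTABLES are guesses at how the tools are typically installed (for example gt for GenomeTools and hmmsearch for HMMER) and may differ on your system.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
ASSUMED_EXECUTABLES = [
    "bedtools", "samtools", "perl", "hmmsearch",
    "RepeatModeler", "RepeatClassifier", "BuildDatabase", "RepeatMasker",
    "ltr_finder", "gt", "transposonPSI.pl", "usearch",
    "cnv_ltrfinder2gff.pl",
]


def find_missing(executables):
    """Return the names that cannot be found on $PATH."""
    return [name for name in executables if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing(ASSUMED_EXECUTABLES)
    if missing:
        print("Not found on $PATH:", ", ".join(missing))
    else:
        print("All assumed executables were found on $PATH.")
```

Anything reported as missing can either be added to $PATH or supplied through the corresponding path flag listed under Usage.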

Usage

usage: ./TransposableELMT.py [options] -in genome_assembly.fasta -o output_basename

optional arguments:
  -h, --help                  show this help message and exit
  -in , --input               Genome assembly in FASTA format
  -o , --out                  Basename of output directory and file
  --cpus                      Number of cores to use [default: 2]
  -id , --identity            Cutoff value for percent identity in USEARCH [default: 0.80]
  -en , --engine              Search engine used in RepeatModeler [abblast|wublast|ncbi] [default: ncbi]
  -rb , --repbase_lib         RepBase library of TEs or additional curated library in FASTA format
  -rl , --repeatmodeler_lib   Pre-computed RepeatModeler library
  --hmms                      Path to directory of TE pfam domain files in HMMER3 format [Default: TransposableELMT/te_hmms]
  --REPEATMODELER_PATH        Path to RepeatModeler exe if not set in $PATH
  --REPEATMASKER_PATH         Path to RepeatMasker exe if not set in $PATH
  --BUILDDATABASE_PATH        Path to BuildDatabase exe if not set in $PATH
  --REPEATCLASSIFIER_PATH     Path to RepeatClassifier exe if not set in $PATH
  --LTRFINDER_PATH            Path to LTR_Finder exe if not set in $PATH
  --GENOMETOOLS_PATH          Path to genometools exe if not set in $PATH
  --USEARCH_PATH              Path to USEARCH exe if not set in $PATH
  --TRANSPOSONPSI_PATH        Path to transposonPSI.pl if not set in $PATH
  --CNV_LTRFINDER2GFF_PATH    Path to cnv_ltrfinder2gff.pl if not set in $PATH
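As an illustration of how the flags above fit together, the sketch below assembles an invocation as an argument list that could be handed to subprocess.run. The helper build_command and the file names genome_assembly.fasta, my_run, and repbase_fungi.fasta are hypothetical; only the flags themselves come from the usage text.

```python
import shlex


def build_command(genome, out_base, cpus=2, identity=0.80, repbase_lib=None):
    """Assemble a TransposableELMT.py call from the documented flags."""
    cmd = [
        "./TransposableELMT.py",
        "-in", genome,
        "-o", out_base,
        "--cpus", str(cpus),
        "-id", str(identity),
    ]
    # -rb is optional: a RepBase or other curated library in FASTA format.
    if repbase_lib is not None:
        cmd += ["-rb", repbase_lib]
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "genome_assembly.fasta", "my_run", cpus=8,
        repbase_lib="repbase_fungi.fasta",  # hypothetical file name
    )
    # Print the shell-quoted form; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the call as a list rather than a single string avoids quoting problems when paths contain spaces.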

Output files

  1. Soft-masked genome assembly in FASTA format
  2. RepeatMasker Table file
  3. RepeatMasker Out file
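One simple sanity check on the soft-masked assembly is to count what fraction of its bases are lowercase. The sketch below assumes the standard soft-masking convention (masked bases in lowercase); the function name masked_fraction and the file name in the usage note are hypothetical, and the RepeatMasker Table file remains the authoritative summary.

```python
def masked_fraction(fasta_lines):
    """Return the fraction of sequence characters that are lowercase.

    `fasta_lines` is any iterable of FASTA lines, such as an open file handle.
    Header lines (starting with '>') and blank lines are skipped.
    """
    total = 0
    masked = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


if __name__ == "__main__":
    toy = [">contig1", "ACgtAC", "gt"]
    print(f"{masked_fraction(toy):.1%} of bases are soft-masked")
```

For a real run, the same function can be applied to the output file, for example with open("my_run.masked.fasta") as handle: masked_fraction(handle), where the file name is a placeholder for whatever the script writes.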

License: MIT License