AmaliT / mcclintock

Meta-pipeline to identify transposable element insertions using next generation sequencing data

Introduction

Many methods have been developed to detect transposable element (TE) insertions from whole genome shotgun next-generation sequencing (NGS) data, each of which has different dependencies, run interfaces, and output formats. Here, we have developed a meta-pipeline to run five available methods for detecting TE insertions in NGS data, which generates output in the UCSC Browser extensible data (BED) format.

Software Methods

Software Dependencies

The pipeline must be run on a Unix-based system with the software dependencies listed per method below. The versions used to run this pipeline are indicated in parentheses; no guarantee is made that it will function with other versions.

How to run

###Installation

To install the software, first clone the repository:

git clone git@github.com:bergmanlab/mcclintock.git

Then cd into the project directory and run the script install.sh with no arguments:

cd mcclintock
sh install.sh

This will download and unpack all of the TE detection pipelines and check that the required dependencies are available in your path. Any missing dependencies will be reported; you must install them, or otherwise make them available in your path, before running the full pipeline.
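
A similar check can be run by hand before installing. The following is a minimal sketch and not the logic that install.sh itself uses: the function name `check_deps` is hypothetical, and the program names passed to it are placeholders to be replaced with the dependencies of the methods you intend to run.

```shell
#!/bin/sh
# Sketch only: report which of the named programs are missing from PATH.
# check_deps is a hypothetical helper, not part of mcclintock.
check_deps() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # Return a non-zero status when at least one program is missing.
    [ "$missing" -eq 0 ]
}
```

Calling `check_deps sh awk` (placeholder names) prints one line per program and returns a non-zero status if any of them cannot be found.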

###Running on a test dataset

A script is included to run the full pipeline on a test Illumina resequencing dataset from the yeast genome. To run it, change directory into the folder named test and run the script runtest.sh:

cd test
sh runtest.sh

This script will download the UCSC sacCer2 yeast reference genome, an annotation of TEs in the yeast reference genome from Carr, Bensasson and Bergman (2012), and a pair of fastq files from SRA, then run the full pipeline.

###Running the pipeline

The pipeline is invoked by running the mcclintock.sh script in the main project folder. This script takes the following six arguments, in the order described, and runs all five TE detection methods:

  • Argument 1: A reference genome sequence in fasta format.
  • Argument 2: The consensus sequences of the TEs for the species in fasta format.
  • Argument 3: The locations of known TEs in the reference genome in GFF 3 format. This must include a unique ID attribute for every entry.
  • Argument 4: A tab-delimited file with one entry per ID in the GFF file and two columns: the first containing the ID and the second containing the TE family it belongs to.
  • Argument 5: The absolute path to the first fastq file of a paired-end read pair. The file name must end in _1.fastq.
  • Argument 6: The absolute path to the second fastq file of a paired-end read pair. The file name must end in _2.fastq.
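
Some of the constraints listed above can be checked before starting a long run. The sketch below is illustrative only: `check_inputs` is a hypothetical helper that is not part of mcclintock, it assumes the GFF file stores the identifier as an `ID=` entry in a semicolon-separated ninth column, and every file name in the closing comment is a placeholder.

```shell
#!/bin/sh
# Sketch only: pre-flight checks on arguments 3 to 6.
# check_inputs is a hypothetical helper, not part of mcclintock.
check_inputs() {
    gff=$1; families=$2; fq1=$3; fq2=$4

    # Arguments 5 and 6: absolute paths ending in _1.fastq and _2.fastq.
    case $fq1 in /*_1.fastq) ;; *) echo "bad first fastq: $fq1"; return 1 ;; esac
    case $fq2 in /*_2.fastq) ;; *) echo "bad second fastq: $fq2"; return 1 ;; esac

    # Argument 4 must contain one row for every ID found in argument 3.
    awk -F '\t' '
        NR == FNR { known[$1] = 1; next }   # first file: ID<TAB>family
        /^#/ { next }                        # skip GFF comment lines
        {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^ID=/) {
                    id = substr(attrs[i], 4)
                    if (!(id in known)) { print "no family for ID: " id; bad = 1 }
                }
            }
        }
        END { exit bad + 0 }
    ' "$families" "$gff"
}

# Example invocation of the pipeline itself (all file names are placeholders):
# sh mcclintock.sh reference.fasta te_consensus.fasta reference_tes.gff \
#     te_families.tsv /data/sample_1.fastq /data/sample_2.fastq
```

The helper checks only the naming of the fastq paths and the ID-to-family mapping; it does not confirm that the files exist or that the fasta inputs are well formed.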

Data created during pre-processing will be stored in a folder in the main directory, named after the reference genome used, with an individual sub-directory for each sample.

###Output format

The output of the run scripts is a BED format file in which the 4th column contains the name of the TE and whether it is a novel insertion (new) or a TE shared with the reference (old). Each output file also includes a header line for use with the UCSC genome browser. The output files are found within the subdirectory for each specific method, in a folder named after the sample, with the file name formatted as sample_method.bed.
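
As an illustration of working with these files, the sketch below tallies novel and reference insertions in a single output file. It rests on assumptions the text above does not confirm: that the words `new` and `old` appear literally somewhere in the 4th column, that no TE name itself contains either word, and that the header line begins with `track` as UCSC track lines conventionally do. The function name `count_calls` is hypothetical.

```shell
#!/bin/sh
# Sketch only: tally new and old calls in one sample_method.bed file.
# count_calls is a hypothetical helper, not part of mcclintock.
count_calls() {
    awk '
        /^track/ || /^browser/ || /^#/ { next }   # skip header lines
        $4 ~ /new/ { new++; next }                # novel insertion
        $4 ~ /old/ { old++ }                      # shared with the reference
        END { printf "new\t%d\nold\t%d\n", new, old }
    ' "$1"
}
```

Running `count_calls sample_method.bed` (a placeholder path) prints two tab-separated lines, one with the count of new calls and one with the count of old calls.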

###Running individual TE detection methods

Each method folder contains one of the TE detection methods tested in the review. In addition to the standard software, each folder contains a file named runXXXX.sh. Running this file without arguments prints a description of the input files needed to execute the method. These arguments should be supplied after the script name, separated by spaces, as follows:

runXXXX.sh argument1 argument2 argument3 ...
