cyversewarwick / apples

The APPLES 🍏 software package is a set of tools to analyse promoter sequences on a genome-wide scale.

Home Page:http://www2.warwick.ac.uk/fac/sci/dcs/people/sascha_ott/tools_and_software/apples

APPLES (Analysis of Plant Promoter-Linked Elements)

APPLES is a set of tools to analyse promoter sequences on a genome-wide scale. In this CyVerse-compatible version, two main modules are provided:

  • APPLES_rbh: Find Orthologs as Reciprocal Best Hits
  • APPLES_conservation: Find Non-Coding Conserved Regions

In addition, the following tools are also exposed to the user:

  • APPLES_utr: Extract sequences based on FASTA and GFF3 files

The following diagram illustrates the structure of these modules:

APPLES workflow

Background

The original APPLES package is described at this address

Publications

  • Nathaniel J. Davies, Peter Krusche, Eran Tauber and Sascha Ott, Analysis of 5’ gene regions reveals extraordinary conservation of novel non-coding sequences in a wide range of animals, BMC Evolutionary Biology, 2015, doi: 10.1186/s12862-015-0499-6

  • Laura Baxter, Aleksey Jironkin, Richard Hickman, Jay Moore, Christopher Barrington, Peter Krusche, Nigel P. Dyer, Vicky Buchanan-Wollaston, Alexander Tiskin, Jim Beynon, Katherine Denby, and Sascha Ott, Conserved Noncoding Sequences Highlight Shared Components of Regulatory Networks in Dicotyledonous Plants, Plant Cell, 2012, doi:10.1105/tpc.112.103010

Modules

APPLES_conservation_multiple

Inputs

🐳 Species String This is a string of Species names separated by ",".

  • Note that there is no "," behind the last species;
  • The first species is the central species;

Example: Species_1,Species_2,Species_3
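
The rules above can be checked before submitting a job. The following is a minimal sketch, not part of APPLES itself; the function name `parse_species_string` is hypothetical and only illustrates the expected format.

```python
def parse_species_string(species_string):
    """Split a comma-separated species string into (central, others)."""
    names = [name.strip() for name in species_string.split(",")]
    # An empty entry means a leading, trailing, or doubled comma.
    if any(name == "" for name in names):
        raise ValueError("empty species name; check for a trailing ','")
    # By convention, the first species is the central species.
    return names[0], names[1:]


central, others = parse_species_string("Species_1,Species_2,Species_3")
print("central species:", central)
print("other species:", others)
```

A string such as `Species_1,Species_2,` is rejected by this sketch because of the comma after the last species.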

🐳 Sequence Database Folder

With Species_1 being the central species, you will have the following folder structure:

<input_folder>
	+-- Species_1
	|   +-- PlantA.fa
	|   +-- PlantA.bed
	|   +-- PlantA_utr5.bed
	|   +-- PlantA_utr3.bed
	+-- Species_2
	|   +-- PlantA.fa
	|   +-- PlantA.bed
	|   +-- PlantA_utr5.bed
	|   +-- PlantA_utr3.bed
	|   +-- rbhSearch_result.txt
	+-- Species_3
	|   +-- PlantA.fa
	|   +-- PlantA.bed
	|   +-- PlantA_utr5.bed
	|   +-- PlantA_utr3.bed
	|   +-- rbhSearch_result.txt
	.
	.

See /cyverseZone/home/shared/cyverseuk/apples_testdata/apples_conservation_multiple/app_short for an example.
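
A quick way to confirm that a local folder follows this layout is sketched below. It is an illustration only: the function name `missing_inputs` is hypothetical, and treating the file suffixes from the example tree (`.fa`, `.bed`, `_utr5.bed`, `_utr3.bed`) as required in every species folder is an assumption of the sketch rather than a documented rule.

```python
from pathlib import Path

# Suffixes taken from the example tree; treating them as mandatory
# for every species folder is an assumption of this sketch.
UTR_SUFFIXES = ("_utr5.bed", "_utr3.bed")


def missing_inputs(input_folder, species):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for index, name in enumerate(species):
        folder = Path(input_folder) / name
        if not folder.is_dir():
            problems.append(f"{name}: folder not found")
            continue
        files = [p.name for p in folder.iterdir() if p.is_file()]
        if not any(f.endswith(".fa") for f in files):
            problems.append(f"{name}: no file ending in .fa")
        # The gene BED file must not be one of the two UTR BED files.
        if not any(f.endswith(".bed") and not f.endswith(UTR_SUFFIXES) for f in files):
            problems.append(f"{name}: no gene .bed file")
        for suffix in UTR_SUFFIXES:
            if not any(f.endswith(suffix) for f in files):
                problems.append(f"{name}: no file ending in {suffix}")
        # Every species except the first (central) one needs the orthologs file.
        if index > 0 and "rbhSearch_result.txt" not in files:
            problems.append(f"{name}: rbhSearch_result.txt missing")
    return problems


# Example call; "input_folder" is a placeholder path.
for problem in missing_inputs("input_folder", ["Species_1", "Species_2", "Species_3"]):
    print(problem)
```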

Checklist

Please check the following in order to get correct results from the module:

✅ Apart from the main species (e.g. Species_1 in our example), every other species must have an rbhSearch_result.txt file which annotates the orthologs between itself and the main species. This file needs to have a total of 4 columns (tab-separated):

  • Column 1: Species 1's protein ID;
  • Column 2: Species 2's protein ID;
  • Column 3: Species 2's gene ID;
  • Column 4: Species 1's gene ID.

i.e. "Species_1_proteinID Species_2_proteinID Species_2_geneID Species_1_geneID". This is the format produced by the APPLES_rbh module.

✅ The gene IDs in your rbhSearch_result.txt must match those in your PlantA.fa file. If these don't match, the program will not produce any result.
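
Both checklist items can be tested before running the module. The sketch below is not part of APPLES: the function names are hypothetical, and two points are assumptions. First, a FASTA gene ID is taken to be the first whitespace-delimited token after `>`. Second, column 3 is compared against the species' own FASTA and column 4 against the central species' FASTA.

```python
def fasta_ids(lines):
    """Collect the first whitespace-delimited token after '>' on each header line."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def check_orthologs(rows, own_ids, central_ids):
    """Return (line_number, message) pairs for every problem found."""
    problems = []
    for number, row in enumerate(rows, start=1):
        row = row.rstrip("\n")
        if not row:
            continue  # ignore blank lines
        fields = row.split("\t")
        if len(fields) != 4:
            problems.append((number, f"expected 4 tab-separated columns, found {len(fields)}"))
            continue
        # Column order: central protein, own protein, own gene, central gene.
        _central_protein, _own_protein, own_gene, central_gene = fields
        if own_gene not in own_ids:
            problems.append((number, f"gene ID {own_gene} not in this species' FASTA"))
        if central_gene not in central_ids:
            problems.append((number, f"gene ID {central_gene} not in the central species' FASTA"))
    return problems


# Tiny in-memory example; real use would pass open file handles instead.
central_fasta = [">geneA1 some description", "ACGT"]
own_fasta = [">geneB1", "ACGT"]
rows = ["protA1\tprotB1\tgeneB1\tgeneA1\n", "protA2\tprotB2\tgeneB2\n"]
for number, message in check_orthologs(rows, fasta_ids(own_fasta), fasta_ids(central_fasta)):
    print(f"line {number}: {message}")
```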

Screenshot of APPLES_conservation_multiple on CyVerse DE

APPLES_rbh

The APPLES_rbh module finds orthologs as reciprocal best hits.

Run APPLES_rbh on CyVerse

Version History
  • 1.0
Inputs
  • Protein FASTA of Species A
  • Protein FASTA of Species B

UTR Tool

The APPLES_utr module extracts sequences based on the FASTA and GFF3 files of a species.

Screenshot of APPLES_utr on CyVerse DE

Run APPLES_utr on CyVerse

Version History
  • 1.1-stable Added parallelisation option [fa9ebdd]
  • 1.0 Simple version adopted from Grannysmith
Inputs

For a Species X:

  • Gene FASTA* - The file from which you wish to extract your sequences. Provided that you have the matching GFF3 annotation, this file may be genome-based, scaffold-based or otherwise.
  • GFF3* - The file which annotates the FASTA file.
  • Gene ID Identifier Text** - The text which prefixes the Gene ID in the 9th column of the GFF3 file. Check your GFF3 to see what goes here.
  • Sequence Length - The number of bases which you wish to extract upstream.
  • Stop at Neighbouring Gene - Check this if you wish the sequence extraction to stop at the neighbouring gene.
  • Include the 5-prime UTR region - Check this to start the upstream region at the TSS so that the sequence includes the UTR region. Otherwise, the extraction starts at 5-prime.
* - Sequences of a species are queried from a pair of FASTA and GFF3 files. This requires the Sequence IDs in both files to match. In the FASTA file, this is the ID following the `>` character in the description lines; in the GFF3 file, this is the value stored in the first column of the gene lines (i.e. lines that say "gene" in the 3rd column).
** - To understand how the `Gene ID Identifier Text` works, here are a couple of examples:

Use "ID=" if your `gff3` file looks like this:
`Niben101Scf00059        maker   gene    513034  528469  .       +       .       ID=Niben101Scf00059g04019;Alias=maker-Niben101Scf00059-snap-gene-4.18`

Use "ID=gene:" if your `gff3` file looks like this:
`1       tair    gene    31170   33153   .       -       .       ID=gene:AT1G01050;Name=PPA1;biotype=protein_coding;description=Soluble inorganic pyrophosphatase 1 [Source:UniProtKB/Swiss-Prot%3BAcc:Q93V56];gene_id=AT1G01050;logic_name=tair`
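
The effect of the identifier text can be sketched as follows. This is an illustration of the idea and not the parser used by APPLES_utr; the function name `gene_id_from_gff3_line` is hypothetical, and the assumption is that the gene ID is whatever follows the identifier text up to the next `;` in the 9th column.

```python
def gene_id_from_gff3_line(line, identifier_text):
    """Return the gene ID of a GFF3 'gene' line, or None if the line does not qualify."""
    if line.startswith("#"):
        return None  # comment or directive line
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "gene":
        return None
    # Attributes in the 9th column are separated by ';'.
    for attribute in fields[8].split(";"):
        if attribute.startswith(identifier_text):
            return attribute[len(identifier_text):]
    return None


# The two example lines from above, rebuilt with tab separators.
niben = "\t".join(["Niben101Scf00059", "maker", "gene", "513034", "528469", ".", "+", ".",
                   "ID=Niben101Scf00059g04019;Alias=maker-Niben101Scf00059-snap-gene-4.18"])
tair = "\t".join(["1", "tair", "gene", "31170", "33153", ".", "-", ".",
                  "ID=gene:AT1G01050;Name=PPA1;biotype=protein_coding"])

print(gene_id_from_gff3_line(niben, "ID="))
print(gene_id_from_gff3_line(tair, "ID=gene:"))
```

In this sketch, using "ID=" on the second line would leave the `gene:` prefix attached to the ID, which is why the longer identifier text is needed for that file.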

Conservation Module

The APPLES_conservation module finds non-coding conserved regions.

Screenshot of APPLES_conservation on CyVerse DE

Run APPLES_conservation on CyVerse

Inputs

There are three sections of inputs for the conservation module. The first two are identical to those of the APPLES_utr module, one for each of the two species. In the third section:

  • Orthologs - A total of 4 columns (tab-separated) are required in this file. Column 1: Species A's protein ID; Column 2: Species B's protein ID; Column 3: Species B's gene ID; Column 4: Species A's gene ID. i.e. "SpeciesA_proteinID SpeciesB_proteinID SpeciesB_geneID SpeciesA_geneID". This is the format in which results from the APPLES_rbh module are produced.
  • Orthologs Mode - Results from the Pseudo-Orthologs option serve as a control and are only useful when compared with the results produced using the corresponding (proper) orthologs. If you don't know what this means, please use the default mode.
  • Window Size - The Seaweed algorithm aligns one substring of each of the given sequences at a time (the lengths of the sequences are specified in each species's "Sequence Length" argument). The length of this substring is called the "Window Size". It is recommended to use one of these values: 30 / 60 (default) / 80 / 100.
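
To make the relationship between "Sequence Length" and "Window Size" concrete, the sketch below enumerates the fixed-length substrings of a sequence. It is not the Seaweed algorithm and performs no alignment; the function names are hypothetical, and stepping the window one base at a time is an assumption of the sketch.

```python
def windows(sequence, window_size=60):
    """Yield (start, substring) for every window of the given size."""
    for start in range(len(sequence) - window_size + 1):
        yield start, sequence[start:start + window_size]


def window_pair_count(length_a, length_b, window_size=60):
    """Number of window pairs between two sequences of the given lengths."""
    per_a = max(length_a - window_size + 1, 0)
    per_b = max(length_b - window_size + 1, 0)
    return per_a * per_b


# Short example with a window size of 4 on a 6-base sequence.
for start, substring in windows("ACGTAC", window_size=4):
    print(start, substring)

print(window_pair_count(2000, 2000, window_size=60))
```

A sequence shorter than the window size yields no windows at all, so the "Sequence Length" of each species should be at least as large as the chosen "Window Size".
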
Parallelisation

Use the following command to split the orthologs file into one chunk per available processor, without splitting lines: `split -d --number=l/$(nproc) rbhSearch_result_PlantA_PlantB.txt rbhSearch_result_PlantA_PlantB.txt`
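
Where GNU `split` is not available, a similar result can be obtained with a few lines of Python. This is a rough equivalent and not an exact one: GNU `split` balances the chunks by size in bytes, whereas this sketch balances them by line count. The function names are hypothetical. Output files take the input name plus a two-digit numeric suffix, mirroring the `-d` option.

```python
import os


def split_lines(lines, parts):
    """Divide lines into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(lines), parts)
    chunks, start = [], 0
    for index in range(parts):
        size = base + (1 if index < extra else 0)
        chunks.append(lines[start:start + size])
        start += size
    return chunks


def write_chunks(path, parts=None):
    """Write the chunks next to `path` as path00, path01, ... and return their names."""
    parts = parts or os.cpu_count() or 1
    with open(path) as handle:
        lines = handle.readlines()
    names = []
    for index, chunk in enumerate(split_lines(lines, parts)):
        name = f"{path}{index:02d}"
        with open(name, "w") as out:
            out.writelines(chunk)
        names.append(name)
    return names


# In-memory example: five lines split into two chunks.
print(split_lines(["a\n", "b\n", "c\n", "d\n", "e\n"], 2))
```

Calling `write_chunks("rbhSearch_result_PlantA_PlantB.txt")` would then produce one chunk per CPU reported by `os.cpu_count()`.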

Accessibility

As with all of the CyVerse UK applications developed at Warwick, there are 3 options when it comes to using our applications:

  1. Via the CyVerse Discovery Environment. This is the recommended approach for new users and the easiest option, since a full user interface is provided.
  2. Using the Docker images that are available on our Docker Hub repository 🐳. Each application/tool has a corresponding image.
  3. With the source code that is hosted on our GitHub repository. This approach will give you more information on how the application actually works. We are always looking to improve our code, so feel free to send us a pull request.

The modules related to APPLES can be searched on the CyVerse Discovery Environment using the "apples" keyword in the application search box as shown in this screenshot:

Search for APPLES on CyVerse DE
