aswarren/grc

See the file named "LICENSE" for license information.

A. Warren and J. Setubal, “The Genome Reverse Compiler: an explorative annotation tool,” BMC Bioinformatics, vol. 10, no. 1, p. 35, 2009.

About GRC:

The Genome Reverse Compiler (GRC) is an automated annotation tool designed for prokaryotic genomes. The goal is to provide an open-source, easy-to-run, very efficient annotation program. In this initial version 1.0, GRC only annotates protein-coding genes.

In addition to the genome sequence, GRC requires a multiFASTA file with annotated genes. GRC will perform best when these genes are well annotated and come from organisms closely related to the target genome. GRC finds genes and assigns functional annotations based on sequence similarity. It incorporates an open source version of BLAST (FSA-BLAST)[1] to perform functional assignment.
First grc_orfs is run to generate all possible ORFs for the given genome. The translated sequences are then checked for sequence similarity against the user-specified database using FSA-BLAST. Some ORFs are then discarded and start sites adjusted based on overlap, BLAST score, length, and sequence content[2]. Resulting putative genes are assigned functions based on the annotation of their corresponding best BLAST hits.

Downloading, installing, and requirements:

The GRC is available to download as source code and comes with precompiled binaries on an intel x86 linux machine. If you are unable to execute the binaries provided then you must recompile. An install script is available under the scripts directory which will attempt to recompile each executable and place it in its appropriate location.
In order to use the automated install of GRC you must have Perl, g++, and GNU Make (or an equivalent) installed on your system. GRC has been successfully tested in a Linux environment. For Windows users it is also possible to run GRC in Cygwin as long as the above install requirements are met.
On a typical unix system you can decompress using the following commands:
gzip -d GRC.tar.gz
tar -xf GRC.tar

The install script 'install.pl' in the scripts directory will handle building and moving the binaries to the correct location. This script assumes that g++ and perl are installed.

Input formats: [WARNING! Mac style line breaks ^M are not supported}
GRC expects the input to be in a certain format.
The extensions for these formats are (*.faa *.ptt *.fna *.fasta *.CP *.goa). GRC expects that the file's extension match it's formatting, if it does not then GRC will most likely crash.

(*.fna) NCBI "nucleotide fasta" format. Currently this is the only format/extension that GRC supports for the genomic sequence. Also it is assumed that there is only one (prokaryotic) replicon per file.

(*.faa) NCBI "amino acid fasta" format. This file is used by GRC to create a sequence database that is used to annotate the genome of interest.

(*.ptt) NCBI "protein table" format. This file can be used in two different ways. (1) If it is provided in the specified database folder (see Running the program) then it will be automatically parsed and merged with the cooresponding *.faa file. This feature exists because the location of the function description is not standard in fasta headers. So the merging of the *.ptt function and the *.faa sequence will lead to more concise annotation than if left to *.faa parsing guesswork. (2) GRC has a compare feature (using the -r command line option) that gives an evaluation of the GRC annotation with respect to a designated reference file. The *.ptt file can be used to evaluate GRC's performance if the NCBI formats are in use.

(*.fasta) EMBL "amino acid fasta" format. This sequence format is like the *.faa format in creating sequence database.

(*.goa) EMBL "gene ontology annnotation" format. Similar to (*.ptt) this format can be used it two ways. (1) As an annotation file used in creating a database with *.fasta files. Again each *.goa file in the specified database directory (see below) will be merged with its cooresponding *.fasta file to create an annotation database. (2) This format can also be used in conjunction with the *.CP format to evaluate the performance of the GRC using the Gene Ontology[3]. This aspect of the GRC is explained in the "Evaluation, Using GO" section.

(*.CP) EMBL "chromosome table" format. This format is used only in the evaluation of GRC performance (-r command line option) when EMBL formats are in use.

*.goa, *.CP, and *.fasta files can be found at http://www.ebi.ac.uk/integr8

*.ptt, *.fna, and *.faa files can be found at http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi

Running the program:
Required parameters:
-g [target genome file]
-d [target annotation database]

In order to annotate a genome the user must provide the target genome and annotated, amino acid file(s) for proteins from closely related organisms. Currently the GRC only supports specific fasta formats. The annotation files (or annotation database) must be in fasta format with *.faa or *.fasta extensions. The genome sequence must be in fasta format with a *.fna extension.
There are two options in specifiying the files to be used for the annotation database (1) specify a directory containing the appropriate files or (2) specify a file containing all annotations that are to be used. If a directory is given the GRC will attempt to merge all files with the appropriate extensions to create an annotation database file, inside the 'DB' directory, named 'Automerge.faa'. For convenience, a fasta merging script is available under the scripts directory that will merge fasta files given as command line parameters (*!This file is used by GRC when automerging so do not remove it!*).

Example 1 for running GRC:
./GRCv1.0.pl -g AE008687.fna -d ./DBdir/

Example 2 for running GRC:
./GRCv1.0.pl -g AE008687.fna -d ./DBdir/DC3000.faa

Other options:
GRC uses the start codons specified in resources/StartCodons.txt to determine ORFs and likely TIS.

Other command line options include:

-m [Min. Length(nt)] The minimum length, in nucleotides, used for finding ORFs (default is 300). All ORFs found will be of at least this many nucleotides or greater.
-h [Num. BLAST hits] The number of BLAST hits to use in determining whether a gene exists (default is 10), its start site, and its function.
-r [reference annotation] A reference annotation provided by the user which GRC will use to evaluate the results of the annotation using the grc_compare test module (see below). This file should be a *.ptt file (NCBI) or a *.CP file (EMBL) with its corresponding *.goa file placed in the same directory.
-k Option to keep the name the blast results uniquely with date and time so that they will not be over written during the next run.
-p [letter codes] With additional value 'A' gives amino acid fasta file output, 'N' gives nucleotide fasta file, and 'T' gives tbl2asn output. e.g. "-p ANT"
tbl2asn: This free NCBI program can be used to generate ASN and GBK files.
-x [coverage value 0 to 1] Sets the % coverage, of the query and subject, for the alignment, that is necessary to assign function e.g. "-x 0.70"
-t [int table number] Specify which genetic code to use, with Table 11 (standard Bacterial code) as the default. Use any one of the translation tables found at http://www.ncbi.nlm.nih.gov/Taxonomy/. The translation table controls which stop codons to use.
-y [Gene Ontology obo file] This file must be specified for GRC to use options (f, a, n, c). When this file is specified the GRC will automatically check the GO terms provided in the database to see if any terms have been out dated and will update them with an appropriate warning message.
-f [ECode filter] e.g. 'IEA ISS' Any GO term assignments in the database that are based only on the evidence codes specified will not be used.
-a [GO cat.] e.g. 'mbc' Only assign GO terms from the specified GO category. 'm' for molecular function, 'b' for biological process, 'c' for cellular component.
-n [GO min. depth] The minimum depth a GO term must have to be assigned in annotation
-c Enable consensus annotations. Consensus annotations are made from a distribution of GO terms from the top alignments to an ORF.

Other features:
GRC supports multiple replicons in the input genome file in creating the annotation. However, tbl2asn output and the comparison feature do not currently support multiple replicons.

Output files & formats:
Annotation Results:
Currently folders are created under the GRC/resuts/ directory. These folders are named according to the organism and minimum gene length specified. Three tab-delimited results files are placed in this folder.
(1) *.Pos Gives the putative genes that are likely to exist and their resulting annotation (according to GRC methodology).
(2) *.Neg Gives the orfs that were eliminated from the long-orf output.
(3) KnockList.txt Provides a record of why ORFs were placed in the *.Neg file (overlap or EDR/Alignment score).

(*.Pos) This is the annotation file for the target genome. It contains information about the genes that GRC "believes" exists.

(*.Neg) These are putative genes generated by the grc_orfs component that were later rejected by GRC.

(KnockList.txt) This file is generated by the rejection phase of GRC as record of "who knocked out who" and is used by the compare component to provide detailed information in the *.compare and *.FNAnalysis files.

(*.tbl) Table file that can be used as tbl2asn input (see above).
(*.pep) Amino acid file that can be used as tbl2asn input.

Evaluation Section:

The test module allows the user to evaluate the performance on a particular genome by simply specifying the reference annotation used for comparison as a command line parameter. When a reference annotation is given, the GRC generates performance information relative to gene finding and functional assignment. The resulting output will not only allow the user to evaluate performance at a high level, but also the decisions made by the GRC on an ORF by ORF basis. This is done by giving the output ORF’s relevant information paired with the reference ORF that led to its evaluation. We pair each true positive, false positive, and false negative ORF with the reference gene it overlaps with most. This enables the user to view the nature of the agreement/disagreement that is found in the comparison with a reference genome. When Gene Ontology is used, this record also includes term by term information with respect to how a GO annotation came to be classified as confirmed, compatible, or incompatible. It is thought that in this context the information provided will be valuable to researchers who wish to review the reference annotation with respect to the evidence introduced through the GRC.

As part of the comparison the GRC also creates a record of ORF removal (FNAnalysis.txt). This record details the conditions that led to the removal of ORFs deemed false negatives. This enables the user to evaluate the case where the GRC removes an ORF that the reference annotation asserts is a protein coding gene. This record indicates whether false negatives were removed due to their high EDR value or whether they were removed to resolve an overlap with another ORF thought to be protein coding. In the case of overlap we provide the information generated by GRC for both ORFs from its annotation of the target organism and the information from their respective reference ORFs. With this combined view, it is possible to both review the decision made by the GRC and to evaluate reference annotation itself. For example, if an FN is removed due to significant overlap with a TP or FN, then it is possible that this overlap was missed in the original reference annotation. Also, if a high-scoring FP occupies the same space as a low-scoring FN, it is possible that new information is being introduced that was not considered or not available during the annotation of the reference genome.

(*.compare) This file is the generated by the compare component (command option -r) of GRC. It gives statistics of GRC's performance with respect to the reference file and classifies putative genes as being True Positives (TP- sequences that are correctly put forward as genes), False Positives (FP- sequences that are incorrectly put forward as genes), (TN- sequences that were correctly rejected as genes), and False Negatives (FN- sequences that were incorrectly rejected).

(FNAnalysis) This file is generated by the compare component (command option -r) of GRC. It provides information with respect to the False Negatives (those genes that, according to the reference file, were erroneously rejected) more on this in the "Evaluation, Using GO" section.

Using NCBI Blast:
To use NCBI Blast installed on your machine instead of the provided fsa_blast:
Comment out line: chdir("$blastdir");
Replace: $status = system("$blastdir"."formatdb $DBFile");
With: $status = system("formatdb -p T -i $DBFile");
Replace: print "./blast -d $DBFile -e .001 -i $tempdir$transeqout -m 8 -v 1 -b $NumBHits -M $Matrix -z $DBSize >"."$BHName\n";
With: print "blastall -d $DBFile -e .001 -i $tempdir$transeqout -m 8 -v 1 -b $NumBHits -p blastp -M $Matrix -z $DBSize -o $BHName";
Replace: $status = system("./blast -d $DBFile -e .001 -i $tempdir$transeqout -m 8 -v 1 -b $NumBHits -M $Matrix -z $DBSize >"."$BHName");
With: $status= system("blastall -d $DBFile -e .001 -i $tempdir$transeqout -m 8 -v 1 -b $NumBHits -p blastp -M $Matrix -z $DBSize -o $BHName");

Reusing Blast output:
To reuse BLAST output you must use the -k option in order to have blast output copied to the results directory.
Then use "-b filename.bh" to reuse the *.bh file from a previous run.
Warning: This option will expect your BLAST output to be compatible with the DB that you specify in order to lookup the annotations and merge them into a format that GRC can use. If you switch databases and then try to reuse old BLAST results this could have dire consequences.

References:
[1] Cameron M, Williams HE, Cannane A. A deterministic finite automaton for faster protein hit detection in BLAST. J Comput Biol. 2006 May;13(4):965-78.
[2] Ouyang Z, Zhu H, Wang J, She ZS. Multivariate entropy distance method for prokaryotic gene identification. J Bioinform Comput Biol. 2004 Jun;2(2):353-73
[3] Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium

aswarren / grc

About

Languages