finder is a gene annotation pipeline that automates downloading short reads, aligning them, and using the assembled transcripts to generate gene annotations. It also uses protein sequences and reports gene predictions made by BRAKER2. It is a fast, scalable, platform-independent tool that generates gene annotations in GTF format. finder accepts input through a command-line interface. It finds several novel genes/transcripts and also reports the tissues/conditions in which they were found. If you use finder for your research, please cite:
Sagnik Banerjee, Priyanka Bhandary, Margaret Woodhouse, Taner Z Sen, Roger P Wise, and Carson M Andorf. FINDER: an automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences. BMC Bioinformatics.
finder requires a number of software packages to be installed. This might cause version conflicts with software that is already installed on your system. Hence, the developers have decided to enforce the use of finder within a conda environment.
Execute the following commands to download and install Anaconda. You may replace the installer with the latest version of Anaconda from https://www.anaconda.com/distribution/. conda installs into the home directory by default, so make sure you choose a directory that has sufficient disk space for all the software packages.
wget https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh
bash Anaconda3-2020.11-Linux-x86_64.sh # do not install VS
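git clone https://github.com/sagnikbanerjee15/finder.git Finder # clone into a directory named Finder so that the next command works on case-sensitive file systems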
cd Finder
conda env create -f environment.yml # This will create an environment named finder_conda_env
conda activate finder_conda_env
cd dep
If you want to avoid using git clone and decide to download the zip package instead, follow these steps:
wget https://github.com/sagnikbanerjee15/Finder/archive/master.zip
unzip master.zip
mv Finder-master Finder
cd Finder
conda env create -f environment.yml # This will create an environment named finder_conda_env
conda activate finder_conda_env
cd dep
While running conda env create -f environment.yml you might get an error about conflicts. Follow these steps to get rid of that error:
cat /dev/null > ~/.condarc
vim ~/.condarc
# Copy paste the following part
channels:
- defaults
- conda-forge
channel_priority: true
conda update --all
conda env create -f environment.yml # This will create an environment named finder_conda_env
conda activate finder_conda_env
cd dep
finder runs BRAKER2, which depends on GeneMark-ES. finder also needs GeneMarkS-T to predict coding regions of genes. Both GeneMark-ES and GeneMarkS-T are hosted on the Georgia Institute of Technology website. The license prohibits redistribution of the software, which is why it could not be included in this package. Hence, users have to manually download these 2 software packages and place them under the dep sub-directory. Please follow the instructions below to download the software and the key:
1. Open a browser of your choice
2. Go to this website
3. Select the option GeneMark-ES/ET/EP ver 4.62_lic (2nd from top) and LINUX 64
4. Enter your name, institution, country and email-id and click on the button that says I agree to the terms of this license agreement
5. Right click on the link that says Please download program here and select Copy Link Address
6. Then type in wget and paste the path you just copied
7. This command will download the file gmes_linux_64.tar.gz into the dep sub-directory
8. Go to step 2
9. Select the option GeneMarkS-T (last option) and LINUX 64
10. Repeat steps 4-6
11. This command will download the file gmst_linux_64.tar.gz into the dep sub-directory
12. Now, right click on the link that says 64_bit and select Copy Link Address
13. Then type in wget and paste the path you just copied
14. This command will download the file gm_key_64.tar.gz into the dep sub-directory. This license key serves both programs. Please note that this key will expire one year after the date of download.
15. Execute the following commands:
cd ..
./install.py
echo "export PATH=\$PATH:$(pwd)" >> ~/.bashrc # Add this path permanently to the bashrc file
export PATH=$PATH:$(pwd)
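Before running ./install.py it can help to confirm that the three downloads from the steps above actually landed in dep. The following is a minimal sketch, not part of finder itself; the helper name missing_archives is hypothetical, and it assumes the archives keep the file names mentioned in the steps.

```python
from pathlib import Path

# File names as mentioned in the download steps above.
REQUIRED_ARCHIVES = (
    "gmes_linux_64.tar.gz",  # GeneMark-ES/ET/EP
    "gmst_linux_64.tar.gz",  # GeneMarkS-T
    "gm_key_64.tar.gz",      # license key shared by both programs
)


def missing_archives(dep_dir):
    """Return the required archive names that are absent from dep_dir.

    Hypothetical helper for illustration; finder does not ship it.
    """
    dep = Path(dep_dir)
    return [name for name in REQUIRED_ARCHIVES if not (dep / name).is_file()]


if __name__ == "__main__":
    # Assumes the script is started from the top of the Finder directory.
    missing = missing_archives("dep")
    if missing:
        print("Missing from dep/: " + ", ".join(missing))
    else:
        print("All GeneMark archives found.")
```

Since the key expires one year after download, an old gm_key_64.tar.gz may pass this check and still be rejected by GeneMark.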
Please follow the instructions below to generate gene annotations using Arabidopsis thaliana. A csv file template has been provided with the release in example/Arabidopsis_thaliana_metadata.csv. Keep all the headers intact and replace the data with your samples of choice. Note that FINDER can work both with data downloaded from NCBI and with data in local directories. Below is a detailed description of each column of the metadata file. All the fields must be present in the metadata file. Mandatory fields must have some valid data, whereas other fields like Description, Date and Read Length can be left vacant.
Column Name | Column Description | Mandatory |
---|---|---|
BioProject | Name of the bioproject that the data belongs to. If you are using locally saved data, please enter a dummy project name. Please note that FINDER will NOT be able to process empty BioProject fields. | YES |
SRA Accession | Enter the SRA Accession number of the sample that you expect finder to use for generating the gene annotations. FINDER will use this ID to download the read samples from NCBI-SRA. If you wish to use data that is not currently uploaded to NCBI, enter the name of the local file instead. Do not enter any file extension in this field. For example, if your filename is sample1.fastq, please enter sample1 in this field. finder assumes all files have the extension fastq. If there are files on your system that end with .fq, please rename those to *.fastq. For paired-ended samples do not include the pair information in this field. For example, if you have 2 files sample2_1.fastq and sample2_2.fastq, please enter sample2 in this field. | YES |
Tissues | Mention the tissue type or condition from which the sample has been collected. finder will report the tissues that are associated with a particular transcript. This can be used to find gene models that are expressed in a specific tissue and/or condition. | YES |
Description | A brief description of the data. This field is not mandatory and is not used by finder. It is up to the user to enter whatever metadata is deemed important. | NO |
Date | Enter the date of producing the RNA-Seq sample. This field is not mandatory and is not used by finder. | NO |
Read Length (bp) | Enter the length of the reads. This field is not mandatory and is not used by finder. | NO |
Ended | Enter either PE or SE for paired-ended or single-ended reads. No other value should be entered. | YES |
RNA-Seq | Enter 1 for all the rows. This field is included for future extensions. | YES |
process | Enter 1 if you wish to process the sample. If a value of 0 is present, finder will ignore the sample. | YES |
Location | Enter the location of the directory. For samples to be downloaded from NCBI, this field should be left empty. If the location of a directory is provided here, finder will assume that the sample is present in it. finder will generate an error if the sample is not found in this directory. It is not necessary to have all the samples in the same directory. | YES |
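The column rules above are easy to check before a long run. The following is a minimal sketch, not part of finder; it assumes the header row of the csv file uses exactly the column names shown in the table (the shipped template may spell them differently), and check_metadata is a hypothetical helper name.

```python
import csv

# Assumption: the csv header uses the column names from the table above.
COLUMNS = ["BioProject", "SRA Accession", "Tissues", "Description", "Date",
           "Read Length (bp)", "Ended", "RNA-Seq", "process", "Location"]
# Columns that need a value in every row. Location is left out because it
# stays empty for samples that are downloaded from NCBI.
NON_EMPTY = ["BioProject", "SRA Accession", "Tissues", "Ended", "RNA-Seq", "process"]


def check_metadata(path):
    """Return a list of human-readable problems found in the metadata file."""
    problems = []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [name for name in COLUMNS if name not in header]
        if missing:
            return ["missing columns: " + ", ".join(missing)]
        for number, row in enumerate(reader, start=1):
            value = {name: (row.get(name) or "").strip() for name in COLUMNS}
            for name in NON_EMPTY:
                if not value[name]:
                    problems.append(f"sample {number}: {name} is empty")
            if value["Ended"] and value["Ended"] not in ("PE", "SE"):
                problems.append(f"sample {number}: Ended must be PE or SE")
            if value["RNA-Seq"] and value["RNA-Seq"] != "1":
                problems.append(f"sample {number}: RNA-Seq must be 1")
            if value["process"] and value["process"] not in ("0", "1"):
                problems.append(f"sample {number}: process must be 0 or 1")
            if value["SRA Accession"].endswith((".fastq", ".fq")):
                problems.append(f"sample {number}: SRA Accession must not carry a file extension")
    return problems
```

An empty list means none of these particular rules was violated; it does not guarantee that finder will accept the file.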
To optimize disk space usage, finder will process read samples from one bioproject at a time. Once the data is downloaded and the reads are mapped, FINDER will remove those data (unless -no_cleanup is specified) to save disk space. Samples that were locally present will not be removed.
finder uses STAR and OLego for aligning and PsiCLASS for assembling. Users can either generate the genome index themselves and provide it to finder, or allow finder to generate the index. For organisms with large genomes, STAR might require more memory, so users might need to run the STAR genome index generation on a computer that permits the use of enough memory. Download the Arabidopsis thaliana genome using the following command:
cd example
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-49/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa.gz
Unzip the genome using this command
gunzip Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa.gz
This command will generate the file Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa, which contains the entire genome of Arabidopsis thaliana.
The STAR genome index can be created with the following command:
CPU=30 #Enter the number of CPUs that you are permitted to use
mkdir star_index_without_transcriptome
STAR --runMode genomeGenerate --runThreadN $CPU --genomeDir star_index_without_transcriptome --genomeSAindexNbases 12 --genomeFastaFiles Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa
Similarly, the OLego index can be generated with the following command:
../dep/olego/olegoindex -p olego_index Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa
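The STAR command above passes --genomeSAindexNbases 12. The STAR manual suggests scaling this parameter for small genomes to min(14, log2(GenomeLength)/2 - 1). The sketch below applies that rule to a FASTA file so the value can be adapted to other organisms. It is an illustration only; the function names are hypothetical and finder does not necessarily compute the value this way.

```python
import math


def genome_length(fasta_path):
    """Sum the sequence lengths in an uncompressed FASTA file."""
    total = 0
    with open(fasta_path) as handle:
        for line in handle:
            if not line.startswith(">"):  # skip header lines
                total += len(line.strip())
    return total


def sa_index_nbases(length):
    """Rule of thumb from the STAR manual: min(14, log2(GenomeLength)/2 - 1)."""
    return min(14, math.floor(math.log2(length) / 2 - 1))


if __name__ == "__main__":
    # A genome of roughly 120 Mb, which is about the size of the
    # Arabidopsis thaliana assembly, gives the value used above.
    print(sa_index_nbases(120_000_000))
```

For a genome of several gigabases the rule saturates at STAR's default of 14.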
Help menu for FINDER can be launched by the following command:
finder -h
usage: finder [-h] --metadatafile METADATAFILE --output_directory
OUTPUT_DIRECTORY --genome GENOME [--cpu CPU]
[--genome_dir_star GENOME_DIR_STAR]
[--genome_dir_olego GENOME_DIR_OLEGO] [--verbose VERBOSE]
[--protein PROTEIN] [--no_cleanup] [--preserve_raw_input_data]
[--checkpoint CHECKPOINT]
[--perform_post_completion_data_cleanup] [--run_tests]
[--addUTR] [--skip_cpd]
Generates gene annotation from RNA-Seq data
optional arguments:
-h, --help show this help message and exit
Required arguments:
--metadatafile METADATAFILE, -mf METADATAFILE
Please enter the name of the metadata file. Enter 0 in the last column of those samples which you wish to skip processing. The columns should represent the following in order --> BioProject, SRA Accession, Tissues, Description, Date, Read Length, Ended (PE or SE), RNA-Seq, process, Location. If the sample is skipped it will not be downloaded. Leave the directory path blank if you are downloading the samples. In the end of the run the program will output a csv file with the directory path filled out. Please check the provided csv file for more information on how to configure the metadata file.
--output_directory OUTPUT_DIRECTORY, -out_dir OUTPUT_DIRECTORY
Enter the name of the directory where all other operations will be performed
--genome GENOME, -g GENOME
Enter the SOFT-MASKED genome file of the organism
Optional arguments:
--cpu CPU, -n CPU Enter the number of CPUs to be used.
--genome_dir_star GENOME_DIR_STAR, -gdir_star GENOME_DIR_STAR
Please enter the location of the genome index directory of STAR
--genome_dir_olego GENOME_DIR_OLEGO, -gdir_olego GENOME_DIR_OLEGO
Please enter the location of the genome index directory of OLego
--verbose VERBOSE, -verb VERBOSE
Enter a verbosity level
--protein PROTEIN, -p PROTEIN
Enter the protein fasta
--no_cleanup, -no_cleanup
Provide this option if you do not wish to remove any intermediate files. Please note that this will NOT remove any files and might take up a large amount of space
--preserve_raw_input_data, -preserve
Set this argument if you want to preserve the raw fastq files. All other temporary files will be removed. These fastq files can be later used.
--checkpoint CHECKPOINT, -c CHECKPOINT
Enter a value if you wish to restart operations from a certain check point. Please note if you have new RNA-Seq samples, then FINDER will override this argument and computation will take place from read alignment. If there are missing data in any step then also FINDER will enforce restart of operations from a previous checkpoint. For example, if you wish to run assembly on samples for which alignments are not available then FINDER will readjust this value and set it to 1.
1. Align reads to reference genome (Will trigger removal of all alignments and start from beginning)
2. Assemble with PsiCLASS (Will remove all assemblies)
3. Find genes with FINDER (entails changepoint detection)
4. Predict genes using BRAKER2 (Will remove previous results of gene predictions with BRAKER2)
5. Annotate coding regions
6. Merge FINDER annotations with BRAKER2 predictions and protein sequences
--perform_post_completion_data_cleanup, -pc_clean
Set this field if you wish to clean up all the intermediate files after the completion of the execution. If this operation is requested prior to generation of all the important files then it will be ignored and finder will proceed to annotate the genome.
--run_tests, -rt Modify behaviour of finder to accelerate tests. This will reduce the downloaded fastq files to a bare minimum and also check the other installations
--addUTR, --addUTR Turn on this option if you wish BRAKER to add UTR sequences
--skip_cpd, --skip_cpd
Turn on this option to skip changepoint detection. Could be effective for grasses
--exonerate_gff3 EXONERATE_GFF3, -egff3 EXONERATE_GFF3
Enter the exonerate output in gff3 format
finder can be launched using the following command:
finder -no_cleanup -mf Arabidopsis_thaliana_metadata.csv -n $CPU -gdir_star $PWD/star_index_without_transcriptome -out_dir $PWD/FINDER_test_ARATH -g $PWD/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa -p $PWD/uniprot_ARATH.fasta -gdir_olego olego_index -preserve 1> $PWD/FINDER_test_ARATH.output 2> $PWD/FINDER_test_ARATH.error
This program will download and run the entire process of annotation. The duration of execution will depend on your internet speed and the number of cores you assigned to FINDER. Also, FINDER is designed in a way to handle a large number of RNA-Seq samples. So the speedup might not be noticeable with just a few samples.
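When finder is launched from a script, for example on a cluster, the command line above can be assembled programmatically. The following is a minimal sketch that uses only the long option names listed in the help menu; build_finder_command and run_finder are hypothetical helper names, not part of finder.

```python
import subprocess


def build_finder_command(metadata_file, output_directory, genome, cpu=1,
                         star_index=None, olego_index=None, protein=None,
                         no_cleanup=False, preserve_raw_input_data=False):
    """Assemble the finder argument list from the options in the help menu."""
    command = ["finder",
               "--metadatafile", str(metadata_file),
               "--output_directory", str(output_directory),
               "--genome", str(genome),
               "--cpu", str(cpu)]
    if star_index:
        command += ["--genome_dir_star", str(star_index)]
    if olego_index:
        command += ["--genome_dir_olego", str(olego_index)]
    if protein:
        command += ["--protein", str(protein)]
    if no_cleanup:
        command.append("--no_cleanup")
    if preserve_raw_input_data:
        command.append("--preserve_raw_input_data")
    return command


def run_finder(command, log_prefix):
    """Run finder, writing stdout and stderr to <log_prefix>.output/.error."""
    with open(log_prefix + ".output", "w") as out, \
            open(log_prefix + ".error", "w") as err:
        return subprocess.run(command, stdout=out, stderr=err).returncode
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.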
We recommend that you preserve all intermediate files while running finder and then run the following command to remove them:
finder -no_cleanup -mf Arabidopsis_thaliana_metadata.csv -n $CPU -gdir_star $PWD/star_index_without_transcriptome -out_dir $PWD/FINDER_test_ARATH -g $PWD/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa -p $PWD/uniprot_ARATH.fasta -gdir_olego olego_index -preserve -pc_clean 1> $PWD/FINDER_test_ARATH.output 2> $PWD/FINDER_test_ARATH.error
finder allows users to enforce execution from a specific checkpoint. Requesting a particular checkpoint does not mean that finder will skip all previous steps. It means that finder will remove all files generated by processes after the checkpoint to ensure that the modules recalculate those. Below is a description of all the checkpoints that finder can accept:
1. Align reads to reference genome - Requesting finder to start from this checkpoint will trigger removal of all previous alignments.
2. Assemble with PsiCLASS - Requesting finder to start from this checkpoint will trigger removal of assemblies that were previously generated. Aligned files will not be removed. If some RNA-Seq samples are not aligned, FINDER will align those first before attempting to assemble them.
3. Find genes with finder - finder will regenerate all files post assembly by PsiCLASS.
4. Predict genes using BRAKER2 - finder will rerun the BRAKER2 step.
5. Annotate coding regions - finder will restart from annotating the coding sequences.
6. Merge finder annotations with BRAKER2 predictions and protein sequences - finder will generate merged annotations from RNA-Seq samples, predictions and protein sequences.
If you wish to start finder from downloading the SRA samples, please delete the output directory and start over.
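The checkpoint behaviour described above can be pictured with a small model. This sketch reflects one reading of the description (everything from the effective checkpoint onward is recomputed, and a request is reset to checkpoint 1 when alignments are missing); it is not finder's actual code, and steps_to_recompute is a hypothetical name.

```python
# Checkpoint numbers and names as listed in the help menu.
CHECKPOINTS = {
    1: "Align reads to reference genome",
    2: "Assemble with PsiCLASS",
    3: "Find genes with FINDER",
    4: "Predict genes using BRAKER2",
    5: "Annotate coding regions",
    6: "Merge FINDER annotations with BRAKER2 predictions and protein sequences",
}


def steps_to_recompute(requested, alignments_available=True):
    """Return the steps that would be rerun for a requested checkpoint.

    Assumption: missing alignments push the effective checkpoint back to 1,
    as in the example given in the --checkpoint help text.
    """
    if requested not in CHECKPOINTS:
        raise ValueError("checkpoint must be between 1 and 6")
    start = requested if alignments_available else 1
    return [CHECKPOINTS[number] for number in range(start, len(CHECKPOINTS) + 1)]
```

For example, steps_to_recompute(5) lists only the last two steps, while steps_to_recompute(2, alignments_available=False) lists all six.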
All relevant output files generated by finder can be found in the final_GTF_files directory under the output directory. Below is the list of files and what data they contain:
- braker.gtf - gene models generated by BRAKER2
- braker_utr.gtf - gene models, with UTR models, generated by BRAKER2
- combined_redundant_transcripts_removed.gtf - GTF file from PsiCLASS output
- combined_split_transcripts_with_bad_SJ_redundancy_removed.gtf - GTF file after splitting transcripts. This file is generated only from RNA-Seq expression evidence
- combined_with_CDS.gtf - finder output with CDS predicted by GeneMarkS-T
- combined_with_CDS_high_conf.gtf - finder gene models with high confidence
- combined_with_CDS_low_conf.gtf - finder gene models with low confidence
- combined_with_CDS_BRAKER_appended_high_conf.gtf - High confidence gene models from RNA-Seq evidence combined with BRAKER2 gene models
- combined_with_CDS_high_and_low_confidence_merged.gtf - High and low confidence gene models from RNA-Seq evidence combined with BRAKER2 gene models
- FINDER_BRAKER_PROT.gtf - High confidence gene models from RNA-Seq evidence combined with BRAKER2 gene models and gene models from protein evidence
- tissue/condition to transcript - A file with two columns. The first column lists the transcripts and the second column lists the tissues/conditions they were found in. In future versions, we will include the functionality of extracting transcripts specific to a tissue/condition.
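Until that functionality exists, transcripts specific to one tissue/condition can be pulled from the two-column file with a few lines of Python. This is a minimal sketch under loudly stated assumptions: the file is taken to be tab-separated, and multiple tissues in the second column are taken to be separated by ";". Neither detail is confirmed here, so check the real file and adjust the delimiter and separator. Both function names are hypothetical.

```python
import csv
from collections import defaultdict


def load_tissue_map(path, tissue_separator=";"):
    """Map each transcript to the set of tissues/conditions it was found in.

    Assumptions: tab-separated columns; tissues joined by tissue_separator.
    """
    tissues_by_transcript = defaultdict(set)
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            transcript, tissues = row[0], row[1]
            for tissue in tissues.split(tissue_separator):
                if tissue.strip():
                    tissues_by_transcript[transcript].add(tissue.strip())
    return dict(tissues_by_transcript)


def transcripts_specific_to(tissues_by_transcript, tissue):
    """Return transcripts that were found in the given tissue and no other."""
    return sorted(transcript
                  for transcript, found in tissues_by_transcript.items()
                  if found == {tissue})
```

A transcript counts as specific here only if the given tissue is the single entry in its second column.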
finder generates several intermediate files and folders. This section contains a detailed outline of the contents of each folder and what each file represents.
finder is configured to write information to a log file named progress.log in the output directory. When reporting issues, please make sure you attach the log file.
finder offers users the opportunity to add data to already completed annotation runs. Users need to update the metadata csv file with the new RNA-Seq data and rerun finder. The program will determine an optimal starting point. finder will skip downloading of already processed RNA-Seq samples and will proceed with the new data. Users also have the option of removing some previously supplied RNA-Seq samples.
finder offers users 2 utilities which can be used independently.
- downloadAndDumpFastqFromSRA.py - A Python program that optimizes the download of data from SRA. IDs of RNA-Seq (or any other sequencing) runs need to be provided as a newline-separated file. The program will download the RNA-Seq files using the requested number of cores, convert them to fastq and remove the .sra files. downloadAndDumpFastqFromSRA.py will continuously query the SRA database in the event of a failure.
python downloadAndDumpFastqFromSRA.py -h
usage: download_and_dump_fastq_from_SRA.py [-h] --sra SRA --output OUTPUT [--cpu CPU]
Parallel download of fastq data from NCBI. Program will create the output directory if it is not present. If fastq file is present, then downloading is skipped. Program optimizes downloading of sra files and converting to fastq by utilizing multiple CPU cores.
optional arguments:
-h, --help show this help message and exit
--sra SRA, -s SRA Please enter the name of the file which has all the SRA ids listed one per line. Please note the bioproject IDS cannot be processed
--output OUTPUT, -o OUTPUT
Please enter the name of the output directory. Download will be skipped if file is present
--cpu CPU, -n CPU Enter the number of CPUs to be used.
- verifyInputsToFINDER.py - This program will verify whether all the requested samples are in fact from a transcriptomic source of the organism whose genome is being annotated.
python verifyInputsToFINDER.py -h
usage: verify_inputs_to_finder.py [-h] --metadatafile METADATAFILE --srametadb SRAMETADB --taxon_id TAXON_ID
Verifies whether all the data are transcriptomic and from the organism under consideration
optional arguments:
-h, --help show this help message and exit
--metadatafile METADATAFILE, -mf METADATAFILE
Please enter the name of the metadata file. Enter 0 in the last column of those samples which you wish to skip processing. The columns should represent the following in order --> BioProject,Run,tissue_group,tissue,description,Date,read_length,ended (PE or SE),directorypath,download,skip. If the sample is skipped it will not be downloaded. Leave the directory path blank if you are downloading the samples. In the end of the run the program will output a csv file with the directory path filled out. Please check the provided csv file for more information on how to configure the metadata file.
--srametadb SRAMETADB, -m SRAMETADB
Enter the location of the SRAmetadb file.
--taxon_id TAXON_ID, -t TAXON_ID
Enter the taxonomic id of the organism. Enter -1 if you are working on a non-model organism or a sub-species for which no taxonomic id exists.
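The newline-separated ID file expected by downloadAndDumpFastqFromSRA.py can be derived from an existing finder metadata file. The following is a minimal sketch, not part of finder; it assumes the csv header uses the column names SRA Accession, process and Location from the metadata table, and write_sra_id_list is a hypothetical helper name.

```python
import csv


def write_sra_id_list(metadata_csv, id_list_path):
    """Write one SRA accession per line for samples that must be downloaded.

    A sample is kept when process is 1 and Location is empty, that is, when
    it is not already present in a local directory.
    Assumption: the header uses the column names from the metadata table.
    """
    ids = []
    with open(metadata_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            accession = (row.get("SRA Accession") or "").strip()
            location = (row.get("Location") or "").strip()
            process = (row.get("process") or "").strip()
            if accession and not location and process == "1" and accession not in ids:
                ids.append(accession)
    with open(id_list_path, "w") as out:
        for accession in ids:
            out.write(accession + "\n")
    return ids
```

The resulting file can then be passed to the utility with the options shown in its help text, for example: python downloadAndDumpFastqFromSRA.py --sra sra_ids.txt --output fastq_files --cpu 4 (the file and directory names here are placeholders).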
MIT License
Copyright (c) [2021] [Banerjee]
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Please report all issues here.