cmkobel / kofam_scan

CLI tool to annotate genes with KOfam

Home Page: https://www.genome.jp/tools/kofamkoala/

KofamScan

KofamScan is a gene function annotation tool based on KEGG Orthology and hidden Markov models. You need the KOfam database to use this tool. An online version is available at https://www.genome.jp/tools/kofamkoala/.

Requirements

  • Linux
  • Ruby >= 2.4
  • HMMER >= 3.1
  • GNU Parallel

If you wish to use Conda for managing dependencies, you can use the bundled environment.yml file to install these dependencies with conda env create -n kofam_scan -f environment.yml.

Usage

  1. Download the KOfam database from ftp://ftp.genome.jp/pub/db/kofam/ and decompress it. You will get the profile HMMs in the profiles/ directory and a ko_list file.
  2. Create config.yml in the same directory as the exec_annotation script. See below for details.
  3. Execute exec_annotation.
$ ./exec_annotation -o result.txt query.fasta
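
If you call KofamScan from another Ruby script, the command line can be assembled programmatically. The following is a minimal sketch, not part of KofamScan: the helper name annotation_command and its keyword arguments are hypothetical, and it only uses flags documented in the Options section.

```ruby
# Hypothetical helper (not part of KofamScan): assemble the exec_annotation
# argument list. Only flags documented in this README are used.
def annotation_command(query, output: nil, profile: nil, ko_list: nil, cpu: nil, format: nil)
  cmd = ["./exec_annotation"]
  cmd.push("-o", output) if output
  cmd << "--profile=#{profile}" if profile
  cmd << "--ko-list=#{ko_list}" if ko_list
  cmd << "--cpu=#{cpu}" if cpu
  cmd << "--format=#{format}" if format
  cmd << query
  cmd
end

cmd = annotation_command("query.fasta", output: "result.txt")
puts cmd.join(" ")
# To actually run it (requires KofamScan and the KOfam database to be set up):
#   system(*cmd) or abort("exec_annotation failed")
```

Passing the arguments to system as a list avoids shell quoting problems with unusual file names.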

Query file

A query file is a FASTA file with one or more amino acid sequences. Nucleotide sequences are not supported. Each sequence must have a unique name. The name of a sequence is the string between the header symbol (">") and the first blank character (space, tab, line break, etc.). Do not put whitespace right after ">".
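
The naming rules above can be checked before a long run. This is a minimal sketch, not part of KofamScan: check_query_fasta is a hypothetical helper that only inspects header lines, so it does not detect nucleotide sequences.

```ruby
# Hypothetical helper (not part of KofamScan): check FASTA headers against the
# naming rules described above. Returns a list of problems (empty if none).
def check_query_fasta(text)
  problems = []
  first_seen = {}
  text.each_line.with_index(1) do |line, lineno|
    next unless line.start_with?(">")

    rest = line[1..-1]
    if rest.strip.empty?
      problems << "line #{lineno}: empty sequence name"
    elsif rest =~ /\A\s/
      problems << "line #{lineno}: whitespace right after '>'"
    else
      # The name runs from '>' up to the first blank character.
      name = rest.split(/\s/, 2).first
      if first_seen.key?(name)
        problems << "line #{lineno}: duplicate name '#{name}' (first seen on line #{first_seen[name]})"
      else
        first_seen[name] = lineno
      end
    end
  end
  problems
end

sample = ">gene1 some description\nMKV\n>gene1\nMAA\n> gene3\nMTT\n"
puts check_query_fasta(sample)
```

For the sample text, the helper reports the repeated name gene1 and the header that starts with a space.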

Profiles

Specify the path of the profile database you downloaded by giving the --profile option to the command or by writing it to config.yml. The path can be a directory, a .hmm file, or a .hal file. If it is a directory, the .hmm files in that directory will be used. If it is a .hmm file, only that file will be used. If it is a .hal file, the files listed in it will be used. File paths in a .hal file are either absolute or relative to the directory containing the .hal file. Lines starting with # are ignored.

KOfam provides prokaryote.hal and eukaryote.hal in the profiles directory. They are lists of profiles that exclude eukaryote-specific and prokaryote-specific KOs, respectively. If you are interested in only a few KOs, you can create your own .hal file and use it as the database. This reduces computation time.
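
The .hal rules can be illustrated with a short sketch. It is not KofamScan's own code: write_hal and resolve_hal are hypothetical helpers, the sketch assumes that profile files are named after their K number (for example K00001.hmm), and skipping blank lines is an assumption of the sketch.

```ruby
require "tmpdir"

# Hypothetical helper: write a .hal file listing one profile per line.
# Assumes profiles are named "<K number>.hmm"; check your profiles/ directory.
def write_hal(hal_path, ko_numbers)
  File.write(hal_path, ko_numbers.map { |ko| "#{ko}.hmm\n" }.join)
end

# Hypothetical helper: expand a .hal file into absolute profile paths.
# Relative entries are resolved against the directory of the .hal file,
# and lines starting with '#' are ignored, as described above.
def resolve_hal(hal_path)
  base = File.dirname(File.expand_path(hal_path))
  File.readlines(hal_path)
      .map(&:strip)
      .reject { |entry| entry.empty? || entry.start_with?("#") }
      .map { |entry| File.expand_path(entry, base) }
end

# Demonstration in a temporary directory standing in for profiles/.
Dir.mktmpdir do |dir|
  hal = File.join(dir, "my_kos.hal")
  write_hal(hal, %w[K00001 K00002])
  puts resolve_hal(hal)
end
```

A custom list written this way would be placed in the profiles directory and passed with --profile, so that the relative entries resolve to the downloaded .hmm files.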

Options

  • -o FILE
    • Results are written to FILE. Defaults to stdout.
  • -p, --profile=PROFILE
    • Use PROFILE as a profile database. See Profiles.
  • -k, --ko-list=FILE
    • Use FILE as a KO list.
  • --cpu=N
    • Set the number of hmmsearch processes started simultaneously to N. It defaults to 1 unless it is set in config.yml.
  • -c FILE
    • Use FILE as a config file instead of config.yml in the same directory as exec_annotation.
  • --tmp-dir=DIR
    • Use DIR as the temporary directory where hmmsearch results are stored. It will be created if it does not exist. Defaults to ./tmp.
  • -E, --e-value=VALUE
    • Require E-value to be smaller than or equal to VALUE. If not, an asterisk will not be added in detail format or the hit will not be reported in other formats.
  • -T, --threshold-scale=VALUE
    • The score thresholds are multiplied by VALUE. For example, with -T2 option, the thresholds become twice as strict.
  • -f, --format=FORMAT
    • Set the format of the output to FORMAT. The following formats are available.
    • detail
      • Default format. Gene name, assigned K number, threshold of the KO, hmmsearch score and E-value, and the definition of KO are shown. In addition, an asterisk '*' is added to the head of the line if the score is higher than the threshold.
    • detail-tsv
      • Tab separated values for detail format.
    • mapper
      • Format that can be used as KEGG Mapper input. Each line contains a gene name and an assigned K number separated by a tab. Here, an assigned K number represents a hit with a score above the predefined threshold. Note that predefined score thresholds are not available for some KOs that are represented by very few sequences in KEGG GENES.
    • mapper-oneline
      • Similar to mapper, but when more than one KO is assigned to a gene, all assigned KOs are shown on one line separated by tabs.
  • --[no-]report-unannotated
    • With --report-unannotated option, gene names are shown even when no KO is assigned (default when --format=mapper(-oneline)). With --no-report-unannotated such genes are not shown at all (default when --format=detail).
  • --create-alignment
    • hmmsearch's normal outputs per profile are stored in the temporary directory. In addition, domain information and alignments in the outputs will be rearranged per query.
    • Not compatible with --reannotation
  • -r, --reannotation
    • Skip hmmsearch and assume that hmmsearch outputs are already in the temporary directory. This will help you to make an output in a different format or redo annotation changing thresholds.
    • Not compatible with --create-alignment
  • -h, --help
    • Show brief help message.
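
The mapper and mapper-oneline formats described above are easy to post-process. The following minimal sketch is not part of KofamScan: parse_mapper is a hypothetical helper that assumes tab-separated lines with the gene name first, followed by zero or more K numbers.

```ruby
# Hypothetical helper (not part of KofamScan): read mapper or mapper-oneline
# output into a hash of gene name => list of assigned K numbers.
def parse_mapper(text)
  assignments = Hash.new { |hash, gene| hash[gene] = [] }
  text.each_line do |line|
    gene, *kos = line.chomp.split("\t")
    next if gene.nil? || gene.empty?

    # Accessing the key also records genes reported without any KO.
    assignments[gene].concat(kos)
  end
  assignments
end

mapper_text = "gene1\tK00001\ngene1\tK00002\ngene2\n"
parse_mapper(mapper_text).each do |gene, kos|
  puts "#{gene}: #{kos.empty? ? '(unannotated)' : kos.join(', ')}"
end
```

Because repeated gene names are merged, the same helper handles both the one-pair-per-line layout of mapper and the single-line layout of mapper-oneline.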

config.yml

The following variables can be set by config.yml.

  • profile
    • Path to KOfam profiles.
    • --profile option takes precedence.
  • ko_list
    • Path to the KO list of KOfam.
    • --ko-list option takes precedence.
  • cpu
    • Number of hmmsearch processes started simultaneously.
    • --cpu option takes precedence.
  • hmmsearch
    • Path to the hmmsearch executable. If not given, it will be searched for in PATH.
  • parallel
    • Path to the parallel executable. If not given, it will be searched for in PATH.
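
A config.yml with these variables can be written by hand or generated. The sketch below is not part of KofamScan: build_config is a hypothetical helper, and the paths in the usage line are placeholders to be replaced with the location of your KOfam download.

```ruby
require "yaml"

# Hypothetical helper (not part of KofamScan): build the text of a config.yml.
# Keys are the variables listed above; settings left as nil are omitted.
def build_config(profile: nil, ko_list: nil, cpu: nil, hmmsearch: nil, parallel: nil)
  settings = {
    "profile" => profile,
    "ko_list" => ko_list,
    "cpu" => cpu,
    "hmmsearch" => hmmsearch,
    "parallel" => parallel
  }
  settings.reject { |_key, value| value.nil? }.to_yaml
end

# Placeholder paths: adjust them to your own setup.
puts build_config(profile: "/path/to/kofam/profiles", ko_list: "/path/to/kofam/ko_list", cpu: 4)
```

Save the printed text as config.yml next to exec_annotation, or save it elsewhere and point to it with -c FILE.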

License

This software is released under the MIT License.
