INTRODUCTION

GI-Cluster is a program for detecting genomic islands (GIs) in a genome by consensus clustering on multiple features. It includes a sets of scripts for extracting GI-related features from a genome sequence, performing consensus clustering on the obtained feature matrix to get potential GIs, and visualizing the predictions.

Author: Bingxin Lu Affiliation : National University of Singapore E-mail : bingxin@comp.nus.edu.sg

CITATION

Bingxin Lu and HonWai Leong (2018). GI-Cluster: detecting genomic islands via consensus clustering on multiple features. Journal of bioinformatics and computational biology, 16(03), 1840010.

Content of GI-Cluster program

clustering -- scripts to perform clustering
evalution -- data and script for evaluating different GI prediction tools
feature -- scripts to extract GI-related features
postprocess -- scripts for postprocessing
segmentation -- scripts to segment large genome sequences into short intervals
util -- scripts of general usage
visualization -- scripts for visualizing predictions
Gene_Prediction.sh -- scripts for predicting genes from genome sequences
GI_Clustering.R -- scripts for running Consensus Clustering on genomic regions with extracted GI-related features
GI_Comparison.sh -- scripts for visualizing predictions from different methods
GI_Feature.sh -- scripts for extracting features related to genomic islands in the unit of genes
GI_Segmentation.sh -- scripts for splitting a genome sequence into a set of segments
GI-Cluster.sh -- the main program
Segment_Feature.sh -- scripts for extracting features related to genomic islands in a genomic region
install.sh -- sample commands to insall external tools and packages
README.md -- this file

SYSTEM REQUIREMENTS

Unix-based systems.

Since GI-Cluster uses bash scripts to connect each step, it is not convenient to run the program on Windows system.

SOFTWARE REQUIREMENTS

GI-Cluster is written with Python 2.7, R and bash. There are several external tools and packages that GI-Cluster depends on. See install.sh for sample commands.

Prebuilt database

PFAM
RFAM
PHAST
VFDB
CARD
Download CARD Data on the webpage https://card.mcmaster.ca/download.
COG

External tools

Required

CodonW -- codon analysis
Blast -- database search
Hmmmer -- database search
Circos -- visualization
Infernal cmscan -- ncRNA prediction
tRNAscan-SE -- tRNA prediction
Repseek -- repeat detection
COG software

Optional

Prodigal -- gene prediction
MJSD -- genome segmentation
GI-SVM -- genome segmentation
Alienhunter -- genome segmentation

Please follow related documenation if you want to use one of these programs.

Python packages to install

scipy -- used for computing chisquare

R packages to install

Required R packages should be installed automatically. If problems occur, please install manually.

INSTALLATION

Download the source code of GI-Cluster and decompress.
Install required tools and packages which are described above.

USAGE

INPUT FILE

A FASTA file of genome sequence(s) (ended by '.fna', e.g. NC_010161.fna)

Input when using annotation files from NCBI ("$organism" refers to the name of the organism of interest):

The required old NCBI files include:

"$organism".fna, genomic DNA sequence,
"$organism".ffn, gene sequence -- used for analyzing sequence composition,
"$organism".faa, protein sequence -- used for analyzing gene function

The required new NCBI files include:

"$organism"_genomic.fna, genomic DNA sequence,
"$organism"_cds_from_genomic.fna, gene sequence -- used for analyzing sequence composition,
"$organism"_protein.faa, protein sequence -- used for analyzing gene function

Input when using custom annotation files:

The required files include:

"$organism".fna, genomic DNA sequence,
"$organism".ffn, gene sequence -- used for analyzing sequence composition,
"$organism".faa, protein sequence -- used for analyzing gene function
"$organism".glist, gene locations -- a tsv file with 4 columns: ID, start, end, strand (e.g. 1 25 500 F)

Suppose the header of "$organism".ffn has a format as: ">C_RS25945 | C_RS25945 | tRNA/rRNA methyltransferase | 25:711 Reverse", then one can use the following command to get "$organism".glist:
less "$organism".ffn | grep '^>' | cut -d'|' -f4 | sed 's/:/ /g' | awk '{if($3=="Forward") lcol="F"; else lcol="R"; print NR,$1,$2,lcol;}' | sed 's/ /\t/g' > "$organism".glist

OUTPUT FILE

There are multiple folders and files generated by the program.

Folder $pred_prog -- folders containing gene predictions, gene features and predictions of genomic islands
Folder $seg_prog -- folders containing segmentation results and segment features
Folder boundary -- files containing predicted tRNAs and repeats in a genome

The predictions of genomic islands are obtained by consensus clustering on multiple features. They depend on several parameters, mainly including feature (the type of features to use for clustering, including "gc", "codon", "kmer", "content", "gc_kmer", "comp", "comp_content"), method (the clustering method), pFeature (the percentage of features to use for clustering), and rep (the number of replicated clusterings). See GI_Clustering.R for all the parameters used in consensus clustering. The corresponding GI predictions are in folders named after the four most import parameters. Suppose feature=comp_content, method=average, pFeature=1, rep=1, then:

When running gene prediction, the directory including the final GI candidates is $output_dir/$pred_prog/$seg_prog/$feature/$method/$pFeature/$rep.
When not running gene prediction, the directory including the final GI candidates is $output_dir/unannotated/$seg_prog/$feature/$method/$pFeature/$rep.
The file name including the final GI candidates is merged_"$organism"_refined_GI.

Running

The programs can be called by following commands.

Run on complete genome with gene predictions

gnome=BtribCIP105476   
organism=NC_010161  
pred_prog=prodigal  
seg_prog=window 
output_dir=/home/b/bingxin/genome/$gnome  
prog_dir=/home/b/bingxin/GI-Cluster 
nohup /usr/bin/time sh $prog_dir/GI-Cluster.sh -s $prog_dir -o $output_dir -n "$organism" -m $seg_prog -p $pred_prog -d 16 > std_"$seg_prog"_"$pred_prog" 2>&1 &

Run on complete genome without gene predictions

gnome=BtribCIP105476  
organism=NC_010161  
pred_prog=none  
seg_prog=window 
output_dir=/home/b/bingxin/genome/$gnome  
prog_dir=/home/b/bingxin/GI-Cluster 
nohup /usr/bin/time sh $prog_dir/GI-Cluster.sh -s $prog_dir -o $output_dir -n "$organism" -m $seg_prog -p $pred_prog -d 16 -t 0 > std_"$seg_prog"_"$pred_prog"_unannotated 2>&1 &

Run on incomplete genome with gene predictions

gnome=Vibrio_cholerae_RC9_uid55789  
organism=NZ_ACHX00000000  
pred_prog=prodigal  
seg_prog=window 
output_dir=/home/b/bingxin/genome/incomplete/$gnome 
prog_dir=/home/b/bingxin/GI-Cluster 
nohup /usr/bin/time sh $prog_dir/GI-Cluster.sh -s $prog_dir -o $output_dir -n "$organism" -m $seg_prog -p $pred_prog -d 16 -b 1 > std_"$seg_prog"_"$pred_prog" 2>&1 &

Run on incomplete genome without gene predictions

gnome=Vibrio_cholerae_RC9_uid55789  
organism=NZ_ACHX00000000  
pred_prog=none  
seg_prog=window 
output_dir=/home/b/bingxin/genome/incomplete/$gnome 
prog_dir=/home/b/bingxin/GI-Cluster 
nohup /usr/bin/time sh $prog_dir/GI-Cluster.sh -s $prog_dir -o $output_dir -n "$organism" -m $seg_prog -p $pred_prog -d 16 -b 1 -t 0 > std_"$seg_prog"_"$pred_prog"_unannotated 2>&1 &

When using the old NCBI annotation files, please use option ""-p ncbi_old", namely pred_prog=ncbi_old.

Run on complete genome with custom gene predictions

gnome=BtribCIP105476   
organism=NC_010161  
pred_prog=custom  
seg_prog=window 
output_dir=/home/b/bingxin/genome/$gnome  
prog_dir=/home/b/bingxin/GI-Cluster 
nohup /usr/bin/time sh $prog_dir/GI-Cluster.sh -s $prog_dir -o $output_dir -n "$organism" -m $seg_prog -p $pred_prog -d 16 > std_"$seg_prog"_"$pred_prog" 2>&1 &

Note:

Suppose the input files are in $output_dir, the names of the input genome file should be "$organism".fna.

When "pred_prog=ncbi_old", the names of the input file should be "$organism".fna, "$organism".faa, "$organism".ffn.
When "pred_prog=ncbi", the names of the input file should be "$organism".fna, "$organism"_protein.faa, "$organism"_cds_from_genomic.fna.

In term of running time, GI-Cluster is fast in most steps, except database searching and consensus clustering. It may take a long time to find novel genes when searching against COG databases.

There are multiple intermediate files generated. If you suspect some files are not correct, please delete them and rerun GI_Cluster.sh. Or else, the program will assume the current files are correct and go on to the next steps.

icelu / GI_Cluster