GI-Cluster is a program for detecting genomic islands (GIs) in a genome by consensus clustering on multiple features. It includes a set of scripts for extracting GI-related features from a genome sequence, performing consensus clustering on the resulting feature matrix to obtain potential GIs, and visualizing the predictions.
Author: Bingxin Lu
Affiliation: National University of Singapore
E-mail: bingxin@comp.nus.edu.sg
Bingxin Lu and HonWai Leong (2018). GI-Cluster: detecting genomic islands via consensus clustering on multiple features. Journal of Bioinformatics and Computational Biology, 16(03), 1840010.
- clustering -- scripts to perform clustering
- evalution -- data and script for evaluating different GI prediction tools
- feature -- scripts to extract GI-related features
- postprocess -- scripts for postprocessing
- segmentation -- scripts to segment large genome sequences into short intervals
- util -- scripts of general usage
- visualization -- scripts for visualizing predictions
- Gene_Prediction.sh -- scripts for predicting genes from genome sequences
- GI_Clustering.R -- scripts for running Consensus Clustering on genomic regions with extracted GI-related features
- GI_Comparison.sh -- scripts for visualizing predictions from different methods
- GI_Feature.sh -- scripts for extracting features related to genomic islands in the unit of genes
- GI_Segmentation.sh -- scripts for splitting a genome sequence into a set of segments
- GI-Cluster.sh -- the main program
- Segment_Feature.sh -- scripts for extracting features related to genomic islands in a genomic region
- install.sh -- sample commands to install external tools and packages
- README.md -- this file
Unix-based systems.
Since GI-Cluster uses bash scripts to connect each step, it is not convenient to run the program on Windows systems.
GI-Cluster is written in Python 2.7, R and bash. It depends on several external tools and packages. See install.sh for sample installation commands.
- CodonW -- codon analysis
- BLAST -- database search
- HMMER -- database search
- Circos -- visualization
- Infernal cmscan -- ncRNA prediction
- tRNAscan-SE -- tRNA prediction
- Repseek -- repeat detection
- COG software
- Prodigal -- gene prediction
- MJSD -- genome segmentation
- GI-SVM -- genome segmentation
- Alienhunter -- genome segmentation
Please follow the related documentation if you want to use one of these programs.
- scipy -- used for computing the chi-square statistic
Required R packages should be installed automatically. If problems occur, please install manually.
- Download the source code of GI-Cluster and decompress.
- Install required tools and packages which are described above.
A FASTA file of genome sequence(s) with the extension '.fna' (e.g. NC_010161.fna)
Input when using annotation files from NCBI ("$organism" refers to the name of the organism of interest):
- The required old NCBI files include:
- "$organism".fna, genomic DNA sequence,
- "$organism".ffn, gene sequence -- used for analyzing sequence composition,
- "$organism".faa, protein sequence -- used for analyzing gene function
- The required new NCBI files include:
- "$organism"_genomic.fna, genomic DNA sequence,
- "$organism"_cds_from_genomic.fna, gene sequence -- used for analyzing sequence composition,
- "$organism"_protein.faa, protein sequence -- used for analyzing gene function
Input when using custom annotation files:
- The required files include:
- "$organism".fna, genomic DNA sequence,
- "$organism".ffn, gene sequence -- used for analyzing sequence composition,
- "$organism".faa, protein sequence -- used for analyzing gene function
- "$organism".glist, gene locations -- a tsv file with 4 columns: ID, start, end, strand (e.g. 1 25 500 F)
Suppose the header of "$organism".ffn has a format as:
">C_RS25945 | C_RS25945 | tRNA/rRNA methyltransferase | 25:711 Reverse",
then one can use the following command to get "$organism".glist:
grep '^>' "$organism".ffn | cut -d'|' -f4 | sed 's/:/ /' | awk 'BEGIN{OFS="\t"} {strand = ($3=="Forward") ? "F" : "R"; print NR,$1,$2,strand}' > "$organism".glist
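To make the expected output concrete, here is a minimal sketch of the same conversion wrapped in a function and run on a tiny example. The function name `ffn_to_glist` and the second FASTA record are made up for illustration and are not part of GI-Cluster; the sketch assumes every header ends with `| start:end Forward` or `| start:end Reverse`, as in the example header above.

```shell
# Hypothetical helper (not part of GI-Cluster). Assumes every header ends with
# "| start:end Forward" or "| start:end Reverse", as in the example above.
ffn_to_glist() {
    grep '^>' "$1" | awk -F'|' 'BEGIN { OFS = "\t" }
    {
        loc = $NF                 # last field, e.g. " 25:711 Reverse"
        sub(/:/, " ", loc)        # becomes " 25 711 Reverse"
        split(loc, part, " ")     # part[1]=start, part[2]=end, part[3]=strand
        strand = (part[3] == "Forward") ? "F" : "R"
        print NR, part[1], part[2], strand
    }'
}

# Demo on a two-gene file (the second record is made up).
demo=$(mktemp)
printf '%s\n' \
    '>C_RS25945 | C_RS25945 | tRNA/rRNA methyltransferase | 25:711 Reverse' \
    'ATGAAA' \
    '>C_RS25950 | C_RS25950 | hypothetical protein | 800:1200 Forward' \
    'ATGCCC' > "$demo"
ffn_to_glist "$demo"
rm -f "$demo"
```

The demo prints one tab-separated row per gene: ID, start, end, strand. Taking the last `|`-delimited field rather than the fourth keeps the sketch working if a product description happens to contain a `|`.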
There are multiple folders and files generated by the program.
- Folder $pred_prog -- folders containing gene predictions, gene features and predictions of genomic islands
- Folder $seg_prog -- folders containing segmentation results and segment features
- Folder boundary -- files containing predicted tRNAs and repeats in a genome
The predictions of genomic islands are obtained by consensus clustering on multiple features. They depend on several parameters, chiefly feature (the type of features to use for clustering, including "gc", "codon", "kmer", "content", "gc_kmer", "comp", "comp_content"), method (the clustering method), pFeature (the percentage of features to use for clustering), and rep (the number of replicated clusterings). See GI_Clustering.R for all the parameters used in consensus clustering. The corresponding GI predictions are written to folders named after these four most important parameters. For example, with feature=comp_content, method=average, pFeature=1, rep=1:
- When running gene prediction, the directory containing the final GI candidates is $output_dir/$pred_prog/$seg_prog/$feature/$method/$pFeature/$rep.
- When not running gene prediction, the directory containing the final GI candidates is $output_dir/unannotated/$seg_prog/$feature/$method/$pFeature/$rep.
- The file containing the final GI candidates is named merged_"$organism"_refined_GI.
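The path layout above can be assembled in a small helper, for instance to locate the result file from a downstream script. This is only a sketch: the function name `gi_result_file` and the `/data/out` directory are hypothetical, and treating `pred_prog=none` as the "not running gene prediction" case is an assumption based on the example commands, not something GI-Cluster guarantees.

```shell
# Hypothetical helper (not part of GI-Cluster): print the expected path of the
# final GI candidate file for a given set of parameters.
# The body runs in a subshell, so the variables do not leak into the caller.
gi_result_file() (
    output_dir=$1; organism=$2; pred_prog=$3; seg_prog=$4
    feature=$5; method=$6; pFeature=$7; rep=$8

    # Assumption: pred_prog=none means no gene prediction was run.
    top=$pred_prog
    if [ "$pred_prog" = "none" ]; then
        top=unannotated
    fi

    printf '%s/%s/%s/%s/%s/%s/%s/merged_%s_refined_GI\n' \
        "$output_dir" "$top" "$seg_prog" "$feature" "$method" \
        "$pFeature" "$rep" "$organism"
)

gi_result_file /data/out NC_010161 prodigal window comp_content average 1 1
# prints /data/out/prodigal/window/comp_content/average/1/1/merged_NC_010161_refined_GI
```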
The program can be run with the following commands.
gnome=BtribCIP105476
organism=NC_010161
pred_prog=prodigal
seg_prog=window
output_dir=/home/b/bingxin/genome/$gnome
prog_dir=/home/b/bingxin/GI-Cluster
nohup /usr/bin/time sh $prog_dir/GI-Cluster.sh -s $prog_dir -o $output_dir -n "$organism" -m $seg_prog -p $pred_prog -d 16 > std_"$seg_prog"_"$pred_prog" 2>&1 &
gnome=BtribCIP105476
organism=NC_010161
pred_prog=none
seg_prog=window
output_dir=/home/b/bingxin/genome/$gnome
prog_dir=/home/b/bingxin/GI-Cluster
nohup /usr/bin/time sh $prog_dir/GI-Cluster.sh -s $prog_dir -o $output_dir -n "$organism" -m $seg_prog -p $pred_prog -d 16 -t 0 > std_"$seg_prog"_"$pred_prog"_unannotated 2>&1 &
gnome=Vibrio_cholerae_RC9_uid55789
organism=NZ_ACHX00000000
pred_prog=prodigal
seg_prog=window
output_dir=/home/b/bingxin/genome/incomplete/$gnome
prog_dir=/home/b/bingxin/GI-Cluster
nohup /usr/bin/time sh $prog_dir/GI-Cluster.sh -s $prog_dir -o $output_dir -n "$organism" -m $seg_prog -p $pred_prog -d 16 -b 1 > std_"$seg_prog"_"$pred_prog" 2>&1 &
gnome=Vibrio_cholerae_RC9_uid55789
organism=NZ_ACHX00000000
pred_prog=none
seg_prog=window
output_dir=/home/b/bingxin/genome/incomplete/$gnome
prog_dir=/home/b/bingxin/GI-Cluster
nohup /usr/bin/time sh $prog_dir/GI-Cluster.sh -s $prog_dir -o $output_dir -n "$organism" -m $seg_prog -p $pred_prog -d 16 -b 1 -t 0 > std_"$seg_prog"_"$pred_prog"_unannotated 2>&1 &
When using the old NCBI annotation files, please use the option "-p ncbi_old", namely pred_prog=ncbi_old.
gnome=BtribCIP105476
organism=NC_010161
pred_prog=custom
seg_prog=window
output_dir=/home/b/bingxin/genome/$gnome
prog_dir=/home/b/bingxin/GI-Cluster
nohup /usr/bin/time sh $prog_dir/GI-Cluster.sh -s $prog_dir -o $output_dir -n "$organism" -m $seg_prog -p $pred_prog -d 16 > std_"$seg_prog"_"$pred_prog" 2>&1 &
The input files are assumed to be in $output_dir, and the input genome file should be named "$organism".fna.
- When "pred_prog=ncbi_old", the input files should be named "$organism".fna, "$organism".faa, "$organism".ffn.
- When "pred_prog=ncbi", the input files should be named "$organism".fna, "$organism"_protein.faa, "$organism"_cds_from_genomic.fna.
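Because a run can take a long time, it may help to confirm that the expected input files exist before starting. The sketch below follows the file names listed directly above, plus the four custom annotation files described earlier. The function names `required_inputs` and `check_inputs` are hypothetical, and the assumption that every other `pred_prog` value (such as `prodigal` or `none`) needs only the `.fna` file is mine, not documented behavior.

```shell
# Hypothetical helpers (not part of GI-Cluster).

# Print the input file names expected for a given organism and pred_prog.
required_inputs() (
    organism=$1; pred_prog=$2
    case "$pred_prog" in
        ncbi_old) set -- .fna .faa .ffn ;;
        ncbi)     set -- .fna _protein.faa _cds_from_genomic.fna ;;
        custom)   set -- .fna .ffn .faa .glist ;;
        *)        set -- .fna ;;   # assumption: only the genome is needed
    esac
    for suffix in "$@"; do
        printf '%s%s\n' "$organism" "$suffix"
    done
)

# Report missing or empty files on stderr; succeed only if all are present.
# File names are assumed to contain no whitespace.
check_inputs() (
    dir=$1; organism=$2; pred_prog=$3
    missing=0
    for f in $(required_inputs "$organism" "$pred_prog"); do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    [ "$missing" -eq 0 ]
)

# Usage, with the variables from the example commands:
#   check_inputs "$output_dir" "$organism" "$pred_prog" && echo "inputs look complete"
```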
In terms of running time, GI-Cluster is fast in most steps, except database searching and consensus clustering. It may take a long time to find novel genes when searching against COG databases.
Multiple intermediate files are generated. If you suspect some files are not correct, please delete them and rerun GI-Cluster.sh. Otherwise, the program will assume the existing files are correct and proceed to the next steps.