ISCompare: a new tool for the identification of differentially located insertion sequences on bacteria
Bacterial genomes are composed by a core and an accessory genome. The first composed of housekeeping and essential genes, while the second is composed, in its majority, of mobile genetic elements, including transposable elements (Tes). Insertion sequences (ISs) are the smallest TEs formed by imperfect terminal inverted repeats (IRs) and a transposase. ISs are relevant because they have an important role in genome evolution, and contribute to bacterial genome plasticity and adaptability. ISs can spread in a genome, presenting different locations in nearly related strains, and producing phenotypic variations. We developed ISCompare to profile IS mobilization events in related bacterial strains. ISCompare uses blastn to look for ISs on a query genome assembly, extracts the IS flanks and maps them to the reference genome. After filtering and analysis steps, a list of differentially located ISs is reported. ISCompare was validated using artificial genomes with simulated random IS insertions and real sequences from Escherichia coli, Pseudomonas aeruginosa, Bordetella pertussis and Ensifer meliloti. In the first case, ISCompare performed very well, achieving high precision (100%) and sensitivity (94%). For real genomes a precision greater than 89% in average was observed, false positive arising mostly from consecutive IS insertions and repeated sequences. Finally we compared ISCompare with ISSeeker, achieving the same or better results, with the advantage that ISCompare can analyse multiple ISs at the same time, and direclty reports list of candidate DLIS. We think that ISCompare provides an easy and straightforward approach to look for differentially located ISs on bacterial genomes.
Paper at G3 Genes|Genomes|Genetics: link
Paper at BioRxiv: link
Protocol at protocols.io: link
ISCompare is implemented on python and has the following aditional requirements:
- Biopython (Tested with Biopython > v1.76)
- DNA_features_viewer (Tested with v3.0.3)
- Pandas (Tested with version > v1.0.3)
- Numpy (Tested with version > v1.18.4)
- mechanize (Tested with version > v0.4.5)
For blast to run correctly some environment variables need to be created:
- A modified path environment variable to indicate the location of installed blast+ programs
- A new BLASTDB environment variable as pointer to database location, with "e.g. blast_install_dir\db" as its value
- A new BLASTDB_LMDB_MAP_SIZE, with 1000000 as its value (needed to optimize makeblastdb operation when creating new database files)
If a low memory error is reported while running ISCompare with the -p option, then python 64bit should be installed.
Biopython: If you install biopython with pipwin, the error 'Bio module not found' could appear. In that case, go to you python install folder, \Lib\site-packages and verify that Biopython module folder is 'Bio' and not 'bio'.
To install ISCompare simply download ISCompare folder and install all the required third party modules and programs.
To run ISCompare open a terminal and type:
./ISCompare.py -h
usage: ISCompare.py [-h] [-q QFILES [QFILES ...]] [-r RFILES [RFILES ...]]
[-i ISFILE] [-I] [-Q QACC [QACC ...]] [-R RACC [RACC ...]]
[-e YOUREMAIL] [-E EVALUE] [-o OUTPUT] [-x]
[-l MINALNLENGTH] [-m MINLENGTH] [-s SURROUNDINGLEN]
[-s2 SURROUNDINGLEN2] [-S SHIFT] [-d ISDIFF]
[-f SCAFFOLDDIFF] [-p] [-c] [-rs]
ISCompare is a program designed to look for and compare insertion sequence
position between a query and a reference genomes (or WGS Assembly)..
optional arguments:
-h, --help show this help message and exit
-q QFILES [QFILES ...], --QueryFiles QFILES [QFILES ...]
Query genome in Genbank format...
-r RFILES [RFILES ...], --RefFiles RFILES [RFILES ...]
Reference genome in Genbank format...
-i ISFILE, --ISfile ISFILE
IS database file [Multifasta nucleotide file]...
-I, --ISscan Scan IS on query and reference genomes using ISFinder
[Generates IS.fna file in results folder]...
-Q QACC [QACC ...], --QueryACC QACC [QACC ...]
Accession numbers for the query genome to download
from NCBI.
-R RACC [RACC ...], --RefACC RACC [RACC ...]
Accession numbers for the reference genome to download
from NCBI.
-e YOUREMAIL, --email YOUREMAIL
User email. Required for accession number download
mode.
-E EVALUE, --evalue EVALUE
E-value cutoff for blastn search.
-o OUTPUT, --OutputDir OUTPUT
Output folder.
-x, --IDScaffolds OMIT the remove identical scaffolds steps.
-l MINALNLENGTH, --minAlnLength MINALNLENGTH
Minimal required length of the alignment of the QIFs
to the reference genome.
-m MINLENGTH, --minLength MINLENGTH
Minimum QIFS length to be considered in the analysis.
For complete genomes it should be < 2*surroundingLen.
For genome assemblies with scaffolds/contigs it should
be < surroundingLen.
-s SURROUNDINGLEN, --surroundingLen SURROUNDINGLEN
Nucleotides to extract from upstream and downstream IS
blast hits [QIFs].
-s2 SURROUNDINGLEN2, --surroundingLen2 SURROUNDINGLEN2
Nucleotides to extract from QIFs blast hits [RAFs].
-S SHIFT, --shift SHIFT
Shifts the start and end of the QIFs an specified
number of nucleotides from the IS.
-d ISDIFF, --ISdiff ISDIFF
ISdiff: minimal difference in nucloetids between qlen
and IS alingnment length (discard IS scaffolds).
-f SCAFFOLDDIFF, --scaffoldDiff SCAFFOLDDIFF
scaffoldDiff --> maximal difference in nucloetides for
two scaffolds to be considered identical.
-p, --plot Plot IS surroundings with genomic features in both
query and reference genomes...
-c, --clean Clean files...
-rs, --SLIS Report IS with the same location (SLIS)...
ISCompare requires query and reference genomes in genbank flat format, or its corresponding accession numbers, as input. An optional multifasta DNA file containing all the ISs sequences to be searched in the query and reference genomes can be supplied, otherwise use the -I option which will ISFinderBlast.py script to launch an IS search at ISFinder webpage (http://www-is.biotoul.fr) and download all the found IS sequences.
Optionally, a compilation of IS sequences from ISFinder database can be downloaded from here. ISCompare uses the following naming convention for ISs: ISname_ISgroup_ISfamily (Required for discrimination between related ISs).
To run with the default parameters type:
./ISCompare.py -q [query genbank file] -r [reference genbank file] -o [output folder] -c -I -p
ISsimulator.py script inserts a selected number of copies of the insertion sequence at randomly chosen genomic locations, and outputs modified genbank and fasta files, and a table with the location of the inserted ISs.
usage: ISsimulator.py [-h] -i INPUT -s IS [-o OUTPUT] [-dr DIRECTREPEATS]
[-n NINSERTIONS]
ISsimulator is a program designed to randomly insert a desired IS into a
provided genome.
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Genome in genbank format...
-s IS, --IS IS Insertion sequence.
-o OUTPUT, --Output OUTPUT
Output Genbank file name (e.g. name.gb).
-dr DIRECTREPEATS, --directRepeat DIRECTREPEATS
Direct repeat length [Default 0].
-n NINSERTIONS, --nisertions NINSERTIONS
Number of Is insertions [Default 30].
To introduce 30 IS at random positions in the query genome query.gb use:
./ISsimulator.py -i query.gb -s IS.fasta -o artificialGenome.gb -n 30
IS.fasta must contain only one IS sequence in fasta format.
- extractISbyName.py: This script can be used to extract a single IS from the IS database (multifasta file, IS.fna) by its name. Type
./extractISbyName.py -h
for usage information. - extractRefFeatures.py: This script can be used to extract all CDS features from a genbank file. Such file is required to run ISSeeker. Type
./extractRefFeatures.py -h
for usage information. - EMBL2GB.py: Converts EMBL genome files to genbank format. Type
./EMBL2GB.py -h
for usage information.