jrherr / mlst

Scan contig files against PubMLST typing schemes




Clone the repository to get the main Perl script and the included databases:

% cd $HOME
% git clone https://github.com/Victorian-Bioinformatics-Consortium/mlst.git

The only external dependency is BLAT, written by Jim Kent: http://genome.ucsc.edu/FAQ/FAQblat.html#blat3

% cd $HOME/mlst/bin

# Linux:
% wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/blat/blat
# MacOSX:
% wget http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.x86_64/blat

% chmod +x blat

Add it to your PATH:

% echo 'export PATH=$PATH:$HOME/mlst/bin' >> $HOME/.bashrc

Now log out, log back in, and test that it works:

% mlst -h
% blat
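If either command is not found, the PATH change has probably not taken effect. The following is a minimal sketch of a check, assuming a POSIX-compatible shell; `check_tool` is a hypothetical helper name and is not part of mlst.

```shell
# check_tool is a hypothetical helper, not part of mlst.
# It reports whether a command can be found on the current PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found on PATH"
        return 1
    fi
}

# Check both programs and print a hint for any that are missing.
for tool in mlst blat; do
    check_tool "$tool" || echo "  -> check that \$HOME/mlst/bin is in PATH"
done
```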



###Available schemes

To see which PubMLST schemes are supported:

% mlst --list

abaumannii achromobacter aeromonas afumigatus	cdifficile efaecium
haemophilus	hcinaedi hparasuis hpylori kpneumoniae leptospira
saureus xfastidiosa	yersinia ypseudotuberculosis yruckeri

The list above is abbreviated. The bundled database includes 106 schemes.

###Genotyping one genome

To genotype a FASTA file of contigs:

% mlst --scheme saureus USA300.fasta

FILE          SCHEME   ST  arcc  aroe  glpf  gmk_  pta_  tpi_  yqil
USA300.fasta  saureus  8   3     3     1     1     4     4     3

This produces tab-separated output, with a header line and one line for the FASTA file. The "ST" column is the MLST sequence type, and the subsequent columns are the allele variants for each of the loci within the scheme. In this case, the USA300 genome is Staphylococcus aureus ST8.

Only full-length, 100% identity matches to an allele are considered matches. If any alleles are not found, a "-" will be present in the allele column, as well as in the ST column.
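Because an unresolved ST is reported as "-", incomplete results can be picked out with standard text tools. The sketch below runs on literal sample rows and does not call mlst; the `example.fasta` row and its values are invented for illustration, and `untyped_files` is a hypothetical helper name.

```shell
# Sample rows in the tab-separated layout shown above.
# The example.fasta row is invented to illustrate a missing allele.
results=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    FILE          SCHEME  ST arcc aroe glpf gmk_ pta_ tpi_ yqil \
    USA300.fasta  saureus 8  3    3    1    1    4    4    3 \
    example.fasta saureus -  3    3    1    -    4    4    3)

# untyped_files (hypothetical helper): print the FILE column of every
# data row whose ST column (field 3) is "-".
untyped_files() {
    awk -F '\t' 'NR > 1 && $3 == "-" { print $1 }'
}

printf '%s\n' "$results" | untyped_files
```

On the sample rows this prints only `example.fasta`, since USA300.fasta has a resolved ST.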

###Genotyping many genomes

The software supports multiple genome FASTA files, even compressed ones.

% mlst --scheme efaecium VAN*.fna.gz

FILE     SCHEME    ST   AtpA  Ddl  Gdh  PurK  Gyd  PstS  Adk
VAN_219  efaecium  784  2     3    1    67    1    30    25
VAN_222  efaecium  784  2     3    1    67    1    30    25
VAN_327  efaecium  417  5     7    5    7     2    1     1
VAN_332  efaecium  38   2     5    3    5     4    4     1
VAN_335  efaecium  417  5     7    5    7     2    1     1
VAN_342  efaecium  22   2     3    1    2     1    1     1
VAN_345  efaecium  38   2     5    3    5     4    4     1
VAN_476  efaecium  785  2     3    1    67    1    86    1
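With many genomes it is often useful to tally how many share each ST. The following is a minimal sketch that uses the FILE, SCHEME and ST values from the table above as literal text, so it runs without mlst installed; `sample_rows` and `count_by_st` are hypothetical helper names.

```shell
# sample_rows prints the first three columns of the table above
# as tab-separated text (header line first).
sample_rows() {
    printf '%s\t%s\t%s\n' \
        FILE    SCHEME   ST \
        VAN_219 efaecium 784 \
        VAN_222 efaecium 784 \
        VAN_327 efaecium 417 \
        VAN_332 efaecium 38 \
        VAN_335 efaecium 417 \
        VAN_342 efaecium 22 \
        VAN_345 efaecium 38 \
        VAN_476 efaecium 785
}

# count_by_st skips the header, tallies column 3, and prints
# "ST<TAB>count" lines sorted numerically by ST.
count_by_st() {
    awk -F '\t' 'NR > 1 { n[$3]++ } END { for (st in n) print st "\t" n[st] }' | sort -n
}

sample_rows | count_by_st
```

For these eight genomes, ST38, ST417 and ST784 each occur twice, and ST22 and ST785 once. With real data the same function could presumably be fed directly from the tool, for example `mlst --scheme efaecium VAN*.fna.gz | count_by_st`.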

###Tweaking the output

The output is TSV (tab-separated values). This makes it easy to parse and manipulate with Unix utilities such as cut and sort. For example, if you only want the filename and ST you can do the following:

% mlst --scheme abaumannii AB*.fasta | cut -f1,3 > ST.tsv
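A two-column file like this can then be ordered by ST. The sketch below keeps the header line in place and sorts the remaining rows numerically; the sample rows are invented for illustration and `sort_by_st` is a hypothetical helper name.

```shell
# sort_by_st (hypothetical helper): read "FILE<TAB>ST" lines, print the
# header unchanged, then print the remaining rows sorted numerically by ST.
sort_by_st() {
    {
        IFS= read -r header
        printf '%s\n' "$header"
        sort -t "$(printf '\t')" -k2,2n
    }
}

# Invented sample rows in the same layout as ST.tsv.
sample_st() {
    printf '%s\t%s\n' \
        FILE        ST \
        AB_01.fasta 2 \
        AB_02.fasta 10 \
        AB_03.fasta 1
}

sample_st | sort_by_st
```

The numeric sort key matters here: a plain lexical sort would place ST10 before ST2.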

If you don't want the header line use the --noheader option:

% mlst --scheme leptospira --noheader L550.fa

If you prefer CSV because it loads more smoothly into MS Excel, use the --csv option:

% mlst --scheme hpylori --csv Peptobismol.fna.gz > mlst.csv


Please submit bug reports via the GitHub Issues page: https://github.com/Victorian-Bioinformatics-Consortium/mlst/issues




Torsten Seemann - http://vicbioinformatics.com/


License: GNU General Public License v2.0

