zheminzhou / KleTy

KleTy: Klebsiella genotyping for its core genome and plasmids

KleTy (Klebsiella typer for the core genome and plasmids)

KleTy is a tool to type Klebsiella genome assemblies for:

  • Core genome MLST (cgMLST) for detailed genotyping of the core genome
  • Hierarchical clusters (HierCC) that represent natural populations
  • Plasmid prediction and classification (PC)
  • Hypervirulence-associated loci
  • Antimicrobial resistance determinants

Citation

KleTy: integrated typing scheme for core genome and plasmids reveals repeated emergence of multi-drug resistant epidemic lineages in Klebsiella worldwide. Heng Li, Xiao Liu, Shengkai Li, Jie Rong, Shichang Xie, Yuan Gao, Ling Zhong, Quangui Jiang, Guilai Jiang, Yi Ren, Wanping Sun, Yuzhi Hong, Zhemin Zhou. medRxiv 2024.04.16.24305880; doi: https://doi.org/10.1101/2024.04.16.24305880


INSTALLATION:

KleTy was developed and tested in Python >=3.8. It depends on several Python libraries:

click
numba
numpy
pandas
biopython
pyarrow
fastparquet

All libraries can be installed using pip:

pip install click numba numpy pandas biopython pyarrow fastparquet

KleTy also calls NCBI-BLAST+:

ncbi-blast+

This can be installed via apt on Ubuntu:

sudo apt install -y ncbi-blast+

The whole environment can also be installed in conda:

conda create --name dty python==3.11
conda activate dty
conda install -c conda-forge biopython numba numpy pandas click pyarrow fastparquet
conda install -c bioconda blast

The installation process normally finishes in <10 minutes.

NOTE: Please make sure that both "makeblastdb" and "blastn" are in the PATH environment variable (i.e. they can be run without specifying their full location).
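The PATH requirement in the note above can be checked before the first run. The following is a minimal sketch using only the Python standard library; the function name find_missing_tools is hypothetical and not part of KleTy.

```python
import shutil

# The two BLAST+ executables named in the note above.
REQUIRED_TOOLS = ("makeblastdb", "blastn")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("makeblastdb and blastn are both on PATH.")
```

An empty list means both executables can be run without pointing to their actual location.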

When run for the first time, KleTy automatically downloads the reference plasmids from https://zenodo.org/records/12590507/files/plasmids.repr.fas.gz. This happens only once, but the file is fairly large (816 MB) and may take a long time to download.

Alternatively, if the download fails within the pipeline, download the file manually and copy it into "db/" under the KleTy folder. Then run, from within "db/":

gzip -d plasmids.repr.fas.gz
makeblastdb -in plasmids.repr.fas -dbtype nucl

to generate the required database.
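For the manual route, the download itself could be scripted. The sketch below streams the file to disk with the standard library and only moves it to its final name once the transfer completes, so an interrupted download does not leave a truncated file in "db/". The helper name download and the ".part" suffix are assumptions for illustration; the URL is the one given above.

```python
import shutil
import urllib.request
from pathlib import Path

# URL quoted from the installation notes above.
PLASMID_URL = "https://zenodo.org/records/12590507/files/plasmids.repr.fas.gz"


def download(url, dest):
    """Stream `url` to `dest`, writing to a temporary '.part' file first."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    partial = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        # Copy in 1 MB chunks so the 816 MB file is never held in memory.
        shutil.copyfileobj(response, out, length=1024 * 1024)
    partial.replace(dest)  # rename only after the transfer has finished
    return dest


# Usage (run from the KleTy folder), followed by the gzip/makeblastdb steps:
# download(PLASMID_URL, "db/plasmids.repr.fas.gz")
```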

Quick Start (with examples)

Get allelic and HierCC callings

$ cd /path/to/KleTy/
$ python KleTy.py -q examples/CP015990.fna

The whole calculation finishes in ~1 minute with 8 CPU threads (~2.5 minutes with one CPU thread). The screen output will look like:

07/04/2024 04:44:56 AM Running query: examples/CP015990.fna
07/04/2024 04:44:56 AM  Searching VF/STRESS genes...
07/04/2024 04:45:06 AM  Done.
07/04/2024 04:45:06 AM  Searching AMR genes...
07/04/2024 04:45:16 AM  Done.
07/04/2024 04:45:16 AM  Searching plasmids...
07/04/2024 04:45:51 AM  Done.
07/04/2024 04:45:51 AM  Running cgMLST...
07/04/2024 04:46:09 AM  Done.

The run produces two output files (see below for explanation):

CP015990.KleTy
CP015990.cgMLST.profile.gz

USAGE:

KleTy.py - allelic callings and HierCC clusters & species predictions

Usage: KleTy.py [OPTIONS]

Options:
  -q, --query TEXT        query genome in fasta or fastq format. May be
                          gzipped.
  --ql TEXT               a list of query files. One query per line.
  -o, --prefix TEXT       prefix for output. Only work when there is only one
                          query. default: query filename
  -n, --n_proc INTEGER    number of process to use. default: 8
  -f, --plasmid_fragment  flag to predict plasmid fragment sharing < 50% with
                          the reference
  -m, --skip_gene         flag to skip AMR/VF searching. default: False
  -g, --skip_cgmlst       flag to skip cgMLST. default: False
  -p, --skip_plasmid      flag to skip plasmid typing. default: False
  --help                  Show this message and exit.

Parameters:

Parameter Explanation
-q, --query Query genome. This can be in Fasta or Fastq format, and can be in plain text or GZIPped.
--ql A list of query files. One query genome (file location) per line. KleTy will run these queries one by one and concatenate the outputs together.
-o, --prefix Prefix for the outputs. There will be two files .KleTy and .cgMLST.profile.gz. Will use the prefix of the query file (or the ql file) if not specified.
-n, --n_proc Number of processes to use. Default: 8
-f, --plasmid_fragment Flag to predict less reliable plasmid fragments that share <50% (but >=30%) of the reference plasmid.
-m, --skip_gene Flag to skip AMR/VF searching. This step normally takes ~15 seconds.
-g, --skip_cgmlst Flag to skip cgMLST calling. This step normally takes ~20 seconds.
-p, --skip_plasmid Flag to skip plasmid prediction. This step normally takes ~30 seconds.
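For batch runs with --ql, the query list can be generated rather than typed by hand. The sketch below writes one genome path per line and assembles a command line from the options documented above. The helper names, the set of file suffixes, and the choice to write absolute paths are assumptions for illustration, not KleTy behaviour.

```python
from pathlib import Path

# Assumed assembly suffixes; adjust to your own file naming.
SUFFIXES = (".fna", ".fasta", ".fa", ".fna.gz", ".fasta.gz", ".fa.gz")


def write_query_list(genome_dir, list_path, suffixes=SUFFIXES):
    """Write one absolute genome path per line; return the paths written."""
    genomes = sorted(
        p.resolve()
        for p in Path(genome_dir).iterdir()
        if p.is_file() and p.name.endswith(tuple(suffixes))
    )
    Path(list_path).write_text("".join(f"{p}\n" for p in genomes))
    return genomes


def klety_command(list_path, n_proc=8, skip_plasmid=False):
    """Build an argument list for a batch run using documented options only."""
    cmd = ["python", "KleTy.py", "--ql", str(list_path), "-n", str(n_proc)]
    if skip_plasmid:
        cmd.append("-p")
    return cmd
```

The returned list can be passed to subprocess.run from within the KleTy folder, for example subprocess.run(klety_command("queries.txt"), check=True).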

Outputs:

KleTy generates:

<prefix>.KleTy

.KleTy contains the genotyping results

$ cat CP015990.KleTy
INPUT   REPLICON        SPECIES HC1360.500.200.100.50.20.10.5.2 REFERENCE       PLASTYPE        COVERAGE        AMR:AMINOGLYCOSIDE      AMR:BETA-LACTAM AMR:CARBAPENEM  AMR:ESBL        AMR:INHIBITOR-RESISTANT AMR:COLISTIN    AMR:FOSFOMYCIN  AMR:MACROLIDE   AMR:PHENICOL    AMR:QUINOLONE   AMR:RIFAMYCIN   AMR:GLYCOPEPTIDES       AMR:SULFONAMIDE AMR:TETRACYCLINE        AMR:TIGECYCLINE AMR:TRIMETHOPRIM        AMR:BLA_INTRINSIC       STRESS:COPPER   STRESS:MERCURY  STRESS:NICKEL   STRESS:SILVER  STRESS:TELLURIUM STRESS:ARSENIC  STRESS:FLUORIDE STRESS:QUATERNARY_AMMONIUM      VIRULENCE:clb   VIRULENCE:iro   VIRULENCE:iuc   VIRULENCE:rmp   VIRULENCE:ybt   Others  REPLICON:INC_TYPE       REPLICON:MOB_TYPE       REPLICON:MPF_TYPE       ANNOTATION      CONTIGS
examples/CP015990.fna   ALL     Klebsiella_pneumoniae   10.10.10.10.ND.ND.ND.ND.ND      KLE_DA0156AA_AS -       -       aac(3)-IId^,aac(6')-Ib-cr.v2^,aadA16*   OXA-1   KPC-2   -       -       -       -       mphA    catB3.v2        GyrA-83F,GyrA-87A,ParC-80I,qnrA3^       arr-3   -       sul1    -       -       dfrA27  SHV-28^ -       merA,merE,merR_Ps,merT  -       -       -       -       -       qacEdelta1      -       -       -       -       fyuA_26,irp1_275,irp2_30,ybtA_78,ybtE_58,ybtP_75,ybtQ_88,ybtS_115,ybtT_26,ybtU_129,ybtX_73      -       IncR    -       MPF_T   -       -
examples/CP015990.fna   P1      -       -       CP059309.1      PT_361,PC_361   84.9    aac(6')-Ib-cr.v2^,aadA16*       OXA-1   KPC-2   -       -       -       -       mphA    catB3.v2        qnrA3^  arr-3   -       sul1    -       -       dfrA27 --       merA,merE,merR_Ps,merT  -       -       -       -       -       qacEdelta1      -       -       -       -       -       -       IncR    -       -       Klebsiella_pneumoniae_strain_Kp46596_plasmid_pKp46596-3,_complete_sequence      CP015991.1
examples/CP015990.fna   Others  -       -       -       -       -       aac(3)-IId^     -       KPC-2   -       -       -       -       mphA    -       GyrA-83F,GyrA-87A,ParC-80I      -       -       -       -       -       -       SHV-28^ -      --       -       -       -       -       -       -       -       -       -       fyuA_26,irp1_275,irp2_30,ybtA_78,ybtE_58,ybtP_75,ybtQ_88,ybtS_115,ybtT_26,ybtU_129,ybtX_73      -       -       -       MPF_T   -       -

The columns are:

Column Explanation
INPUT Filename of the input. Used to recognize query assemblies
REPLICON Type of the replicon. It can be: "ALL" - A summary of the query. "P" followed by a number (e.g. "P1") - One plasmid per row. "Others" - Summary of the AMR/VF genes that are not in plasmids (likely carried by the chromosome).
SPECIES Species designation of the query, inferred based on its cgMLST profile. Will not be reported with '-g'.
HC1360.500.200.100.50.20.10.5.2 HierCC cluster designation of the query based on the cgMLST profile. HC1360 is approximately equivalent to a clonal complex (CC) in MLST. Lower HC levels are used for sub-population clustering. The numbers after HC indicate the criteria of the single-linkage clustering. Will not be reported with '-g'.
REFERENCE Accession of the reference for predicted plasmid. Will not be reported with '-p'.
PLASTYPE PT (plasmid type) and PC (plasmid cluster) of the predicted plasmid. Will not be reported with '-p'.
COVERAGE Coverage of the plasmid to the reference. Will not be reported with '-p'.
AMR:AMINOGLYCOSIDE Predicted genes/mutations encoding resistance to AMINOGLYCOSIDE.
AMR:BETA-LACTAM Predicted genes/mutations encoding resistance to BETA-LACTAM.
AMR:CARBAPENEM Predicted genes/mutations encoding resistance to CARBAPENEM.
AMR:ESBL Predicted genes/mutations encoding Extended-spectrum beta-lactamases (ESBLs).
AMR:INHIBITOR-RESISTANT Predicted genes/mutations encoding resistance to Beta-Lactamase inhibitors.
AMR:COLISTIN Predicted genes/mutations encoding resistance to COLISTIN.
AMR:FOSFOMYCIN Predicted genes/mutations encoding resistance to FOSFOMYCIN.
AMR:MACROLIDE Predicted genes/mutations encoding resistance to MACROLIDE.
AMR:PHENICOL Predicted genes/mutations encoding resistance to PHENICOL.
AMR:QUINOLONE Predicted genes/mutations encoding resistance to QUINOLONE.
AMR:RIFAMYCIN Predicted genes/mutations encoding resistance to RIFAMYCIN.
AMR:GLYCOPEPTIDES Predicted genes/mutations encoding resistance to GLYCOPEPTIDES.
AMR:SULFONAMIDE Predicted genes/mutations encoding resistance to SULFONAMIDE.
AMR:TETRACYCLINE Predicted genes/mutations encoding resistance to TETRACYCLINE.
AMR:TIGECYCLINE Predicted genes/mutations encoding resistance to TIGECYCLINE.
AMR:TRIMETHOPRIM Predicted genes/mutations encoding resistance to TRIMETHOPRIM.
AMR:BLA_INTRINSIC Predicted intrinsic beta-lactamase in Klebsiella.
STRESS:COPPER Predicted genes encoding resistance to COPPER.
STRESS:MERCURY Predicted genes encoding resistance to MERCURY.
STRESS:NICKEL Predicted genes encoding resistance to NICKEL.
STRESS:SILVER Predicted genes encoding resistance to SILVER.
STRESS:TELLURIUM Predicted genes encoding resistance to TELLURIUM.
STRESS:ARSENIC Predicted genes encoding resistance to ARSENIC.
STRESS:FLUORIDE Predicted genes encoding resistance to FLUORIDE.
STRESS:QUATERNARY_AMMONIUM Predicted genes encoding resistance to QUATERNARY_AMMONIUM.
VIRULENCE:clb colibactin (clb)
VIRULENCE:iro salmochelin (iro)
VIRULENCE:iuc aerobactin (iuc)
VIRULENCE:rmp hypermucoidy (rmpA, rmpA2)
VIRULENCE:ybt yersiniabactin (ybt)
Others Other resistances
REPLICON:INC_TYPE INC type of the plasmid.
REPLICON:MOB_TYPE MOB type of the plasmid.
REPLICON:MPF_TYPE MPF type of the plasmid.
ANNOTATION Annotations of the predicted plasmids.
CONTIGS Contigs associated with the predicted plasmids.
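The columns above can be read programmatically. The sketch below assumes that the .KleTy file is tab-delimited with the header row shown in the example output; the helper names are hypothetical. It selects the per-plasmid rows and splits the dotted HierCC value into one entry per level.

```python
import csv

HIERCC_COLUMN = "HC1360.500.200.100.50.20.10.5.2"


def read_klety(path):
    """Read a .KleTy report into a list of dicts keyed by column name."""
    with open(path, newline="") as handle:
        # Assumption: tab-delimited, first row is the header.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


def plasmid_rows(rows):
    """Keep rows whose REPLICON is 'P' plus a number, e.g. 'P1'."""
    return [
        r for r in rows
        if r["REPLICON"][:1] == "P" and r["REPLICON"][1:].isdigit()
    ]


def hiercc_levels(rows, column=HIERCC_COLUMN):
    """Map each HierCC level (e.g. 'HC1360') to its cluster in the 'ALL' row."""
    summary = next(r for r in rows if r["REPLICON"] == "ALL")
    levels = column[2:].split(".")       # '1360', '500', ..., '2'
    values = summary[column].split(".")  # e.g. '10', '10', ..., 'ND'
    return {f"HC{level}": value for level, value in zip(levels, values)}
```

For example, plasmid_rows(read_klety("CP015990.KleTy")) would return the "P1" row of the report shown earlier, from which REFERENCE, PLASTYPE and COVERAGE can be read by name.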

.cgMLST.profile.gz contains the MD5 hashed allelic profiles.

Once decompressed, this file can be used as input for GrapeTree (https://achtman-lab.github.io/GrapeTree/MSTree_holder.html).
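Decompressing the profile does not require an external tool. A minimal sketch with the Python standard library follows; the helper name gunzip is hypothetical, and the original .gz file is kept in place.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src, dest=None):
    """Decompress `src`; by default write next to it without the '.gz' suffix."""
    src = Path(src)
    dest = Path(dest) if dest is not None else src.with_suffix("")
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return dest


# Usage: gunzip("CP015990.cgMLST.profile.gz") writes CP015990.cgMLST.profile
```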


Reproduction Instructions

All data required to reproduce the analysis are distributed in this repository under https://github.com/zheminzhou/KleTy/tree/main/db

These include:

  • plasmids.repr.clu.gz - IMPORTANT. A mapping table that specifies correlations between plasmids and PT/PCs.
  • HierCC.tsv.gz - A tab-delimited table consisting of HierCC results for all ~70,000 genomes
  • klebsiella.cgmlst - A list of core genes used in the dcgMLST scheme
  • klebsiella.refsets.fas.gz - reference alleles for all pan genes (for calling new alleles)
  • klebsiella.species - A mapping table that specifies correlations between genomes and Klebsiella species
  • profile.parq - Allelic profiles of all ~70,000 genomes in parquet format, which can be read using the Pandas library (https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html).
  • stress_CDS.gz - reference sequences for resistance to metal/biocides
  • traditional_lasmid_type.fas.gz - reference sequences for INC/MOB/MPF types of the plasmids.
  • kleborate/* - reference sequences from kleborate.
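The gzipped, tab-delimited tables in this list (such as HierCC.tsv.gz) can be inspected without unpacking them. The sketch below returns the first few rows as lists of fields; it makes no assumption about column names or whether a header row is present, and the helper name head_tsv_gz is hypothetical.

```python
import csv
import gzip
from itertools import islice


def head_tsv_gz(path, n=5):
    """Return the first `n` rows of a gzipped tab-delimited file as lists."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(islice(reader, n))


# Usage (run from the KleTy folder): head_tsv_gz("db/HierCC.tsv.gz")
```

Reading only the first rows keeps memory use small, which matters for a table covering ~70,000 genomes.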

License:GNU General Public License v3.0

