TerenceDong / gut_resistotype

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

About

This is the repository for the analysis scripts and data files used in the following study:

Population-level impacts of antibiotic usage on the human gut microbiome. Unpublished. (2021) Kihyun Lee, Sebastien Raguideau, Kimmo Sir`en, Francesco Asnicar, Fabio Cumbo, Falk Hildebrand, Nicola Segata, Chang-Jun Cha, Christopher Quince

Data files

Multiple links will be provided for the same file, sometimes, using dropbox and zenodo. You can also visit to the zenodo page https://doi.org/10.5281/zenodo.7383076

Human microbiome metagenome assemblies

Human microbiome metagenome assemblies previously published in Pasolli et al., 2019, Cell 176, 649–662**

Fasta files can be accessed following the instructions provided in http://segatalab.cibio.unitn.it/data/Pasolli_et_al.html

Sample metadata provided in Pasolli et al. 2019 was additionally curated with respect to (a) health and disease status of the subject and (b) antibiotic usage preceding the sampling. The final metadata used in our study is provided in this link: microbiome sample metadata tsv or microbiome sample metadata tsv zenodo

NCBI RefSeq prokaryotic genome assemblies

We accessed all prokaryotic genomes available in NCBI RefSeq at 2019/04/19.

Assembly accession numbers (RefSeq GCF_* / GenBank GCA_*) of the prokaryotic genomes analyzed in our study is provided in this link: genome accession list tsv genome accession list tsv zenodo

Modified version of CARD used as the references for annotation of antibiotic resistance genes

We used the database of protein sequences for known ARGs, which was created by modifying the October 2017 version of CARD, cite Jia et al. 2017

For this modified version of CARD, the following files are provided in this repository.

(1) fasta file of protein homolog model reference sequences after filtering out variant/mutation-dependent resistance genes fasta file of protein homolog model reference sequences after filtering out variant/mutation-dependent resistance genes zenodo

(2) protein accessions to ARG families map tsv protein accessions to ARG families map tsv zenodo

(3) ARG families to antibiotics map tsv ARG families to antibiotics map tsv zenodo

(4) ARG families to mechanisms map tsv ARG families to mechanisms map tsv zenodo

(5) list of 752 ARG family names covered in our annotation DB txt list of 752 ARG family names covered in our annotation DB txt zenodo

COG protein sequences

We used COG, the 2014 update version, accessed through NCBI FTP at December 2018 using the following link COG 2014 protein fasta

Our analysis using COG is focused at the list of 40 universally single-copy orthologs SCGs txt 40 universally single-copy orthologs SCGs txt zenodo

Mapping from gi numbers (used in COG protein sequence headers) to COG numbers is needed and provided in this link:gi-to-COG mapping tsv gi-to-COG mapping tsv zenodo

shorter version only containing the COGs representing 40 SCGs gi-to-COG mapping SCGs only tsv gi-to-COG mapping SCGs only tsv zenodo

Tools used in our computational analysis steps

Below are the commands and scripts used in our main analytical steps.

Script files called in the commands below can be found in the script section (scripts) of this repository.

External tools called in the commands are:

Databases and fixed parameter files (such as list of SCGs, list of ARG families) can be found in the data branch of this repository.

Note that before using the .fasta database files in blastp searches you have to create diamond database locally using your version of diamond.

From assembled metagenomes to annotated ORFs

STEP 1. from input metagenome assemblies {SAMPLE}.fasta

create ORF nucleotide/protein sequence fasta and gff files {SAMPLE}.fna {SAMPLE}.faa {SAMPLE}.gff

prodigal -p meta -q -f gff -i {SAMPLE}.fasta -o {SAMPLE}.gff -a {SAMPLE}.faa -d {SAMPLE}.fna 

STEP 2. from ORF protein fasta {SAMPLE}.faa

search blastp hits in CARD, filter hits and annotate gene families to the ORFs {SAMPLE}.CARD2017.blastp.i80d80.assign

diamond blastp -d protein_fasta_protein_homolog_model.refined.legit.fasta.dmnd -p {threads} -e 1e-20 -k 1 -q {SAMPLE}.faa -o {SAMPLE}.CARD2017.blastp

java BlastResultFilterLkh -in {SAMPLE}.CARD2017.blastp -idcut 80 -dfasta protein_fasta_protein_homolog_model.refined.legit.fasta -dcovCut 80 -out {SAMPLE}.CARD2017.blastp.i80d80

java TableColumnReformatVerticalSepField -in {SAMPLE}.CARD2017.blastp.i80d80 -col 2 -field 2 -header F -out {SAMPLE}.CARD2017.blastp.i80d80.acctmp

java TableTranslateColumnByDictionaryLkh -i {SAMPLE}.CARD2017.blastp.i80d80.acctmp -icol 2 -iheader F -d protein_fasta_protein_homolog_model.refined.accession_2_ARG_cluster_assign.tab -dheader T -q 1 -t 2 -o {SAMPLE}.CARD2017.blastp.i80d80.assign

search blastp hits in COG, filter SCG hits and annotate SCG COG numbers to the ORFs {SAMPLE}.COG.blastp.cog.scg

diamond blastp -d prot2003-2014.fa.dmnd -p {threads} -e 1e-7 -k 1 -q {SAMPLE}.faa -o {SAMPLE}.COG.blastp

java TableTranslateColumnByDictionaryLkh -i {SAMPLE}.COG.blastp -icol 2 -iheader F -d gi_to_cog.tab -dheader F -q 1 -t 2 -o {SAMPLE}.COG.blastp.cog

java TableFilteringRowsLkh -tbl {SAMPLE}.COG.blastp.cog -sep t -headerRow F -col 2 -targetList list_speci_universal_single_copy_genes.cogs -out {SAMPLE}.COG.blastp.cog.scg

Snakemake that we used to predict and annotate ORFs in metagenome assemblies is available here metagenome ORF snake

Note that file paths in the snakemake should be adjusted to fit with yours.

From reference genomes to annotated ORFs

For genome assemblies, we used the set of commands that are same with what was used for metagenom assemlbies.

Snakemake that we used to predict and annotate ORFs in metagenome assemblies is available here genome ORF snake

Note that file paths in the snakemake should be adjusted to fit with yours.

Creating nomalized abundance profiles of ARG families (copies per genome, cpg)

Case A, where you want to use contig coverage depth calculated by yourself through aligning reads to the contigs.

Run the following steps for each/every sample individually.

starting with assembled contigs in {SAMPLE}.fasta and raw reads in {SAMPLE}_1.fastq.gz and {SAMPLE}_2.fastq.gz

and ORF to SCG, ORF to CARD annotation outputs {SAMPLE}.COG.blastp.cog.scg {SAMPLE}.CARD.blastp.i80d80.assign

python fasta_filter_by_length.py --fasta {SAMPLE}.fasta --min_length 500 --out_tab {SAMPLE}.length.tab --out_bed {SAMPLE}.filt_500bp.bed --out_fasta {SAMPLE}.filt_500bp.fasta

bowtie2-build --quiet {SAMPLE}.filt_500bp.fasta {SAMPLE}.filt_500bp

bowtie2 -q --very-fast -x {SAMPLE}.filt_500bp -1 {SAMPLE}_1.fastq.gz -2 {SAMPLE}_2.fastq.gz --no-unal -p 2 -S {SAMPLE}.filt_500bp.sam

samtools view -b --threads 3 -o {SAMPLE}.filt_500bp.bam {SAMPLE}.filt_500bp.sam

samtools sort --threads 3 -o {SAMPLE}.filt_500bp.sort.bam {SAMPLE}.filt_500bp.bam

samtools index {SAMPLE}.filt_500bp.sort.bam

samtools depth -b {SAMPLE}.filt_500bp.bed -m 15000 {SAMPLE}.filt_500bp.sort.bam > {SAMPLE}.filt_500bp.depth

python samtools_depth_to_contig_stat.py --bed {SAMPLE}.filt_500bp.bed --depth {SAMPLE}.filt_500bp.depth --out {SAMPLE}.filt_500bp.cov_stat

python cpg_profile_from_contig_orf_annotation.py --cov_stat {SAMPLE}.filt_500bp.cov_stat --depth_col 2 --scg_annot {SAMPLE}.COG.blastp.cog.scg --scg_panel list_speci_universal_single_copy_genes.cogs --target_annot {SAMPLE}.CARD.blastp.i80d80.assign --target_panel reference_ARG_assigns.txt --out {SAMPLE}.ARG_panel.cpg

Case B, where you want to use the coverage information stored in the metaSpades contigs header lines

Run the following steps for each/every sample individually.

starting with assembled contigs in {SAMPLE}.fasta

and ORF to SCG, ORF to CARD annotation outputs {SAMPLE}.COG.blastp.cog.scg {SAMPLE}.CARD.blastp.i80d80.assign

python cpg_profile_from_contig_orf_annotation.py --cov_stat {SAMPLE}.fasta --depth_col 0 --scg_annot {SAMPLE}.COG.blastp.cog.scg --scg_panel list_speci_universal_single_copy_genes.cogs --target_annot {SAMPLE}.CARD.blastp.i80d80.assign --target_panel reference_ARG_assigns.txt --out {SAMPLE}.ARG_panel.cpg

Resulting files that we produced at this point

  • Normalized abundance of ARG families (cpgs; samples X ARG families matrix) in 6104 adult stool metagenomes: tsv file dropbox tsv file zenodo
  • Normalized abundance of ARG families (cpgs; samples X ARG families matrix) in 6006 adult stool metagenomes after filtering out outliers: tsv file dropbox tsv file zenodo
  • Note that sample names are given as row names and ARG family names are given as column names.

Creating the catalogue of ARG sequences from the metagenomes and the reference genomes

We first collected the nucleotide sequences of the ORFs annotated as ARGs, from both metagenomes and genomes.

Each metagenome sample was processed with the following command.

# SAMPLE = metagenome sample ID

java PrepRichTitledFeatureFastaForDBHitORFs -s {SAMPLE} -d gut -f {SAMPLE}.fna -annot {SAMPLE}.CARD.blastp.i80d80.assign -o {SAMPLE}.CARD_hit_i80d80.faa -cgb {SAMPLE}.contig_bin.map -splabel {SAMPLE}.bin_SGB.map -sptax SGB.SGB_taxonomy.tsv -annotTitle CARD -genometype genomeBin -sptype SGB

Each refseq genome assembly was processed with the following command.

# GACC = genome assembly accession

java PrepRichTitledFeatureFastaForDBHitORFs -s refseq -d refseq -f {GACC}.fna -annot {GACC}.CARD.blastp.i80d80.assign -o {GACC}.CARD_hit_i80d80.fna -wg {GACC} -gtax {GACC}.taxonomy_kpcofgst -annotTitle CARD -genometype refGenome -sptype refSpecies

The resulting fasta files, created for each metagenome/genome contains information-rich header lines which will be later used in the analyses downstream of clustering step.

Header lines for metagenomic ORFs look like:

>NODE_1_length_562825_cov_57.8364_1;sample=MV_FEI1_t1Q14;genomeBin=AsnicarF_2017__MV_FEI1_t1Q14__bin.8;SGB=10068;taxonomy=k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli|t__SGB10068;dataset=gut;COG=COG1215

Header lines for genomic ORFs look like:

>NZ_AYGS01000026.1_2;sample=refseq;refGenome=GCF_000513795.2;refSpecies=TBD;taxonomy=k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Pseudomonadales|f__Moraxellaceae|g__Acinetobacter|s__Acinetobacter baumannii|t__Acinetobacter baumannii UH0207;dataset=refseq;CARD=mphD

Next, the ORF nucleotide fasta files of all metagenomes and genomes were concatenated into a single large fasta file, subjected to clustering.

Resulting files that we produced at this point

Then perform clustering on the nucleotides sequences (rather than amino acid sequences) at several identity cutoffs,
using the following clustering commands:

\# fna = {single fasta file containing all ARG ORF nucleotide sequences with information-rich header lines}
mmseqs createdb ${fna} ${fna}.mmdb

# 100% identity 90% coverage clustering
mkdir nt_cluster_100_tmp
mmseqs cluster ${fna}.mmdb nt_clusters_not_linc_i100_c90/nt_cluster_100 nt_cluster_100_tmp --threads 12 --min-seq-id 1.0 -c 0.90 --cov-mode 0
mmseqs createtsv ${fna}.mmdb ${faa}.mmdb nt_clusters_not_linc_i100_c90/nt_cluster_100 nt_clusters_not_linc_i100_c90/nt_cluster_100.tsv
mmseqs result2repseq ${fna}.mmdb nt_clusters_not_linc_i100_c90/nt_cluster_100 nt_clusters_not_linc_i100_c90/nt_cluster_100_rep
mmseqs result2flat ${fna}.mmdb ${faa}.mmdb nt_clusters_not_linc_i100_c90/nt_cluster_100_rep nt_clusters_not_linc_i100_c90/nt_cluster_100_rep.fasta

# 99% identity 90% coverage clustering
mkdir nt_cluster_99_tmp
mmseqs cluster ${fna}.mmdb nt_clusters_not_linc_i99_c90/nt_cluster_99 nt_cluster_99_tmp --threads 12 --min-seq-id 0.99 -c 0.90 --cov-mode 0
mmseqs createtsv ${fna}.mmdb ${faa}.mmdb nt_clusters_not_linc_i99_c90/nt_cluster_99 nt_clusters_not_linc_i99_c90/nt_cluster_99.tsv
mmseqs result2repseq ${fna}.mmdb nt_clusters_not_linc_i99_c90/nt_cluster_99 nt_clusters_not_linc_i99_c90/nt_cluster_99_rep
mmseqs result2flat ${fna}.mmdb ${faa}.mmdb nt_clusters_not_linc_i99_c90/nt_cluster_99_rep nt_clusters_not_linc_i99_c90/nt_cluster_99_rep.fasta

# 95% identity 80% coverage clustering
mkdir nt_cluster_95_tmp
mmseqs cluster ${fna}.mmdb nt_clusters_not_linc_i95_c80/nt_cluster_95 nt_cluster_95_tmp --threads 12 --min-seq-id 0.95 -c 0.80 --cov-mode 0
mmseqs createtsv ${fna}.mmdb ${faa}.mmdb nt_clusters_not_linc_i95_c80/nt_cluster_95 nt_clusters_not_linc_i95_c80/nt_cluster_95.tsv
mmseqs result2repseq ${fna}.mmdb nt_clusters_not_linc_i95_c80/nt_cluster_95 nt_clusters_not_linc_i95_c80/nt_cluster_95_rep
mmseqs result2flat ${fna}.mmdb ${faa}.mmdb nt_clusters_not_linc_i95_c80/nt_cluster_95_rep nt_clusters_not_linc_i95_c80/nt_cluster_95_rep.fasta

# 90% identity 80% coverage clustering
mkdir nt_cluster_90_tmp
mmseqs cluster ${fna}.mmdb nt_clusters_not_linc_i90_c80/nt_cluster_90 nt_cluster_90_tmp --threads 12 --min-seq-id 0.90 -c 0.80 --cov-mode 0
mmseqs createtsv ${fna}.mmdb ${faa}.mmdb nt_clusters_not_linc_i90_c80/nt_cluster_90 nt_clusters_not_linc_i90_c80/nt_cluster_90.tsv
mmseqs result2repseq ${fna}.mmdb nt_clusters_not_linc_i90_c80/nt_cluster_90 nt_clusters_not_linc_i90_c80/nt_cluster_90_rep
mmseqs result2flat ${fna}.mmdb ${faa}.mmdb nt_clusters_not_linc_i90_c80/nt_cluster_90_rep nt_clusters_not_linc_i90_c80/nt_cluster_90_rep.fasta

Resulting files that we produced at this point

Assign LCA to the clusters of ARGs

We performed assignment of LCA taxon to each ARG cluster defined at 99% identity.

  • LCA for each cluster derived from the cluster member ORFs that came from either a RefSeq genome or a high-quality MAG.
  • Every single RefSeq genome and high-quality MAG has pre-defined species-level genomic bin (SGB) affiliation and full-rank taxonomy. See the files linked below.
  • Clusters without any such genome-resolved ORF becomes 'unclassified'

Mapping files used at this step:

  • Metagenome bin ID to SGB ID tsv file
  • RefSeq genome assembly accession to SGB ID tsv file
  • SGB ID to full rank taxonomy tsv file
  • List of high-quality MAGs txt file

Now LCA assignment was performed using the following command,
where

{CLUSTER_TSV} = the output tsv file from mmseqs createtsv in the previous step
{HQMAG_LIST} = the file containing list of high-quality MAGs
{HQMAG_SGB_MAP} = the file mapping from metagenome bin ID to SGB ID
{REFSEQ_SGB_MAP} = the file mapping from RefSeq genome assembly accession to SGB ID
{SGB_TAX_MAP} = the file mapping from SGB ID to the full rank taxonomy

run

java GeneClusterTsvSubjectToSimpleLCAClassifier -i {CLUSTER_TSV} -hq {HQMAG_LIST} -r2s {REFSEQ_SGB_MAP} -h2s {HQMAG_SGB_MAP} -s2t {SGB_TAX_MAP} -o LCA_analysis_for_ARG_cl99.kp2SGB

lca.py -i LCA_analysis_for_ARG_cl99.kp2SGB.LCAsub_hqPlusRef -o LCA_analysis_for_ARG_cl99.kp2SGB.result_using_HQPlusRef -b 999 -id 0 -cov 0 -tid 0 -tcov 0 -t no -fh "env" -flh "unknown"

About


Languages

Language:Java 79.1%Language:Python 20.9%