- TIPP paper - https://doi.org/10.1093/bioinformatics/btu721
- TIPP software can be found here.
- TIPP reference dataset (2014) - https://github.com/tandyw/tipp-reference/releases/download/v2.0.0/tipp.zip
- (New) TIPP reference dataset (2020) - https://obj.umiacs.umd.edu/tipp/tipp2-refpkg.tar.gz
In this document, we describe the protocol used to construct a new version of TIPP reference packages. We used the same set of 40 marker genes as used by Mende et al. 2013, Sunagawa et al. 2013, and mOTUs. These marker genes are believed to be single-copy and universally present in prokaryotic genomes.
We downloaded all bacterial and archaeal genomes from the NCBI RefSeq database. RefSeq provides a metadata file (assembly_summary.txt) for each of Bacteria and Archaea. Each file lists the genome accession, taxid, organism name, FTP download path, and other fields. We downloaded these files in November 2019; they are provided in the data folder. The latest versions can be downloaded with:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/assembly_summary.txt
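As a rough illustration of how such a metadata file can be turned into a download list, the sketch below reads the column names from the commented header line and builds per-genome file URLs. The function names are hypothetical, and the file suffixes (`_genomic.fna.gz`, `_protein.faa.gz`, `_cds_from_genomic.fna.gz`) are an assumption based on the usual RefSeq FTP layout; this is not the code in get_sequences.py.

```python
def parse_assembly_summary(text):
    """Return (assembly_accession, taxid, ftp_path) tuples from assembly_summary.txt content."""
    header = None
    rows = []
    for line in text.splitlines():
        if line.startswith("#"):
            # Column names live on a commented line; other comment lines are ignored.
            fields = line.lstrip("#").strip().split("\t")
            if "assembly_accession" in fields:
                header = fields
            continue
        if not line.strip() or header is None:
            continue
        record = dict(zip(header, line.split("\t")))
        # Rows without a usable download path are skipped.
        if record.get("ftp_path", "na") in ("", "na"):
            continue
        rows.append((record["assembly_accession"], record["taxid"], record["ftp_path"]))
    return rows


def download_urls(ftp_path):
    """Build the assumed URLs of the genome, protein, and CDS files for one genome."""
    ftp_path = ftp_path.rstrip("/")
    base = ftp_path.rsplit("/", 1)[-1]
    return {
        "genome": f"{ftp_path}/{base}_genomic.fna.gz",
        "protein": f"{ftp_path}/{base}_protein.faa.gz",
        "cds": f"{ftp_path}/{base}_cds_from_genomic.fna.gz",
    }
```

Looking columns up by name rather than by position keeps the sketch working if NCBI appends new columns to the file.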
For each genome accession in this master list, we download the genome sequence data, protein sequences, and nucleotide gene sequences. We rename the protein and nucleotide gene sequences so that each protein and its corresponding nucleotide gene sequence have the same name. We also prepend the genome accession to each gene sequence name, which makes it possible to trace the origin of any gene sequence in the database.
For example, one of the gene sequences in the ArgS (COG0018) marker gene is
GCF_002287175.1_NZ_LMVM01000012.1_cds_WP_069582217.1_1177
Its identifier contains the genome accession (GCF_002287175.1), the sequence accession (NZ_LMVM01000012.1), and the protein name (WP_069582217.1).
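The renaming can be sketched as follows. This assumes the nucleotide gene sequences come from an NCBI `cds_from_genomic` file, whose headers usually start with `lcl|` and carry a `[protein_id=...]` tag; both function names are hypothetical and the actual script may differ.

```python
import re


def renamed_id(genome_accession, cds_header):
    """Prefix the CDS identifier with the genome accession."""
    seq_id = cds_header.lstrip(">").split()[0]
    # Assumed NCBI convention: local identifiers start with "lcl|".
    if seq_id.startswith("lcl|"):
        seq_id = seq_id[len("lcl|"):]
    return f"{genome_accession}_{seq_id}"


def protein_id(cds_header):
    """Extract the protein accession so the matching protein record can be given the same name."""
    match = re.search(r"\[protein_id=([^\]]+)\]", cds_header)
    return match.group(1) if match else None
```

Giving the protein record the identifier returned by `renamed_id` is one way to keep the .faa and .fna names in sync.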
Once we have the protein and corresponding nucleotide gene sequences for a genome, we use the fetchMG tool to extract the 40 marker gene sequences. Note that fetchMG uses both the protein and nucleotide sequences, so the two files must share the same base name and differ only in extension: .faa for proteins and .fna for nucleotide sequences. We used fetchMG v1.0 for our analysis.
fetchMG.pl -m extraction -v gene_aa_filename -o output_folder
The script get_sequences.py combines the steps of downloading genomes and extracting the 40 marker genes. To run it:
python get_sequences.py
This should create an output folder with *.fna (nucleotide gene sequences) and *.faa (protein sequences) files for the 40 marker genes. The report.txt has metadata information for the selected gene sequences. Please look at the output_README.txt file for column headers.
We removed all gene sequences whose nucleotide and protein sequence lengths were inconsistent: the nucleotide gene sequence length (after removing the stop codon) should equal 3 times the protein sequence length. We also removed sequences whose length was more than 3 standard deviations away from the median gene length. After length filtering, ~140K-170K gene sequences remain per marker gene.
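A minimal sketch of the two length filters is given below. It assumes the standard stop codons (TAA, TAG, TGA), measures length in amino acids, and uses the population standard deviation; these choices and the function names are assumptions, not details confirmed by the protocol.

```python
import statistics

STOP_CODONS = {"TAA", "TAG", "TGA"}


def lengths_consistent(nuc, prot):
    """True if the nucleotide length (without a trailing stop codon) is 3x the protein length."""
    nuc = nuc.upper()
    if nuc[-3:] in STOP_CODONS:
        nuc = nuc[:-3]
    return len(nuc) == 3 * len(prot.rstrip("*"))


def filter_by_length(pairs, n_sd=3.0):
    """pairs maps name -> (nucleotide, protein); return the set of names that pass both filters."""
    lengths = {
        name: len(prot.rstrip("*"))
        for name, (nuc, prot) in pairs.items()
        if lengths_consistent(nuc, prot)
    }
    if len(lengths) < 2:
        return set(lengths)
    median = statistics.median(lengths.values())
    sd = statistics.pstdev(lengths.values())
    # Keep sequences within n_sd standard deviations of the median length.
    return {name for name, n in lengths.items() if abs(n - median) <= n_sd * sd}
```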
For each marker gene, we generate a multiple sequence alignment (MSA) of the protein sequences using UPP. The chosen parameters essentially generate PASTA alignments, because we selected gene sequences that are almost all full-length. We used UPP version 4.3.10.
run_upp.py -s gene_aa_filename.faa -p tmp_dir -B 1000000 -M -1 \
-T 0.33 -m amino -o output_folder
We used the alignment with insertion sites masked (*masked.fasta) in all subsequent steps.
Once we have the protein MSAs, we back-translate them to nucleotide alignments. Because we have the nucleotide sequence for each protein sequence, we use it to match each amino acid to its corresponding codon.
python backtranslate_refseq.py protein_alignment.fasta \
gene_nuc_filename.fna gene_nuc_alignment.fasta output_log.txt
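The idea behind back-translation can be sketched for a single sequence as shown below. The sketch assumes that the aligned protein row still contains every residue of the original protein, in order, and that gaps are written as `-`; the function name is hypothetical, and backtranslate_refseq.py may handle masked residues and logging differently.

```python
def backtranslate(aligned_protein, nucleotide):
    """Replace each residue with its codon and each gap with a three-character gap."""
    usable = len(nucleotide) - len(nucleotide) % 3
    codons = [nucleotide[i:i + 3] for i in range(0, usable, 3)]
    out = []
    pos = 0
    for ch in aligned_protein:
        if ch == "-":
            out.append("---")
            continue
        if pos >= len(codons):
            raise ValueError("protein has more residues than the gene has codons")
        out.append(codons[pos])
        pos += 1
    # Any codons left over (for example a trailing stop codon) are not emitted.
    return "".join(out)
```

For example, `backtranslate("M-K", "ATGAAATAA")` walks the two residues through the first two codons and pads the gap, which gives `ATG---AAA`.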
We removed gappy sites from the alignment. This step is performed after translating back to DNA, so that we don't lose the amino acid to codon mapping from the RefSeq files. We removed all sites with 95% or more gaps.
f=[alignment file]
percent=5
m=`echo $( grep ">" $f|wc -l ) \* $percent / 100 |bc`
run_seqtools.py -infile $f -masksites $m -outfile $f.mask${percent}sites.fasta
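The masking rule can be illustrated in a few lines of Python. This sketch assumes that a site is dropped when it has fewer than `m` non-gap characters, with `m` computed by the same integer arithmetic as the `bc` expression above; it mirrors the described behavior and is not the run_seqtools.py implementation.

```python
def mask_gappy_sites(alignment, percent=5):
    """alignment maps name -> aligned sequence; drop columns with too few non-gap characters."""
    names = list(alignment)
    min_non_gap = len(names) * percent // 100  # integer division, as in bc
    n_cols = len(alignment[names[0]])
    keep = [
        col for col in range(n_cols)
        if sum(alignment[name][col] != "-" for name in names) >= min_non_gap
    ]
    return {name: "".join(alignment[name][col] for col in keep) for name in names}
```

Note that for very small alignments the threshold rounds down to 0, in which case no column is removed.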
We also remove sequences that are fragments.
taxapercent=33
m2=`echo $( cat $f.mask${percent}sites.fasta|wc -L ) \* $taxapercent / 100 |bc`
run_seqtools.py -infile $f.mask${percent}sites.fasta -filterfragments $m2 -outfile $out
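The fragment filter can be sketched in the same way. Here the threshold is 33% of the alignment length, matching the `wc -L` computation above, and a sequence is assumed to be kept when it has at least that many non-gap characters; the function name and the exact comparison are assumptions.

```python
def filter_fragments(alignment, taxapercent=33):
    """alignment maps name -> aligned sequence; drop sequences with too few non-gap characters."""
    aln_len = max(len(seq) for seq in alignment.values())
    min_chars = aln_len * taxapercent // 100  # integer division, as in bc
    return {
        name: seq
        for name, seq in alignment.items()
        if len(seq) - seq.count("-") >= min_chars
    }
```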
Based on the RefSeq metadata file, we have a taxid for each genome sequence. However, NCBI taxids are sometimes deprecated, merged, or updated. To get the latest taxid and the complete NCBI lineage for each gene sequence, we run the following steps, which rely heavily on Taxtastic v0.8.11.
python generate_species_mapping.py gene_nuc_alignment.fna species.mapping
cut -f 2 -d ',' species.mapping > species.txt
taxit update_taxids species.txt -o species.updated.txt
taxit taxtable -i species.updated.txt -o taxonomy.table
paste -d "," species.txt <( sed 's/"//g' species.updated.txt) > species.old2new.mapping
python update_species_mapping.py species.old2new.mapping \
species.mapping species.updated.mapping
python build_taxonomic_tree.py taxonomy.table species.updated.txt unrefined.taxonomy
perl build_unrefined_tree.pl species.updated.mapping \
unrefined.taxonomy unrefined.taxonomy.renamed
The output of “taxit taxtable” is a table with all the taxonomic ranks organized. It is used later when computing taxonomic profiles.
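The mapping update in the middle of this pipeline amounts to a dictionary lookup. The sketch below assumes species.mapping holds `sequence_name,taxid` rows and species.old2new.mapping holds `old_taxid,new_taxid` rows, as suggested by the `cut` and `paste` commands above; the function names are hypothetical and update_species_mapping.py may do more than this.

```python
import csv
import io


def read_pairs(text):
    """Parse two-column comma-separated text into a list of tuples."""
    return [tuple(row[:2]) for row in csv.reader(io.StringIO(text)) if len(row) >= 2]


def update_mapping(mapping_rows, old2new_rows):
    """Replace each taxid with its updated value, leaving unknown taxids unchanged."""
    old2new = dict(old2new_rows)
    return [(name, old2new.get(taxid, taxid)) for name, taxid in mapping_rows]
```

A header row, if present, passes through unchanged because its taxid column name maps to itself.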
We used the GTRCAT and GTRGAMMA models of RAxML (version 8.2.12) to generate a refined taxonomy for each marker gene.
raxmlHPC-PTHREADS-AVX -j -m GTRCAT -F -T 4 -p 1111 -g unrefined.taxonomy.renamed \
-s gene_nuc_alignment.fasta -n refined -w ${work}/raxml_output/
raxmlHPC-PTHREADS-AVX -j -m GTRGAMMA -f e -t RAxML_result.refined -T 4 -p 1111 \
-s gene_nuc_alignment.fasta -n optimized -w ${work}/raxml_output
We also generate a maximum likelihood gene tree for each marker gene.
raxmlHPC-PTHREADS-AVX -j -m GTRCAT -F -T 4 -p 1111 \
-s gene_nuc_alignment.fasta -n mlgene -w ${work}/raxml_output_mlgene/
raxmlHPC-PTHREADS-AVX -j -m GTRGAMMA -f e -t RAxML_result.mlgene -T 4 -p 1111 \
-s gene_nuc_alignment.fasta -n optimized -w ${work}/raxml_output_mlgene/
We concatenate all gene sequences (nucleotide) from all marker genes to create a combined fasta file, and a sequence to marker mapping file (seq2marker.tab). We used BLAST+ (v 2.9.0) to create database files.
cat <all nucleotide gene sequence files (*.fna)> > alignment.fasta
makeblastdb -in alignment.fasta -out alignment.fasta -dbtype nucl
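The concatenation and the sequence-to-marker table can be produced in one pass, as sketched below. The sketch assumes one .fna file per marker gene, named after the marker (for example COG0018.fna), and a two-column tab-separated layout for seq2marker.tab; both are assumptions, as is the function name.

```python
from pathlib import Path


def combine_markers(fna_files, combined_path, mapping_path):
    """Concatenate per-marker FASTA files and record which marker each sequence came from."""
    count = 0
    with open(combined_path, "w") as combined, open(mapping_path, "w") as mapping:
        for fna in sorted(Path(f) for f in fna_files):
            marker = fna.stem  # assumed: the file name without extension is the marker name
            with open(fna) as handle:
                for line in handle:
                    if line.startswith(">"):
                        parts = line[1:].split()
                        name = parts[0] if parts else ""
                        mapping.write(f"{name}\t{marker}\n")
                        count += 1
                    # Guard against a missing final newline when files are joined.
                    combined.write(line if line.endswith("\n") else line + "\n")
    return count
```

The combined file can then be passed to `makeblastdb` as in the command above.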