shenwei356 / gtdb-taxdump

GTDB taxonomy taxdump files with trackable TaxIds

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Silva database?

davidvilanova opened this issue · comments

Hi,
Thanks for the gtdb-taxdump. I´m working on Silva , is there any available tax dump ?

No existing one, but you can create it by yourself following the example, using the binaries here.

Great,

I have tried so far, see shot below.

The taxonomy is two columns , fist column is accession and seconds columns are taxonomy as follows:

129138 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali

image

What is wrong ? there are two columns in the taxonomy file.

Taxa should be tab-separated, as shown in Example 4.

  1. Prepare lineages

     $ echo -e "129138\tBacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali" \
         | sed 's/;/\t/g' \
         | cut -f 2-8
     Bacteria        Proteobacteria  Gammaproteobacteria     Pseudomonadales Pseudomonadaceae        Pseudomonas     Pseudomonas amygdali
    
  2. Create taxdump

     $ echo -e "129138\tBacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali" \
         | sed 's/;/\t/g' \
         | cut -f 2-8 \
         | taxonkit create-taxdump -R "superkingdom,phylum,class,order,famil,genus,species" -O taxdump
    
  3. Test

     $ taxonkit list -n -r --ids 1 --data-dir taxdump/
     1 [no rank] root
       609216830 [superkingdom] Bacteria
         1641076285 [phylum] Proteobacteria
           329474883 [class] Gammaproteobacteria
             86398254 [order] Pseudomonadales
               1478401337 [famil] Pseudomonadaceae
                 1616653803 [genus] Pseudomonas
                   691281667 [species] Pseudomonas amygdali
    

If you need to map the Silver TaxIDs to the created ones.

$ echo -e "129138\tBacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali" \
    | sed 's/;/\t/g' \
    | taxonkit create-taxdump -R "superkingdom,phylum,class,order,famil,genus,species" -O taxdump2 -A 1
    
$ tree taxdump2/
taxdump2/
├── delnodes.dmp
├── merged.dmp
├── names.dmp
├── nodes.dmp
└── taxid.map

$ cat taxdump2/taxid.map
129138  691281667

Great !!!
I have generated the database, can i use now with taxonkit -lca ??

Sure, please follow the usage and examples.

I have the following problem.
I have added a new database using the workflow above and generated the Silva to TaxIDs.

image

For a particular id (1824050977) i don´t have a corresponding mapping (see shot). However it is in names.dmp.

The taxid map was generated the above post with option "-A 1"

Why it is missing ?

It just maps custom IDs to TaxIDs of the taxa of the lowest rank, e.g. species.

If you need to map to taxa of other ranks, like the genus Staphylococcus. Here's the way, csvtk is needed.

cd taxdmp;

cat taxid.map \
    | taxonkit lineage --data-dir . -i 2 -t -d , \
    | cut -f 1,4 \
    | csvtk unfold -Ht -f 2 -s ,
    
129138  609216830
129138  1641076285
129138  329474883
129138  86398254
129138  1478401337
129138  1616653803
129138  691281667

Great,
One more comment. When using all ranks i will have a problem to get a proper taxonkit lca assignment because by default the upper levels will be picked up. That mean for a particular list of ids all with get a Bacteria assigment. Don´t know if that can be fixed ?

In this case for example the 609216830 that is a Streptoccus is matched to Bacteria. I would expect an assignation to the genus family/genus level at least.

image

So, mapping only IDs to taxa of species rank is reasonable.

Back to the previous concern, why did you want map IDs to the genus Staphylococcus?

For a particular id (1824050977) i don´t have a corresponding mapping (see shot)

Mapping to species is enough, cause the complete lineage of the species can be retrieved with the taxid of the species.

Could you please show how you query the LCA? What were the TaxIDs used? What's the direct purpose?

Yes sure,
What i´m trying to achieve is to annotate with Silva a set of nanopore reads.
For that purpose i´m mapping one metagenomic sample reads to the Silva 16S database with minimap allowing for error tolerance as nanopore reads tend to give an error rate. For each read mapped to silva i get the first 10 hits.

So for each read i use LCA to assign the upper common hit from the 10 hits, so for one read i get one hit.

Once i get that hit i would like to resolve the taxonomy (phylum,class...order.....species)

For some reads the resolution (because of the error rate) cannot be achieved at the species level, that´s why in some cases(like the Staphyloccocus case above) the best hit (output from LCA) is a genus. But could be family or order or class or phyla.

Does this makes sense?

So for each read i use LCA to assign the upper common hit from the 10 hits, so for one read i get one hit.

Good, here you've assigned LCA to each read.

Once i get that hit i would like to resolve the taxonomy (phylum,class...order.....species)

No, you can just use taxonkit lineage or taxonkit reformat -I to retrieve lineage via the LCAs, no matter what the rank they are. There's no need to query with the taxid.map file.

Thanks for the update, there is one strange behaviour for lca. I was expecting LCA to go up to the staphylococcus genus level but it went up to bacilli. I assume something is wrong with my database.

image

Here is the pipeline to build the database:

wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.1/Exports/SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz

gunzip  SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz 
infile="SILVA_138.1_SSURef_NR99_tax_silva.fasta"

#Remove unwanted taxa
zcat  ${infile} | grep ">" |grep  -v 'Archaea\Eukaryota\|unidentified\|metagenomes\|environmental samples' >tab_silva_taxonomy

sed -i 's/>//g' tab_silva_taxonomy #replace ">"
sed -i 's/ /@@/' tab_silva_taxonomy # replace first space with @@
sed -i 's/ /_/g' tab_silva_taxonomy # replace all spaces with "_" mainly on species names
sed -i 's/@@/'$'\t''/' tab_silva_taxonomy # restore tab after accession
sed -i 's/;/'$'\t''/g' tab_silva_taxonomy # replace commas with semicolons
awk 'NF>7' tab_silva_taxonomy | cut -f 1-8 > tmp
mv tmp tab_silva_taxonomy
taxonkit create-taxdump tab_silva_taxonomy -O taxdump --force -RR "superkingdom,phylum,class,order,family,genus,species" -A 1

Don´t what is wrong...

$ echo 883645126 1202746109 883645126 1956460793 883645126 1956460793 883645126 485431882 \
    | sed -E 's/\s+/\n/g' \
    | taxonkit lineage  --data-dir taxdump/ -t \
    | cut -f 3
609216830;1494978361;1845768359;813944714;671290804;690796498;883645126
609216830;1494978361;1845768359;813944714;95949142;368069282;1202746109
609216830;1494978361;1845768359;813944714;671290804;690796498;883645126
609216830;1494978361;1845768359;422354816;1997712377;1824050977;1956460793
609216830;1494978361;1845768359;813944714;671290804;690796498;883645126
609216830;1494978361;1845768359;422354816;1997712377;1824050977;1956460793
609216830;1494978361;1845768359;813944714;671290804;690796498;883645126
609216830;1494978361;1845768359;422354816;1997712377;1824050977;485431882

$ echo 883645126 1202746109 883645126 1956460793 883645126 1956460793 883645126 485431882 \
    | sed -E 's/\s+/\n/g' \
    | taxonkit lineage  --data-dir taxdump/ -t \
    | cut -f 2
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus_pneumoniae
Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;Sporosarcina;Staphylococcus_saprophyticus
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus_pneumoniae
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus_sciuri
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus_pneumoniae
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus_sciuri
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus_pneumoniae
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus_sp._mixed_culture_J3-

$ echo 883645126 1202746109 883645126 1956460793 883645126 1956460793 883645126 485431882 \
    | sed -E 's/\s+/\n/g' \
    | sort | uniq  \
    | axonkit reformat --data-dir taxdump/ -I 1 -f '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}' \
    | csvtk add-header -t -n "taxid,kingdom,phylum,class,order,family,genus,species"  \
    | csvtk pretty -t
taxid        kingdom    phylum       class     order              family              genus            species
----------   --------   ----------   -------   ----------------   -----------------   --------------   -------------------------------------
1202746109   Bacteria   Firmicutes   Bacilli   Bacillales         Planococcaceae      Sporosarcina     Staphylococcus_saprophyticus
1956460793   Bacteria   Firmicutes   Bacilli   Staphylococcales   Staphylococcaceae   Staphylococcus   Staphylococcus_sciuri
485431882    Bacteria   Firmicutes   Bacilli   Staphylococcales   Staphylococcaceae   Staphylococcus   Staphylococcus_sp._mixed_culture_J3-3
883645126    Bacteria   Firmicutes   Bacilli   Bacillales         Bacillaceae         Bacillus         Streptococcus_pneumoniae

The LCA seems to be 1845768359 (Bacilli), without a doubt.

BTW, the process could be simplified.

    seqkit seq -n SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz \
        | grep -v 'Archaea\|Eukaryota\|unidentified\|metagenomes\|environmental samples' \
        | sed 's/ /\t/' \
        | sed 's/;/\t/g' \
        | awk -F '\t' 'NF > 7' \
        > silva_taxonomy.tsv

    taxonkit create-taxdump silva_taxonomy.tsv -O taxdump \
        --force -R "superkingdom,phylum,class,order,family,genus,species" -A 1