Silva database?

davidvilanova opened this issue 2 years ago · comments

davidvilanova commented 2 years ago

Hi,
Thanks for the gtdb-taxdump. I'm working with SILVA; is there a taxdump available for it?

Wei Shen · Answer 1 · Sat Nov 26 2022 10:50:19 GMT+0800 (China Standard Time)

There is no existing one, but you can create one yourself by following the example, using the binaries here.

davidvilanova · Answer 2 · Mon Nov 28 2022 16:55:44 GMT+0800 (China Standard Time)

Great,

I have tried it; see the screenshot below.

The taxonomy file has two columns: the first column is the accession and the second column is the taxonomy, as follows:

129138 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali

What is wrong? There are two columns in the taxonomy file.

Wei Shen · Answer 3 · Mon Nov 28 2022 17:24:27 GMT+0800 (China Standard Time)

Taxa should be tab-separated, as shown in Example 4.

Prepare lineages

 $ echo -e "129138\tBacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali" \
     | sed 's/;/\t/g' \
     | cut -f 2-8
 Bacteria        Proteobacteria  Gammaproteobacteria     Pseudomonadales Pseudomonadaceae        Pseudomonas     Pseudomonas amygdali

Create taxdump

 $ echo -e "129138\tBacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali" \
     | sed 's/;/\t/g' \
     | cut -f 2-8 \
     | taxonkit create-taxdump -R "superkingdom,phylum,class,order,famil,genus,species" -O taxdump

Test

 $ taxonkit list -n -r --ids 1 --data-dir taxdump/
 1 [no rank] root
   609216830 [superkingdom] Bacteria
     1641076285 [phylum] Proteobacteria
       329474883 [class] Gammaproteobacteria
         86398254 [order] Pseudomonadales
           1478401337 [famil] Pseudomonadaceae
             1616653803 [genus] Pseudomonas
               691281667 [species] Pseudomonas amygdali
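
Before running create-taxdump, it can help to confirm that the lineage file really is tab-separated. The following is a minimal sketch, not part of taxonkit: `check_tsv_fields` is a hypothetical helper name, and the expected count of 8 assumes one accession column plus the seven ranks used above.

```shell
# Hypothetical helper (not a taxonkit command): read lines on stdin and
# report every line whose number of tab-separated fields differs from $1.
# Returns non-zero if any line is off.
check_tsv_fields() {
    awk -F '\t' -v n="$1" '
        NF != n { printf "line %d: %d field(s), expected %d\n", NR, NF, n; bad = 1 }
        END { exit bad + 0 }
    '
}

# Accession + 7 ranks = 8 fields once the semicolons are turned into tabs.
printf '129138\tBacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali\n' \
    | tr ';' '\t' \
    | check_tsv_fields 8 && echo "OK: every line has 8 tab-separated fields"
```

A file whose accession and taxonomy are separated by a space instead of a tab would be reported line by line, which is the situation described in the question above.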

Wei Shen · Answer 4 · Mon Nov 28 2022 17:28:40 GMT+0800 (China Standard Time)

If you need to map the SILVA IDs to the created TaxIDs:

$ echo -e "129138\tBacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali" \
    | sed 's/;/\t/g' \
    | taxonkit create-taxdump -R "superkingdom,phylum,class,order,famil,genus,species" -O taxdump2 -A 1
    
$ tree taxdump2/
taxdump2/
├── delnodes.dmp
├── merged.dmp
├── names.dmp
├── nodes.dmp
└── taxid.map

$ cat taxdump2/taxid.map
129138  691281667
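
Looking up a single custom ID in the resulting map can be sketched as below. This assumes taxid.map is a two-column, tab-separated file (custom ID, created TaxID); `lookup_taxid` is a hypothetical helper name, not a taxonkit subcommand.

```shell
# Hypothetical helper: read a two-column, tab-separated taxid.map on stdin
# and print the created TaxID(s) recorded for the custom ID given as $1.
lookup_taxid() {
    awk -F '\t' -v id="$1" '$1 == id { print $2 }'
}

# Using the single mapping shown above.
printf '129138\t691281667\n' | lookup_taxid 129138
# prints 691281667
```

With a real dump this would be `lookup_taxid 129138 < taxdump2/taxid.map`.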

davidvilanova · Answer 5 · Mon Nov 28 2022 23:39:56 GMT+0800 (China Standard Time)

Great !!!
I have generated the database. Can I use it now with taxonkit lca?

Wei Shen · Answer 6 · Mon Nov 28 2022 23:45:49 GMT+0800 (China Standard Time)

Sure, please follow the usage and examples.

davidvilanova · Answer 7 · Thu Dec 01 2022 17:51:23 GMT+0800 (China Standard Time)

I have the following problem.
I have added a new database using the workflow above and generated the SILVA-to-TaxID mapping.

For one particular ID (1824050977) I don't have a corresponding mapping (see screenshot), although it is present in names.dmp.

The taxid.map was generated as in the post above, with the option "-A 1".

Why is it missing?

Wei Shen · Answer 8 · Thu Dec 01 2022 18:08:29 GMT+0800 (China Standard Time)

It only maps custom IDs to the TaxIDs of taxa of the lowest rank, e.g. species.

If you need to map to taxa of other ranks, like the genus Staphylococcus, here's a way to do it (csvtk is needed):

cd taxdmp;

cat taxid.map \
    | taxonkit lineage --data-dir . -i 2 -t -d , \
    | cut -f 1,4 \
    | csvtk unfold -Ht -f 2 -s ,
    
129138  609216830
129138  1641076285
129138  329474883
129138  86398254
129138  1478401337
129138  1616653803
129138  691281667

davidvilanova · Answer 9 · Thu Dec 01 2022 20:51:56 GMT+0800 (China Standard Time)

Great,
One more comment. When using all ranks, I will have a problem getting a proper taxonkit lca assignment, because by default the upper levels will be picked up. That means that, for a particular list of IDs, all of them will get a Bacteria assignment. I don't know if that can be fixed?

In this case, for example, 609216830, which is a Streptococcus, is matched to Bacteria. I would expect an assignment at the family/genus level at least.

Wei Shen · Answer 10 · Thu Dec 01 2022 22:30:19 GMT+0800 (China Standard Time)

So, mapping IDs only to taxa of species rank is reasonable.

Back to the previous concern: why did you want to map IDs to the genus Staphylococcus?

"For a particular id (1824050977) i don't have a corresponding mapping (see shot)"

Mapping to species is enough, because the complete lineage of the species can be retrieved with the TaxID of the species.

Could you please show how you query the LCA? What were the TaxIDs used? What's the direct purpose?

davidvilanova · Answer 11 · Thu Dec 01 2022 22:39:43 GMT+0800 (China Standard Time)

Yes, sure.
What I'm trying to achieve is to annotate a set of nanopore reads with SILVA.
For that purpose, I'm mapping the reads of one metagenomic sample to the SILVA 16S database with minimap, allowing for some error tolerance because nanopore reads tend to have a relatively high error rate. For each read mapped to SILVA, I keep the first 10 hits.

Then, for each read, I use LCA to assign the lowest common ancestor of the 10 hits, so I get one hit per read.

Once I have that hit, I would like to resolve the taxonomy (phylum, class, order, ..., species).

For some reads, species-level resolution cannot be achieved (because of the error rate). That's why in some cases (like the Staphylococcus case above) the best hit (the output from LCA) is a genus, but it could also be a family, order, class or phylum.

Does this make sense?

Wei Shen · Answer 12 · Thu Dec 01 2022 23:20:19 GMT+0800 (China Standard Time)

"So for each read i use LCA to assign the upper common hit from the 10 hits, so for one read i get one hit."

Good, here you've assigned an LCA to each read.

"Once i get that hit i would like to resolve the taxonomy (phylum,class...order.....species)"

No need for that: you can just use taxonkit lineage or taxonkit reformat -I to retrieve the lineage from the LCA TaxIDs, no matter what rank they are. There's no need to query the taxid.map file.

davidvilanova · Answer 13 · Sun Dec 04 2022 02:59:38 GMT+0800 (China Standard Time)

Thanks for the update. There is one strange behaviour with lca: I was expecting the LCA to go up to the genus Staphylococcus, but it went up to Bacilli. I assume something is wrong with my database.

Here is the pipeline to build the database:

wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.1/Exports/SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz

gunzip  SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz 
infile="SILVA_138.1_SSURef_NR99_tax_silva.fasta"

# Remove unwanted taxa (the file is already decompressed, so grep reads it directly)
grep ">" "${infile}" | grep -v 'Archaea\|Eukaryota\|unidentified\|metagenomes\|environmental samples' > tab_silva_taxonomy

sed -i 's/>//g' tab_silva_taxonomy #replace ">"
sed -i 's/ /@@/' tab_silva_taxonomy # replace first space with @@
sed -i 's/ /_/g' tab_silva_taxonomy # replace all spaces with "_" mainly on species names
sed -i 's/@@/'$'\t''/' tab_silva_taxonomy # restore tab after accession
sed -i 's/;/'$'\t''/g' tab_silva_taxonomy # replace semicolons with tabs
awk 'NF>7' tab_silva_taxonomy | cut -f 1-8 > tmp
mv tmp tab_silva_taxonomy
taxonkit create-taxdump tab_silva_taxonomy -O taxdump --force -R "superkingdom,phylum,class,order,family,genus,species" -A 1

I don't know what is wrong...

Wei Shen · Answer 14 · Sun Dec 04 2022 11:18:27 GMT+0800 (China Standard Time)

$ echo 883645126 1202746109 883645126 1956460793 883645126 1956460793 883645126 485431882 \
    | sed -E 's/\s+/\n/g' \
    | taxonkit lineage  --data-dir taxdump/ -t \
    | cut -f 3
609216830;1494978361;1845768359;813944714;671290804;690796498;883645126
609216830;1494978361;1845768359;813944714;95949142;368069282;1202746109
609216830;1494978361;1845768359;813944714;671290804;690796498;883645126
609216830;1494978361;1845768359;422354816;1997712377;1824050977;1956460793
609216830;1494978361;1845768359;813944714;671290804;690796498;883645126
609216830;1494978361;1845768359;422354816;1997712377;1824050977;1956460793
609216830;1494978361;1845768359;813944714;671290804;690796498;883645126
609216830;1494978361;1845768359;422354816;1997712377;1824050977;485431882

$ echo 883645126 1202746109 883645126 1956460793 883645126 1956460793 883645126 485431882 \
    | sed -E 's/\s+/\n/g' \
    | taxonkit lineage  --data-dir taxdump/ -t \
    | cut -f 2
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus_pneumoniae
Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;Sporosarcina;Staphylococcus_saprophyticus
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus_pneumoniae
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus_sciuri
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus_pneumoniae
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus_sciuri
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus_pneumoniae
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus_sp._mixed_culture_J3-

$ echo 883645126 1202746109 883645126 1956460793 883645126 1956460793 883645126 485431882 \
    | sed -E 's/\s+/\n/g' \
    | sort | uniq  \
    | taxonkit reformat --data-dir taxdump/ -I 1 -f '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}' \
    | csvtk add-header -t -n "taxid,kingdom,phylum,class,order,family,genus,species"  \
    | csvtk pretty -t
taxid        kingdom    phylum       class     order              family              genus            species
----------   --------   ----------   -------   ----------------   -----------------   --------------   -------------------------------------
1202746109   Bacteria   Firmicutes   Bacilli   Bacillales         Planococcaceae      Sporosarcina     Staphylococcus_saprophyticus
1956460793   Bacteria   Firmicutes   Bacilli   Staphylococcales   Staphylococcaceae   Staphylococcus   Staphylococcus_sciuri
485431882    Bacteria   Firmicutes   Bacilli   Staphylococcales   Staphylococcaceae   Staphylococcus   Staphylococcus_sp._mixed_culture_J3-3
883645126    Bacteria   Firmicutes   Bacilli   Bacillales         Bacillaceae         Bacillus         Streptococcus_pneumoniae

The LCA seems to be 1845768359 (Bacilli), without a doubt.
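
To make the reasoning explicit: with the lineage TaxIDs printed above, the LCA is the last TaxID that all lineages share when read from the root. The sketch below only illustrates that idea and is not how taxonkit lca is implemented. `lca_from_lineages` is a hypothetical helper, and it assumes every lineage lists the same ranks in the same order, as in a dump created with a fixed rank list.

```shell
# Hypothetical helper: read semicolon-separated TaxID lineages (root-most
# first) on stdin and print the last TaxID shared by all of them.
lca_from_lineages() {
    awk -F ';' '
        NR == 1 { n = NF; for (i = 1; i <= NF; i++) common[i] = $i; next }
        {
            # Shrink the shared prefix to the positions that still agree.
            for (i = 1; i <= n && i <= NF; i++) if ($i != common[i]) break
            n = i - 1
        }
        END { if (n > 0) print common[n] }
    '
}

# The four distinct lineages from the output above.
printf '%s\n' \
    '609216830;1494978361;1845768359;813944714;671290804;690796498;883645126' \
    '609216830;1494978361;1845768359;813944714;95949142;368069282;1202746109' \
    '609216830;1494978361;1845768359;422354816;1997712377;1824050977;1956460793' \
    '609216830;1494978361;1845768359;422354816;1997712377;1824050977;485431882' \
    | lca_from_lineages
# prints 1845768359
```

The lineages diverge at the order level (813944714 vs 422354816), so the shared prefix ends at the class, 1845768359.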

BTW, the process could be simplified.

    seqkit seq -n SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz \
        | grep -v 'Archaea\|Eukaryota\|unidentified\|metagenomes\|environmental samples' \
        | sed 's/ /\t/' \
        | sed 's/;/\t/g' \
        | awk -F '\t' 'NF > 7' \
        > silva_taxonomy.tsv

    taxonkit create-taxdump silva_taxonomy.tsv -O taxdump \
        --force -R "superkingdom,phylum,class,order,family,genus,species" -A 1