marbl / Krona

Interactively explore metagenomes and more from a web browser.

Home Page: https://github.com/marbl/Krona/wiki

Gi less taxonomic identifiers

LoreVE opened this issue

I have been in contact with NCBI because some (in fact, a lot of) proteins were missing from the prot.accession2taxid file. As a result, they had no corresponding taxID, and KronaTools could not find them in the all.accession2taxid.sorted file (because prot.accession2taxid was incomplete).

The missing accessions in the prot.accession2taxid file were caused by NCBI's switch to gi-less records. The missing proteins are those that have accession numbers but no gi identifiers. However, NCBI's internal processing for the prot.accession2taxid file depends on the gi identifiers, hence the missing entries.
(I am guessing this might be causing issue #143 too...)

The NCBI developers have been working on processing changes that include gi-less accessions, and last week they made a version available that they would like me to test (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.20201013_0121.gz).

Unfortunately, I cannot seem to get the accessions database into the correct format. I downloaded all the necessary files, replaced the old prot.accession2taxid file with the new one (prot.accession2taxid.20201013_0121), and then ran updateAccessions.sh --only-build. Grepping for the protein accessions in the all.accession2taxid.sorted file works, but running ktClassifyBLAST, ktImportBLAST, ... returns root for all entries and reports that I should update the database because accessions were not found. Would you be able to help me (especially since this new format will replace the old one soon)?
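For anyone who wants to go beyond grep, here is a minimal Python sketch of the same check: it reports which query accessions have a taxID in a tab-separated accession table. The layout (accession in column 1, taxID in column 2) and the stripping of the ".N" version suffix are assumptions for illustration, not a documented KronaTools format, and the accessions in the demo are made up.

```python
import io

def strip_version(accession):
    """Drop a trailing '.N' version suffix, e.g. 'WP_000000001.1' -> 'WP_000000001'."""
    base, dot, version = accession.rpartition(".")
    return base if dot and version.isdigit() else accession

def find_taxids(table, accessions):
    """Return {accession: taxid or None} for each queried accession.

    `table` is any iterable of lines. The assumed layout is
    'accession<TAB>taxid' (an assumption, not a documented format).
    """
    wanted = {strip_version(a): a for a in accessions}
    found = {a: None for a in accessions}
    for line in table:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        key = strip_version(fields[0])
        if key in wanted:
            found[wanted[key]] = fields[1]
    return found

# Made-up table and queries; replace with open("all.accession2taxid.sorted")
# and real accessions from the BLAST output.
demo = io.StringIO("AAA00001\t9606\nWP_000000001\t562\n")
result = find_taxids(demo, ["WP_000000001.1", "XYZ99999.2"])
print(result)
```

Accessions that map to None here are the ones a lookup would fail on, which can help separate a format problem from genuinely missing entries.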

Thanks in advance!

Seems like @LoreVE is on to the root of the issue.
Any idea if this can be easily resolved?

I've pushed an update that will ignore the unknown accessions instead of treating them as root. Note that this could cause overconfident classifications in some cases, but it seems like it is necessary for the current state of the NCBI taxonomy databases. The old behavior can be restored with -f. This is available now by cloning from master, and will be in a release once it is tested a little more.
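To illustrate the trade-off, here is a toy Python sketch, assuming a lowest-common-ancestor style classification over the hits of one query. It is not the KronaTools code (which is written in Perl); the taxonomy, the taxIDs and the `classify` helper are all hypothetical.

```python
# Toy taxonomy: child -> parent; 1 is the root (all IDs are illustrative).
PARENT = {1: 1, 2: 1, 10: 2, 11: 2, 20: 1}

def lineage(taxid):
    """Return the path from the root down to `taxid`."""
    path = [taxid]
    while PARENT[path[-1]] != path[-1]:
        path.append(PARENT[path[-1]])
    return path[::-1]  # root first

def lca(taxids):
    """Lowest common ancestor of a non-empty list of taxIDs."""
    paths = [lineage(t) for t in taxids]
    common = 1
    for level in zip(*paths):
        if len(set(level)) != 1:
            break
        common = level[0]
    return common

def classify(hit_taxids, unknown_as_root):
    """hit_taxids holds one taxID per hit, or None when the accession is unknown."""
    if unknown_as_root:
        resolved = [1 if t is None else t for t in hit_taxids]
    else:
        resolved = [t for t in hit_taxids if t is not None]
    return lca(resolved) if resolved else 1

hits = [10, 11, None]                          # two known hits, one unknown accession
print(classify(hits, unknown_as_root=True))    # one unknown hit drags the result to root
print(classify(hits, unknown_as_root=False))   # unknown hit ignored, LCA of the rest
print(classify([10, None], unknown_as_root=False))  # possibly overconfident
```

In the last call the single known hit decides the classification alone, which is the kind of overconfidence mentioned above.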


@LoreVE, thank you for all your efforts! I am experiencing the same issue: a lot of sequences end up in the root.
Hopefully a solution will be found soon to make NCBI databases and KronaTools work nicely together again.

When running DIAMOND against the NR database and then feeding the output to Krona, I noticed a lot of unclassified hits. Digging a little into the issue, I found that NCBI releases two versions of the prot.accession2taxid list:

ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz

and

ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz

Most of the unclassified accessions (probably all, though I didn't check every one) are present in the prot.accession2taxid.FULL table but missing from the prot.accession2taxid file. Unfortunately, the two files do not use the same format, so it is not possible to hack a solution by just renaming prot.accession2taxid.FULL.gz to prot.accession2taxid.gz and then running:

./updateAccessions.sh --only-build --preserve
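To make the format difference concrete: per the NCBI README, prot.accession2taxid has four tab-separated columns (accession, accession.version, taxid, gi), while the FULL file has only two (accession.version, taxid). The Python sketch below rewrites two-column rows into the four-column layout. The `full_to_legacy` helper and the `0` placeholder for the missing gi are assumptions for illustration, and whether updateAccessions.sh accepts such a file is untested.

```python
import io

LEGACY_HEADER = "accession\taccession.version\ttaxid\tgi"

def full_to_legacy(lines):
    """Yield four-column rows from two-column 'accession.version<TAB>taxid' rows."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2 or fields[0] == "accession.version":
            continue  # skip the header and malformed rows
        versioned, taxid = fields
        accession = versioned.rsplit(".", 1)[0]  # drop the version suffix
        yield f"{accession}\t{versioned}\t{taxid}\t0"  # gi placeholder (assumption)

# Made-up input row; for real data, read the file with gzip.open(path, "rt").
demo = io.StringIO("accession.version\ttaxid\nWP_000000001.1\t562\n")
rows = list(full_to_legacy(demo))
print(LEGACY_HEADER)
print("\n".join(rows))
```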

@hmontenegro the hack you propose should work with the latest release (v2.8). If NCBI makes the "FULL" file the default in the future, this release should transition to the new format seamlessly. Thanks to @LoreVE for help with this.

@ondovb and @LoreVE Great, thanks!

DIAMOND also released today a version with support for prot.accession2taxid.FULL.gz taxonomy mapping, so both ends are covered now.


I am using Kraken2 output to create a Krona chart and am having trouble with a lot of unassigned taxonomy IDs, as shown in the warning below:

[ WARNING ] The following taxonomy IDs were not found in the local database and were set to root (if they were recently added to NCBI, use updateTaxonomy.sh to update the local database): 3138 26344
545612 25130 26808 2884 47987 24452 2669 25698 44379 1739 6516 12656 24451 11347 26358 24708 26462 24243 26524 24829 25142 2668 24342 24634 24489 47226 13042 26736 2073 24434 14349 11353
42989 26982 26361 5510 8775 24341 26859 26550 24366 12883 25574 46522 45605 24572 24367 2328 42982 1592 26418 767 25755 27062 2539 43102 25006 902 26357 14410 24783 14375 14930 24561 1089
25539 24696 12925 24447 26397 14685 24238 24454 26936 44858 1554 11474 24658 26002 26151 4 14630 295 26099 284 26767 25760 25688 24734 24618 26098 805 14189 2827 26922 36 26916 45274
25567 27258 13292 24819 6555 26684 25910 25803 26409 1134 25854 26287 24990 26903 11539 25787 402 26532 43556 24567 25863 25747 2710 1065 46463 25811 13222 14193 24383 392 1660064 695
2169539 24473 25576 24438 1342 25865 24344 14100 5145 26988 1043 25746 1738 792 606 2218 25031 11475 42922 9921 8925 11409 26605 2847 13155 25884 11502 24435 43344 12909 1826873 43141
1674 744 13300 24980 2022 14692 119065 25555 45138 13263 25551 44540 2027 26368 874 25759 12634 44863 2032 1849491 26935 25769 1981981 1899 25862 45997 26096 47380 13367 2844 25798 26382
514 45674 26484 11424 44855 24881 1438 1987 24412 25543 25780 2324 25015 26338 26198 13869 46480 14196 26891 1761 47487 45532 25557 43527 44856 25785 25948 2444 24633 1470 711 13374 12884
./updateTaxonomy.sh is not working.
Is there any solution?

Thanks