shenwei356 / gtdb-taxdump

GTDB taxonomy taxdump files with trackable TaxIds

Question about tax ID generation

dgolden96 opened this issue

Hello! I'm having a little trouble understanding how the TaxIds are generated from taxonomic data in gtdb-taxdump. I don't have any prior experience with hashing, and I can't find documentation explaining how xxhash works. Could you expand a little on the steps for using it to generate TaxIds from taxonomy data? Thank you!

Hi Daniel, xxhash is one of many hash functions. A hash function converts text (for example, an NCBI assembly accession) or a byte array (the low-level storage form of text and numbers) into a 32-bit or 64-bit unsigned integer, called a hash or hash value. Here we hash the lower-cased name of each taxon node to a uint64 using xxhash and then convert it to a uint32.

A hash function is deterministic, which means the same text always returns the same hash value. So across GTDB versions, the TaxId of a taxon node stays the same as long as its name does not change.
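The hashing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code taxonkit runs: it uses the standard library's BLAKE2b with an 8-byte digest as a stand-in for xxhash so that it runs without extra packages, and it reduces the 64-bit value to 32 bits by keeping the low 32 bits, which is a guess since the exact conversion is not spelled out here. The function names `hash64` and `name_to_taxid` are hypothetical, and the numbers it produces will not match real gtdb-taxdump TaxIds.

```python
import hashlib


def hash64(text: str) -> int:
    """Hash text to an unsigned 64-bit integer.

    Stand-in for xxhash: BLAKE2b with an 8-byte digest, chosen only
    because it ships with the Python standard library.
    """
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def name_to_taxid(name: str) -> int:
    """Derive a 32-bit id from a taxon name.

    The name is lower-cased first, then hashed to 64 bits.
    Assumption: the 64-bit hash is reduced to 32 bits by keeping the
    low 32 bits; the real conversion may differ.
    """
    return hash64(name.lower()) & 0xFFFFFFFF


if __name__ == "__main__":
    # The same name gives the same id on every run, regardless of case.
    print(name_to_taxid("Escherichia coli"))
    print(name_to_taxid("escherichia COLI"))
```

Because the id depends only on the name, a taxon that keeps its name from one release to the next keeps its id, while a renamed taxon gets a new one.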

Note that different keys may get the same hash value; this is called a collision. The chance is very low, but taxonkit create-taxdump detects collisions and assigns a different value when one occurs.
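Collision handling can be sketched as follows. How taxonkit picks the replacement value is not stated here, so this example assumes the simplest strategy: step to the next integer until an unused id is found. As a second assumption, it uses BLAKE2b from the Python standard library in place of xxhash. The names `name_to_taxid` and `assign_taxids`, and the `bits` parameter, are hypothetical; `bits` exists only so the id space can be shrunk to make collisions easy to observe.

```python
import hashlib


def name_to_taxid(name: str, bits: int = 32) -> int:
    """Hash a lower-cased name and keep the low `bits` bits (stand-in for xxhash)."""
    digest = hashlib.blake2b(name.lower().encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def assign_taxids(names, bits: int = 32) -> dict:
    """Give every distinct name its own id, resolving collisions.

    Assumed strategy: on a collision, try the next integer (wrapping
    around) until a free id is found.
    """
    mask = (1 << bits) - 1
    taxid_of = {}  # lower-cased name -> id
    used = set()   # ids already handed out
    for name in names:
        key = name.lower()
        if key in taxid_of:
            continue  # same name seen before: reuse its id
        if len(used) > mask:
            raise ValueError("id space exhausted")
        taxid = name_to_taxid(key, bits)
        while taxid in used:  # collision: another name already owns this id
            taxid = (taxid + 1) & mask
        used.add(taxid)
        taxid_of[key] = taxid
    return taxid_of


if __name__ == "__main__":
    # With a 2-bit id space, four names must fill all four slots,
    # so any collisions are forced and resolved.
    print(assign_taxids(["Bacteria", "Archaea", "Firmicutes", "Bacillus"], bits=2))
```

With the full 32-bit space, the `while` loop almost never runs. One consequence of any reassignment scheme like this is that a reassigned id depends on the order in which names are processed.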