acorg/ncbi-taxonomy-database

Creating a database from NCBI taxonomy data

Here are brief instructions following the refactoring done in August 2019 to use accession numbers instead of GI numbers. The original notes are below.

Downloading the data files needed to build the taxonomy database

The data directory is expected to contain:

The *.dmp NCBI taxonomy files from the tarball here, and
The nucl_gb.accession2taxid.gz file from here

You can either get these by hand or else just run make download in the data directory.
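
Before starting a long build it can help to confirm that the inputs are in place. The following is a minimal sketch, not part of this repository: the function name `missing_inputs` is hypothetical, and because the exact set of `*.dmp` files is not listed above, it only checks that at least one is present.

```python
from pathlib import Path


def missing_inputs(data_dir):
    """Return the expected input files that are absent from data_dir."""
    data_dir = Path(data_dir)
    missing = []
    # Assumption: the exact *.dmp files needed are not spelled out, so this
    # only checks that at least one NCBI taxonomy dump file is present.
    if not any(data_dir.glob("*.dmp")):
        missing.append("*.dmp")
    if not (data_dir / "nucl_gb.accession2taxid.gz").is_file():
        missing.append("nucl_gb.accession2taxid.gz")
    return missing
```

An empty list from `missing_inputs("data")` suggests the directory is ready. Otherwise the returned names indicate what still needs downloading.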

Building the taxonomy database

Once you have the data files in place, you can just run make, which will create a (currently 17GB) Sqlite3 database file, taxonomy.db, that can be used with the AccessionLineageFetcher Python class in dark-matter, or with similar code that you write yourself.

You can also just run make xxx (where xxx is one of taxids, nodes, names or hosts) in case you just want to recreate one of the tables in the database.
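
If you plan to write your own code against the database, a quick way to see what the build produced is to ask SQLite for its table definitions. This is a minimal sketch using only Python's standard sqlite3 module. The function name `table_schemas` is hypothetical, and no column names are assumed because the schema is read from the file itself.

```python
import sqlite3
from pathlib import Path


def table_schemas(db_path):
    """Map each table name in the SQLite file to its CREATE TABLE statement."""
    # Open read-only via a URI so a mistyped path fails instead of silently
    # creating a new, empty database file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name, sql FROM sqlite_master "
            "WHERE type = 'table' ORDER BY name"
        )
        return dict(rows.fetchall())
    finally:
        conn.close()
```

Calling `table_schemas("taxonomy.db")` should then list whichever of the tables mentioned above are present, along with their columns.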

If you want to do something else, you're on your own for the time being! But the scripts in this directory, the Makefile in the data directory, and the dark-matter code will hopefully be instructive.
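
As one possible starting point for do-it-yourself use, here is a minimal sketch of reading the accession-to-taxid file directly. It is not taken from this repository. It assumes the file has a header line followed by four tab-separated columns (accession, accession.version, taxid, gi), which is how NCBI describes these files, so check that against your download. The function name `iter_accession_taxids` is hypothetical.

```python
import gzip


def iter_accession_taxids(path):
    """Yield (accession.version, taxid) pairs from an accession2taxid.gz file."""
    with gzip.open(path, "rt") as handle:
        next(handle, None)  # Skip the header line.
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Assumed column order: accession, accession.version, taxid, gi.
            yield fields[1], int(fields[2])
```

Because this is a generator, the file is streamed rather than loaded into memory, which matters for a file of this size.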

Original README text for sqlite and mysql

Here are scripts to help you create a database (mysql or sqlite) from some of the NCBI's taxonomy data. This can be used with the LineageFetcher Python class in dark-matter, or with similar code that you write yourself.

Nucleotides, proteins, or both?

There are two large files on the NCBI FTP site and you may not want them both. You'll need at least one of them. It all depends on the GI numbers you want to be able to look up taxonomy information for.

The download script and database creation scripts assume you want both files. If you don't, you can edit these scripts to remove the file you don't need.

To change download.sh just remove one of the file names from the line that says for file in gi_taxid_nucl.dmp gi_taxid_prot.dmp. To change the create scripts, delete the line that imports the data from the file you don't want (gi_taxid_nucl.dmp or gi_taxid_prot.dmp).

Downloading data

The download.sh script will download all the files you need from the NCBI FTP site. If you already have what's needed (at least one of gi_taxid_nucl.dmp.gz, gi_taxid_prot.dmp.gz, and taxdump.tar.gz) you can skip this step, though you will need to uncompress the first two and extract names.dmp and nodes.dmp from taxdump.tar.gz using e.g.,

$ gunzip gi_*.gz
$ tar xfz taxdump.tar.gz names.dmp nodes.dmp
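
If you would rather do this step from Python, for instance on a machine without gunzip and tar, the same two operations can be sketched with the standard library. The helper names `gunzip` and `extract_members` are hypothetical. Unlike the gunzip command, this version leaves the original .gz file in place.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def gunzip(path):
    """Decompress a .gz file next to itself and return the new path."""
    src = Path(path)
    dst = src.with_suffix("")  # gi_taxid_nucl.dmp.gz -> gi_taxid_nucl.dmp
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def extract_members(tarball, members=("names.dmp", "nodes.dmp"), dest="."):
    """Copy only the named regular files out of a gzipped tarball into dest."""
    dest = Path(dest)
    with tarfile.open(tarball, "r:gz") as tar:
        for name in members:
            source = tar.extractfile(name)  # KeyError if the member is absent
            if source is None:
                raise ValueError(f"{name} is not a regular file in {tarball}")
            with source, open(dest / name, "wb") as target:
                shutil.copyfileobj(source, target)
```

For example, `gunzip("gi_taxid_nucl.dmp.gz")` followed by `extract_members("taxdump.tar.gz")` would leave the uncompressed file and the two .dmp files in the current directory.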

Building the database

Adding all the data to the databases takes a lot of time, possibly some hours. (And yes, in case you're wondering, the scripts load the data from the files before adding indices to the database tables.) The input files are big (as of Sept 29, 2018):

$ du -s -h gi_taxid_*
11G     gi_taxid_nucl.dmp
8.9G    gi_taxid_prot.dmp
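
To illustrate the load-first, index-afterwards ordering mentioned above, here is a minimal sketch using Python's sqlite3 module. It is not the code in the create scripts. The table name `gi_taxid`, the index name, and the function name are hypothetical, and it assumes each input line holds a GI number and a taxonomy id separated by whitespace.

```python
import sqlite3


def load_gi_taxid(conn, lines, batch_size=100_000):
    """Bulk-load 'gi<TAB>taxid' lines, then build the index afterwards."""
    # Hypothetical table layout, for illustration only.
    conn.execute("CREATE TABLE gi_taxid (gi INTEGER, taxid INTEGER)")
    batch = []
    for line in lines:
        if not line.strip():
            continue  # Ignore blank lines.
        gi, taxid = line.split()
        batch.append((int(gi), int(taxid)))
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO gi_taxid VALUES (?, ?)", batch)
            batch.clear()
    if batch:
        conn.executemany("INSERT INTO gi_taxid VALUES (?, ?)", batch)
    # Creating the index after the load avoids updating it on every insert.
    conn.execute("CREATE INDEX gi_taxid_gi ON gi_taxid (gi)")
    conn.commit()
```

Passing an open file object as `lines` streams the input in batches, so memory use stays bounded even for multi-gigabyte files.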

Sqlite3

Run:

$ create-sqlite.sh ncbi-taxonomy-sqlite.db

to make the database file (or give your own database filename on the command line). On my machine this creates a 38GB database file.

Mysql