CNuge / data-alfie

labelled COI-5P barcode data used in training alfie

Alfie COI training data

These files contain publicly available DNA barcode sequences obtained from BOLD, which were used to train alfie, an alignment-free, kingdom-level taxonomic classifier.

The filename indicates the kingdom from which the barcode sequences are derived. Each file contains the following columns:

processid		- unique ID for relating sequence back to the BOLD database
clean_dna		- the DNA barcode sequence for the specimen
kingdom			- this and the following columns are the taxonomic classifications
phylum	
class	
order	
family	
subfamily	
genus 			
species			- note: this column may be sparsely populated, since not all BOLD records are identified to the species level
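
As a quick sanity check after extraction, the column layout above can be inspected from the shell. This is a minimal sketch that assumes the extracted files are tab-delimited with a header row (the delimiter is not stated here) and that missing species are stored as empty fields. The function names are illustrative.

```shell
# Print the column names, one per line.
# Assumes a tab-delimited file with a header row (an assumption, not confirmed).
show_columns() {
    head -n 1 "$1" | tr '\t' '\n'
}

# Count records with an empty species field.
# The species column is located by name in the header rather than by position.
count_missing_species() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "species") col = i; next }
        col && $col == "" { n++ }
        END { print n + 0 }
    ' "$1"
}
```

For example, `show_columns animal_coi_clean.tsv` would list the column names, where `animal_coi_clean.tsv` is a placeholder for whatever file the archive actually contains.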

Extract the data

The files are compressed in tar.bz2 format so that each fits under GitHub's 100 MB file size limit (not as good as middle-out compression, but it does the job).

Try double-clicking the compressed files to start decompression. If that doesn't work, the following commands decompress the data from the command line:

#animal
tar -jxvf animal_coi_clean.tar.bz2

#all at once (tar reads one archive per call, so loop over the files)
for f in *_coi_clean.tar.bz2; do tar -jxvf "$f"; done
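
Before unpacking, it can be useful to see what an archive contains, or to unpack it somewhere other than the current directory. The sketch below wraps two standard tar invocations. The function names and the target directory are illustrative choices, not part of the repository.

```shell
# List the contents of a bzip2-compressed archive without extracting it.
list_archive() {
    tar -jtf "$1"
}

# Extract an archive into a target directory, creating the directory if needed.
extract_to() {
    mkdir -p "$2"
    tar -jxf "$1" -C "$2"
}
```

For example, `list_archive animal_coi_clean.tar.bz2` prints the file names stored in the animal archive, and `extract_to animal_coi_clean.tar.bz2 data` unpacks it into a `data` directory.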
