This is the repository of the database curation pipeline for ITS2 (meta-)barcoding of vascular plants. The corresponding article is available as a preprint while under review: https://www.biorxiv.org/content/10.1101/2023.06.12.544582v1
These functions are intended for use with databases whose taxonomy is stored in a specific format used by classifiers. Such databases can be created using BCdatabaser: https://github.com/molbiodiv/bcdatabaser
Documentation: https://molbiodiv.github.io/bcdatabaser/
And particularly the syntax information: https://molbiodiv.github.io/bcdatabaser/output.html
Software dependencies are declared in bin/externals.txt. These are:
- Seqfilter: https://github.com/BioInf-Wuerzburg/SeqFilter
- NCBI eUtils Command line Tools: https://www.ncbi.nlm.nih.gov/books/NBK179288/
- VSEARCH: https://github.com/torognes/vsearch
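Before running the pipeline it can help to confirm that the dependencies are reachable from the shell. The following is a minimal sketch, not part of the repository; the executable names (SeqFilter, vsearch, and esearch/efetch for the NCBI tools) are assumptions and may differ on your installation.

```shell
# Sketch: report which required command-line tools are missing from PATH.
# Assumed executable names: SeqFilter, vsearch, esearch, efetch.

missing_tools() {
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the tool
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools SeqFilter vsearch esearch efetch)
if [ -n "$missing" ]; then
  printf 'Missing dependencies:\n%s\n' "$missing" >&2
else
  echo "All dependencies found."
fi
```

If a tool is reported as missing although it is installed, its directory is probably not on your PATH.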
This was tested under
- Mac OSX 11.0.1
- Ubuntu 20.04
- Fungal removal
- Non-target (non-ITS2) sequence removal
- Removing incomplete taxonomies
- Chlorophyta removal
- Iterative identification and removal of intraspecific outliers
(Details on these filters are provided in the article above)
This function performs the automated curation:
bash /bin/_curation.sh YOUR.DB.NAME.fa
- taxonomy corrections
- sequence removal
- Place a .txt file, formatted as in the examples, into the folder corrections. The format is:
NCBI-Accession;Wrong_ScientificName;Corrected_ScientificName;Your_Name
Multiple separate files can be used; all .txt files in that folder will be applied as corrections.
- Then call the function on your database
bash /bin/_correct_manuals.sh YOUR.DB.NAME.fa
This can take a while for large databases.
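Because the corrections step can run for a long time, it may be worth checking the correction files first. The sketch below is an illustration only: the helper name malformed_corrections and the example entry are hypothetical, and it checks only that each non-empty line has four semicolon-separated fields, as in the format above. Whether names are written with spaces or underscores should be taken from the repository's own examples.

```shell
# Sketch: write a placeholder corrections file and check its field count.
# The entry below is an invented placeholder, not a real record.
workdir=$(mktemp -d)
mkdir -p "$workdir/corrections"
cat > "$workdir/corrections/example.txt" <<'EOF'
XX000000;Wrong_name;Corrected_name;Your_Name
EOF

# Print every non-empty line that does not have exactly four ';'-separated fields.
malformed_corrections() {
  awk -F';' '$0 != "" && NF != 4 { printf "%s:%d: %s\n", FILENAME, FNR, $0 }' "$@"
}

# No output means every line has the expected number of fields.
malformed_corrections "$workdir/corrections/example.txt"
```

On a real checkout the same helper could be pointed at corrections/*.txt before calling the correction script.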
- adding taxonomy and appending to DB
- Place one or more .fasta files, formatted as in the examples, into the folder additions. The format is:
>Scientific_name
ACGT
Multiple separate files can be used; all .fasta files in that folder will be used for additions.
- Then call the function on your database
bash /bin/_add_manuals.sh YOUR.DB.NAME.fa
This can take a while for a large number of sequences.
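A quick sanity check of the addition files before the run might look like the following sketch. The helper name invalid_fasta_lines and the example record are hypothetical, and the check assumes that sequence lines contain only IUPAC nucleotide codes or gap characters; it does not verify that the header is a valid scientific name.

```shell
# Sketch: write a placeholder additions file and flag suspicious lines.
# The record below is an invented placeholder, not real sequence data.
workdir=$(mktemp -d)
mkdir -p "$workdir/additions"
cat > "$workdir/additions/example.fasta" <<'EOF'
>Genus_species
ACGTACGT
EOF

# Print empty headers and sequence lines with non-nucleotide characters.
invalid_fasta_lines() {
  awk '
    /^>/ {
      if ($0 == ">") printf "%s:%d: empty header\n", FILENAME, FNR
      next
    }
    $0 != "" && $0 !~ /^[ACGTURYSWKMBDHVNacgturyswkmbdhvn-]+$/ {
      printf "%s:%d: unexpected characters\n", FILENAME, FNR
    }
  ' "$@"
}

# No output means nothing suspicious was found.
invalid_fasta_lines "$workdir/additions/example.fasta"
```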
Subsetting your input database into a geographically restricted (local flora) database:
SeqFilter --ids_pattern LOCAL.FLORA.csv YOUR.DB.NAME.fa -o LOCAL.FLORA.DB.fa
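After subsetting, comparing the number of sequences in both files gives a rough check that the filter matched what you expected. This is a minimal sketch with a hypothetical helper count_seqs and invented demo files; on real data you would pass YOUR.DB.NAME.fa and LOCAL.FLORA.DB.fa instead.

```shell
# Sketch: count FASTA records (header lines starting with ">") in a file.
count_seqs() {
  awk '/^>/ { n++ } END { print n + 0 }' "$1"
}

# Invented demo files standing in for the full and the subsetted database.
demo=$(mktemp -d)
printf '>Genus_a\nACGT\n>Genus_b\nACGT\n' > "$demo/full.fa"
printf '>Genus_a\nACGT\n' > "$demo/local.fa"

echo "kept $(count_seqs "$demo/local.fa") of $(count_seqs "$demo/full.fa") sequences"
```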