NB! The master branch is used as the development branch. Please see Releases to download a specific version of the SH matching tool.
Developed as part of the EOSC-Nordic project (task 5.2.1: Cross-border data processing workflows), UNITE SH matching analysis is a digital service for global species discovery from eDNA (environmental DNA). The SH matching service is based on the UNITE datasets hosted in PlutoF. Its output includes information on which species are present in eDNA samples, whether they are potentially undescribed new species, where they have been found in other studies, whether they are alien or threatened species, etc. The output will provide DOI-based (Digital Object Identifier) stable identifiers for communicating the species found in eDNA. DOIs are connected to the taxonomic backbone of PlutoF and GBIF. In this way, every DOI is accompanied by a taxon name, which is still widely used for communicating species. In the case of undescribed species, DOIs will soon be issued by the PlutoF system (only if the SH matching service integrated with the PlutoF platform is used for the analysis). The SH matching service covers all Eukaryota by using rDNA ITS marker sequences accompanied by sample metadata.
The script expects input files in FASTA format. Outdata files are described in sh_matching_analysis/readme.txt.
- Singularity - install Singularity (tested with version 3.5) and obtain API key for remote build
- Create Singularity Image File (SIF)

    sudo singularity build sh_matching.sif sh_matching.def
- OPTIONAL: Copy SIF to HPC

    scp sh_matching.sif example_hpc_user@example.com:
- Create input, output and working data directories

    mkdir userdir
    mkdir indata
    mkdir outdata
- Download FASTA dbs (https://app.plutof.ut.ee/filerepository/view/6884701) and create UDB formatted dbs

    wget https://s3.hpc.ut.ee/plutof-public/original/d3d8b3de-83af-4fb5-b82b-359f7b730f84.zip
    mv d3d8b3de-83af-4fb5-b82b-359f7b730f84.zip sh_matching_data_udb_0_5.zip
    unzip sh_matching_data_udb_0_5.zip
    rm sh_matching_data_udb_0_5.zip
    cd data_udb/
    vsearch --makeudb_usearch sanger_refs_sh.fasta --output sanger_refs_sh.udb
    rm sanger_refs_sh.fasta
    vsearch --makeudb_usearch sanger_refs_sh_full.fasta --output sanger_refs_sh_full.udb
    rm sanger_refs_sh_full.fasta
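
The two vsearch conversions above follow the same pattern, so they could be generated rather than typed. The following is only a sketch: `print_udb_cmds` is a hypothetical helper, not part of the SH matching tool. It prints one `vsearch --makeudb_usearch` command per `.fasta` file in a directory without running anything or deleting the FASTA files, and it assumes file names without spaces.

```shell
# Hypothetical helper, not part of the SH matching tool.
# Prints one vsearch command per FASTA file found in the given directory
# (default: data_udb). Nothing is executed and no files are removed.
print_udb_cmds() {
    local dir="${1:-data_udb}"
    local fasta
    for fasta in "$dir"/*.fasta; do
        # Skip the unexpanded pattern when the directory has no .fasta files.
        [ -e "$fasta" ] || continue
        echo "vsearch --makeudb_usearch $fasta --output ${fasta%.fasta}.udb"
    done
}
```

Review the printed commands before running them, for example by redirecting the output to a file and inspecting it first.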
NB! The script expects input files in FASTA format, named source_[run_id] and placed in the indata/ directory. Outdata files are described in sh_matching_analysis/readme.txt.
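
Before starting a run, it can help to confirm that the input file follows this naming convention. A minimal sketch: `check_input` is a hypothetical helper, not part of the SH matching tool, and its FASTA test is only a heuristic (the first non-empty line must start with `>`).

```shell
# Hypothetical helper, not part of the SH matching tool.
# Usage: check_input <run_id> [indata_dir]
check_input() {
    local run_id="$1"
    local indata_dir="${2:-indata}"

    if [ -z "$run_id" ]; then
        echo "usage: check_input <run_id> [indata_dir]" >&2
        return 2
    fi

    # Expected location and name: indata/source_<run_id>
    local infile="${indata_dir}/source_${run_id}"
    if [ ! -s "$infile" ]; then
        echo "missing or empty input file: $infile" >&2
        return 1
    fi

    # Heuristic only: a FASTA file starts with a header line beginning with '>'.
    local first_line
    first_line="$(grep -m 1 -v '^[[:space:]]*$' "$infile")"
    case "$first_line" in
        '>'*) echo "OK: $infile" ;;
        *) echo "not FASTA (no '>' header): $infile" >&2; return 1 ;;
    esac
}
```

For the example run, `check_input 11` looks for `indata/source_11` relative to the current directory.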
- Run the pipeline using SIF. The example below passes the following arguments, in this order:
  - run_id=11
  - region=itsfull[default]|its2
  - itsx_step=yes[default]|no - flag indicating whether to include the ITSx step in the analysis (default, "yes")
  - remove_userdir=yes[default]|no - flag indicating whether to delete the user directory upon pipeline completion (default, "yes")
  - include_vsearch_step=yes|no[default] - flag indicating whether to include the vsearch substring dereplication step (default, "no")
  - conduct_usearch_05_step=yes|no[default] - flag indicating whether to conduct the usearch complete-linkage clustering at 0.5% dissimilarity (default, "no")

    ./sh_matching.sif /sh_matching/run_pipeline.sh 11 itsfull yes yes no no
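
The six positional arguments are easy to mix up, so a small wrapper can validate them before anything is launched. This is a sketch under stated assumptions: `build_pipeline_cmd` is a hypothetical name, the argument order is taken from the example command above, and the function only prints the command instead of running it. It always passes all six arguments explicitly, because whether `run_pipeline.sh` accepts fewer is not stated here.

```shell
# Hypothetical helper, not part of the SH matching tool.
# Validates the arguments, fills in the documented defaults,
# and prints the resulting pipeline command without executing it.
build_pipeline_cmd() {
    local run_id="$1"
    local region="${2:-itsfull}"
    local itsx_step="${3:-yes}"
    local remove_userdir="${4:-yes}"
    local include_vsearch_step="${5:-no}"
    local conduct_usearch_05_step="${6:-no}"

    if [ -z "$run_id" ]; then
        echo "run_id is required" >&2
        return 2
    fi
    case "$region" in
        itsfull|its2) ;;
        *) echo "region must be itsfull or its2, got: $region" >&2; return 2 ;;
    esac
    local flag
    for flag in "$itsx_step" "$remove_userdir" \
                "$include_vsearch_step" "$conduct_usearch_05_step"; do
        case "$flag" in
            yes|no) ;;
            *) echo "flags must be yes or no, got: $flag" >&2; return 2 ;;
        esac
    done
    echo "./sh_matching.sif /sh_matching/run_pipeline.sh $run_id $region $itsx_step $remove_userdir $include_vsearch_step $conduct_usearch_05_step"
}
```

Calling `build_pipeline_cmd 11` prints the same command as the example above, while an unsupported region such as `its1` is rejected with an error message and a non-zero exit status.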
When using this resource, please cite as:
Abarenkov K, Kõljalg U, Nilsson RH (2022) UNITE Species Hypotheses Matching Analysis. Biodiversity Information Science and Standards 6: e93856. https://doi.org/10.3897/biss.6.93856
The work is supported by EOSC-Nordic and the Estonian Research Council grant (PRG1170).