SH MATCHING analysis tool

NB! The master branch is used as the development branch. Please check the Releases page to download a specific version of the SH matching tool.

Developed as part of the EOSC-Nordic project (task 5.2.1: Cross-border data processing workflows), the UNITE SH matching analysis is a digital service for global species discovery from eDNA (environmental DNA). The service is based on the UNITE datasets hosted in PlutoF. Its output reports which species are present in eDNA samples, whether they are potentially undescribed new species, where they have been found in other studies, whether they are alien or threatened species, etc. The output provides stable, DOI-based (Digital Object Identifier) identifiers for communicating the species found in eDNA. DOIs are connected to the taxonomic backbone of PlutoF and GBIF, so every DOI is accompanied by a taxon name, which is still widely used for communicating species. For undescribed species, DOIs will soon be issued by the PlutoF system (only if the SH matching service integrated with the PlutoF platform is used for the analysis). The SH matching service covers all Eukaryota by using rDNA ITS marker sequences accompanied by sample metadata.

The script expects input files in FASTA format. Outdata files are described in sh_matching_analysis/readme.txt.

Third-party software used by this tool

Setup

Pre-requisites

Singularity - install Singularity (tested with version 3.5) and obtain an API key for remote builds

Setup steps

Create Singularity Image File (SIF)

sudo singularity build sh_matching.sif sh_matching.def

OPTIONAL: Copy SIF to HPC

scp sh_matching.sif example_hpc_user@example.com:

Create input, output and working data directories
```
mkdir userdir
mkdir indata
mkdir outdata
```

Download the FASTA databases (https://app.plutof.ut.ee/filerepository/view/6864682) and create UDB-formatted databases

wget https://s3.hpc.ut.ee/plutof-public/original/e7d901ef-5940-482c-85f8-0473ce86df0b.zip
mv e7d901ef-5940-482c-85f8-0473ce86df0b.zip sh_matching_data_udb_0_5.zip
unzip sh_matching_data_udb_0_5.zip
rm sh_matching_data_udb_0_5.zip
cd data_udb/
vsearch --makeudb_usearch sanger_refs_sh.fasta --output sanger_refs_sh.udb
rm sanger_refs_sh.fasta
vsearch --makeudb_usearch sanger_refs_sh_full.fasta --output sanger_refs_sh_full.udb
rm sanger_refs_sh_full.fasta
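
Because the commands above delete each FASTA file right after conversion, it can be useful to confirm that both UDB files were actually written before moving on. The following is a minimal sketch, not part of the tool itself: `check_udb_dir` is a hypothetical helper name, and it only checks that the two files named above exist and are non-empty (it does not validate their contents).

```shell
# Hypothetical helper (not part of sh_matching): succeed only if both
# UDB files produced by the vsearch commands exist and are non-empty.
check_udb_dir() (
    dir=${1:-data_udb}   # directory holding the UDB files
    status=0
    for name in sanger_refs_sh.udb sanger_refs_sh_full.udb; do
        if [ ! -s "$dir/$name" ]; then
            echo "missing or empty: $dir/$name" >&2
            status=1
        fi
    done
    return "$status"
)
```

For example, `check_udb_dir data_udb && echo "UDB databases ready"` could be run from the directory that contains `data_udb/`.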

Running the analysis

NB! The script expects input files in FASTA format, named as source_[run_id] and placed in indata/ directory. Outdata files are described in sh_matching_analysis/readme.txt.
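
The naming convention above can be sketched as a small staging step. This is an illustration under stated assumptions: `stage_input` is a hypothetical helper, and its FASTA check is deliberately crude (it only tests that the first byte of the file is `>`), so it is not a full format validator.

```shell
# Hypothetical helper (not part of sh_matching): copy a FASTA file to
# <indata_dir>/source_<run_id>, the name the pipeline expects.
stage_input() (
    fasta=${1:-}
    run_id=${2:-}
    indata_dir=${3:-indata}

    if [ -z "$run_id" ]; then
        echo "run_id is required" >&2
        return 1
    fi
    if [ ! -s "$fasta" ]; then
        echo "input file missing or empty: $fasta" >&2
        return 1
    fi
    # Crude format check: assume a FASTA file starts with a '>' header.
    first_char=$(head -c 1 "$fasta")
    if [ "$first_char" != ">" ]; then
        echo "not FASTA (no leading '>'): $fasta" >&2
        return 1
    fi
    mkdir -p "$indata_dir"
    cp "$fasta" "$indata_dir/source_$run_id"
)
```

With a file called `my_sequences.fasta` (a placeholder name), `stage_input my_sequences.fasta 11` would create `indata/source_11`.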

Run the pipeline using the SIF (example data with the parameters listed below):

run_id=11
region=itsfull[default]|its2
itsx_step=yes[default]|no - flag indicating whether to include the ITSx step in the analysis (default, "yes")
remove_userdir=yes[default]|no - flag indicating whether to delete the user directory upon pipeline completion (default, "yes")
include_vsearch_step=yes|no[default] - flag indicating whether to include the vsearch substring dereplication step (default, "no")
conduct_usearch_05_step=yes|no[default] - flag indicating whether to conduct the usearch complete-linkage clustering at 0.5% dissimilarity (default, "no")
```
./sh_matching.sif /sh_matching/run_pipeline.sh 11 itsfull yes yes no no
```
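
Since the six arguments in the example are given by position, a typo in one of them is easy to miss. The sketch below shows one way to check the values before launching a run. `build_pipeline_cmd` is a hypothetical helper, not part of the tool: it fills in the documented defaults, rejects values outside the documented choices, and prints the command line without executing it. It always emits all six arguments explicitly, because whether `run_pipeline.sh` accepts omitted arguments is not stated here.

```shell
# Hypothetical helper (not part of sh_matching): validate the six
# documented parameters and print the resulting command line.
build_pipeline_cmd() (
    run_id=${1:-}
    region=${2:-itsfull}
    itsx_step=${3:-yes}
    remove_userdir=${4:-yes}
    include_vsearch_step=${5:-no}
    conduct_usearch_05_step=${6:-no}

    if [ -z "$run_id" ]; then
        echo "run_id is required" >&2
        return 1
    fi
    case $region in
        itsfull|its2) ;;
        *) echo "region must be itsfull or its2, got: $region" >&2
           return 1 ;;
    esac
    for flag in "$itsx_step" "$remove_userdir" \
                "$include_vsearch_step" "$conduct_usearch_05_step"; do
        case $flag in
            yes|no) ;;
            *) echo "flags must be yes or no, got: $flag" >&2
               return 1 ;;
        esac
    done
    echo "./sh_matching.sif /sh_matching/run_pipeline.sh" \
         "$run_id $region $itsx_step $remove_userdir" \
         "$include_vsearch_step $conduct_usearch_05_step"
)
```

Calling `build_pipeline_cmd 11` prints the same command as the example above; the printed line can then be reviewed and run by hand.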

Citing

When using this resource, please cite as:

Abarenkov K, Kõljalg U, Nilsson RH (2022) UNITE Species Hypotheses Matching Analysis. Biodiversity Information Science and Standards 6: e93856. https://doi.org/10.3897/biss.6.93856

Funding

The work is supported by EOSC-Nordic and the Estonian Research Council grant (PRG1170).
