TU-NHM / sh_matching_pub

SH MATCHING analysis tool

NB! The master branch is used as the development branch. Please check the Releases page to download a specific version of the SH matching tool.

Developed as part of the EOSC-Nordic project (task 5.2.1: Cross-border data processing workflows), UNITE SH matching analysis is a digital service for global species discovery from eDNA (environmental DNA). The SH matching service is based on the UNITE datasets hosted in PlutoF. Its output includes information about which species are present in eDNA samples, whether they are potentially undescribed new species, where they have been found in other studies, whether they are alien or threatened species, etc. The output provides DOI (Digital Object Identifier) based stable identifiers for communicating the species found in eDNA. DOIs are connected to the taxonomic backbone of PlutoF and GBIF, so every DOI is accompanied by a taxon name, which is still widely used for communicating species. For undescribed species, DOIs will soon be issued by the PlutoF system (only if the SH matching service integrated with the PlutoF platform is used for the analysis). The SH matching service covers all Eukaryota by using rDNA ITS marker sequences accompanied by sample metadata.

The script expects input files in FASTA format. Outdata files are described in sh_matching_analysis/readme.txt.

Third-party software used by this tool

Setup

Pre-requisites

  • Singularity - install Singularity (tested with version 3.5) and obtain an API key for remote builds

Setup steps

  1. Create Singularity Image File (SIF)

    sudo singularity build sh_matching.sif sh_matching.def
  2. OPTIONAL: Copy SIF to HPC

    scp sh_matching.sif example_hpc_user@example.com:
  3. Create input, output and working data directories

    mkdir userdir
    mkdir indata
    mkdir outdata
  4. Download the FASTA databases (https://app.plutof.ut.ee/filerepository/view/6884701) and create UDB-formatted databases

    wget https://s3.hpc.ut.ee/plutof-public/original/d3d8b3de-83af-4fb5-b82b-359f7b730f84.zip
    mv d3d8b3de-83af-4fb5-b82b-359f7b730f84.zip sh_matching_data_udb_0_5.zip
    unzip sh_matching_data_udb_0_5.zip
    rm sh_matching_data_udb_0_5.zip
    cd data_udb/
    vsearch --makeudb_usearch sanger_refs_sh.fasta --output sanger_refs_sh.udb
    rm sanger_refs_sh.fasta
    vsearch --makeudb_usearch sanger_refs_sh_full.fasta --output sanger_refs_sh_full.udb
    rm sanger_refs_sh_full.fasta
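
Because step 4 deletes the FASTA sources once the UDB files are built, it can be worth confirming that both UDB files exist before moving on. The following is a minimal sketch, not part of the SH matching tool: the function name `check_udb_files` is hypothetical, and it only checks that the two files named in step 4 are present and non-empty.

```shell
# Hypothetical helper (not shipped with sh_matching_pub): check that the two
# UDB files created in step 4 are present and non-empty in a directory.
check_udb_files() {
    udb_dir="${1:-data_udb}"   # directory to inspect; defaults to data_udb
    udb_missing=0
    for udb_name in sanger_refs_sh.udb sanger_refs_sh_full.udb; do
        if [ -s "$udb_dir/$udb_name" ]; then
            echo "ok: $udb_dir/$udb_name"
        else
            echo "missing or empty: $udb_dir/$udb_name" >&2
            udb_missing=1
        fi
    done
    # Non-zero return status means at least one file was not found.
    return "$udb_missing"
}
```

Calling `check_udb_files data_udb` from the directory that contains `data_udb/` returns a non-zero status if either file is absent.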

Running the analysis

NB! The script expects input files in FASTA format, named as source_[run_id] and placed in indata/ directory. Outdata files are described in sh_matching_analysis/readme.txt.
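
The naming rule above can be sketched as a small staging helper. This is an illustration under stated assumptions, not part of the tool: `stage_input` is a hypothetical name, and the FASTA check is deliberately crude (it only looks at the first character of the file).

```shell
# Hypothetical helper (not shipped with sh_matching_pub): copy a FASTA file
# into the input directory under the expected name source_<run_id>.
stage_input() {
    fasta_path="$1"              # path to the user's FASTA file
    stage_run_id="$2"            # run identifier, e.g. 11
    indata_dir="${3:-indata}"    # input directory; defaults to indata

    # Crude sanity check: a FASTA file begins with a ">" header line.
    first_char=$(head -c 1 "$fasta_path" 2>/dev/null)
    if [ "$first_char" != ">" ]; then
        echo "error: $fasta_path does not look like a FASTA file" >&2
        return 1
    fi

    mkdir -p "$indata_dir"
    cp "$fasta_path" "$indata_dir/source_$stage_run_id"
    # Print the staged path so the caller can confirm it.
    echo "$indata_dir/source_$stage_run_id"
}
```

For example, `stage_input my_sequences.fasta 11` would place the file at `indata/source_11`, where `my_sequences.fasta` is a placeholder file name.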

  1. Run the pipeline using the SIF. The example command uses the following positional arguments, in this order:
  • run_id=11 - run identifier; the input file is expected at indata/source_11

  • region=itsfull[default]|its2

  • itsx_step=yes[default]|no - flag indicating whether to include the ITSx step in the analysis

  • remove_userdir=yes[default]|no - flag indicating whether to delete the user directory upon pipeline completion

  • include_vsearch_step=yes|no[default] - flag indicating whether to include the vsearch substring dereplication step

  • conduct_usearch_05_step=yes|no[default] - flag indicating whether to conduct the usearch complete-linkage clustering at 0.5% dissimilarity

    ./sh_matching.sif /sh_matching/run_pipeline.sh 11 itsfull yes yes no no
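
Since the arguments are positional, a `yes` or `no` in the wrong slot is easy to miss. The sketch below names each argument and validates it before printing the command. It is a hypothetical wrapper, not part of the tool: the function name `build_pipeline_cmd` is invented here, and the defaults it fills in are copied from the list above rather than read from `run_pipeline.sh`.

```shell
# Hypothetical wrapper (not shipped with sh_matching_pub): build the pipeline
# command line from named values and reject anything outside the documented
# options. Defaults mirror the README list and are applied by this function.
build_pipeline_cmd() {
    run_id="$1"
    region="${2:-itsfull}"
    itsx_step="${3:-yes}"
    remove_userdir="${4:-yes}"
    include_vsearch_step="${5:-no}"
    conduct_usearch_05_step="${6:-no}"

    if [ -z "$run_id" ]; then
        echo "error: run_id is required" >&2
        return 1
    fi
    case "$region" in
        itsfull|its2) ;;
        *) echo "error: region must be itsfull or its2, got '$region'" >&2
           return 1 ;;
    esac
    for flag in "$itsx_step" "$remove_userdir" \
                "$include_vsearch_step" "$conduct_usearch_05_step"; do
        case "$flag" in
            yes|no) ;;
            *) echo "error: flags must be yes or no, got '$flag'" >&2
               return 1 ;;
        esac
    done

    # Print the command instead of running it, so it can be inspected first.
    echo "./sh_matching.sif /sh_matching/run_pipeline.sh $run_id $region $itsx_step $remove_userdir $include_vsearch_step $conduct_usearch_05_step"
}
```

Calling `build_pipeline_cmd 11` prints the same command as the example above; `build_pipeline_cmd 11 its2 no` would switch the region and skip the ITSx step while leaving the remaining flags at their listed defaults.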

Citing

When using this resource, please cite as:

Abarenkov K, Kõljalg U, Nilsson RH (2022) UNITE Species Hypotheses Matching Analysis. Biodiversity Information Science and Standards 6: e93856. https://doi.org/10.3897/biss.6.93856

Funding

The work is supported by EOSC-Nordic and the Estonian Research Council grant (PRG1170).

About

License: MIT License