nknox / covid-19-signal

Files and methodology pertaining to the sequencing and analysis of SARS-CoV-2, causative agent of COVID-19.

SARS-CoV-2 Illumina GeNome Assembly Line (SIGNAL)

This Snakemake pipeline is compatible with the Illumina ARTIC nf pipeline. It performs the same consensus and variant calling procedure using iVar. In addition, it adds screening with Kraken2/LMAT, enhanced contamination removal, and additional breseq mutation detection. By default, the SARS-CoV-2 reference genome MN908947.3 is used throughout the analysis. This can be changed by editing the data dependencies download script (pipeline/scripts/get_data_dependencies.sh) and updating config.yaml accordingly. Similarly, sequencing primer and trimming settings can be added or adjusted in config.yaml. See below for full details.

Future enhancements are intended to aid/automate metadata management in accordance with PHA4GE guidelines, and to manage upload to GISAID and INSDC-compatible BioSamples.

Setup/Execution

  1. Clone the git repository

     git clone https://github.com/jaleezyy/covid-19-signal
    
  2. Install conda and snakemake (version >5), e.g.:

     wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
     bash Miniconda3-latest-Linux-x86_64.sh # follow instructions
     source $(conda info --base)/etc/profile.d/conda.sh
     conda create -n snakemake snakemake=5.11.2
     conda activate snakemake
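
To confirm that the activated environment meets the "version >5" requirement, a check along the following lines may help. This is only a sketch: `version_ge` and `check_snakemake` are hypothetical helper names, not part of the pipeline, and the comparison assumes a `sort` that supports `-V` (GNU coreutils).

```shell
# Hypothetical helper (not part of the pipeline): succeeds when version $1 >= version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: compare the installed snakemake against a minimum version.
check_snakemake() {
    min_version="${1:-5.0.0}"
    if ! command -v snakemake >/dev/null 2>&1; then
        echo "snakemake not found; is the conda environment active?" >&2
        return 1
    fi
    installed="$(snakemake --version)"
    if version_ge "$installed" "$min_version"; then
        echo "snakemake $installed OK"
    else
        echo "snakemake $installed is older than $min_version" >&2
        return 1
    fi
}
```

With the environment active, `check_snakemake 5.0.0` prints the installed version or returns non-zero with a message on stderr.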
    

Additional software dependencies are managed directly by snakemake using conda environment files.

  3. Download necessary database files

The pipeline requires:

  • Amplicon primer scheme sequences (*)
  • Nextera sequencing primer sequence files from trimmomatic
  • SARS-CoV2 reference fasta
  • SARS-CoV2 reference gbk
  • SARS-CoV2 reference gff3
  • kraken2 viral database
  • LMAT kML+Human.v4-14.20.g10.db database

All dependencies except the amplicon primers (*) can be automatically fetched using the following accessory script:

    bash pipeline/scripts/get_data_dependencies.sh -d data -a MN908947.3
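
After the download finishes, it can be worth checking that the reference files listed above actually landed in the data directory. The sketch below is an assumption-laden illustration: `missing_reference_files` is a hypothetical helper, and the `.fasta`, `.gbk` and `.gff3` extensions are guessed from the list of requirements rather than taken from the script's actual output names.

```shell
# Hypothetical helper (not part of the pipeline): report which reference
# file types are missing from a data directory.
# ASSUMPTION: the extensions below are guessed from the requirements list;
# the download script may name its files differently.
missing_reference_files() {
    data_dir="$1"
    status=0
    for ext in fasta gbk gff3; do
        # Look for at least one regular file with this extension.
        if [ -z "$(find "$data_dir" -type f -name "*.${ext}" 2>/dev/null | head -n 1)" ]; then
            echo "no .${ext} file found under ${data_dir}" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `missing_reference_files data` returns non-zero and names the missing file types if any of the three are absent.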

  4. Configure your config.yaml file

Either use the convenience Python script (pending) or modify pipeline/example_config.yaml to suit your system.

  5. Specify your samples in CSV format (e.g. sample_table.csv)

See the example table pipeline/example_sample_table.csv for an idea of how to organise this table.
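
A malformed sample table is an easy mistake to make, so a quick structural check before launching the pipeline may save a failed run. The sketch below makes no assumption about column names; it only checks that every row has as many comma-separated fields as the header. `check_csv_columns` is a hypothetical helper, and it assumes no quoted fields containing commas.

```shell
# Hypothetical helper (not part of the pipeline): fail if any non-empty row
# of a CSV has a different number of fields than the header row.
# ASSUMPTION: fields are plain comma-separated values without quoted commas.
check_csv_columns() {
    awk -F',' '
        NR == 1 { expected = NF; next }   # remember the header width
        NF == 0 { next }                  # skip blank lines
        NF != expected {
            printf "line %d: expected %d fields, found %d\n", NR, expected, NF > "/dev/stderr"
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

For example, `check_csv_columns sample_table.csv` exits non-zero and reports the offending line numbers on stderr.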

  6. Execute the pipeline (optionally specifying --cores explicitly):

     snakemake --use-conda -s Snakefile --cores $(nproc) all
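
Before committing to a full run, it can be useful to let snakemake list the jobs it would schedule. The sketch below wraps the command above in a hypothetical `run_signal` function (not part of the pipeline) that performs a dry run with snakemake's `-n` flag and only starts the real run if the dry run succeeds.

```shell
# Hypothetical wrapper (not part of the pipeline): dry-run first, then
# execute for real only if the dry run succeeds.
run_signal() {
    cores="${1:-1}"
    # -n (dry run) prints the planned jobs without executing them.
    snakemake --use-conda -s Snakefile --cores "$cores" -n all || return 1
    snakemake --use-conda -s Snakefile --cores "$cores" all
}
```

For example, `run_signal "$(nproc)"` uses all available cores, matching the command above.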
    

Docker (pending)

Alternatively, the pipeline can be deployed using Docker (see resources/Dockerfile_pipeline for the specification). To pull from Docker Hub:

    docker pull finlaymaguire/signal

Download data dependencies:

    mkdir -p data && docker run -v $PWD/data:/data finlaymaguire/signal:1.0.0 bash scripts/get_data_dependencies.sh -d /data

Add remaining files (e.g. primers) to your config and sample table in the data directory:

    cp config.yaml sample_table.csv $PWD/data && \
        docker run -v $PWD/data:/data finlaymaguire/signal:1.0.0 mv data/config.yaml data/sample_table.csv .

Then execute the pipeline:

    docker run -v $PWD/data:/data finlaymaguire/signal:1.0.0 conda run -n snakemake snakemake --use-conda --conda-prefix $HOME/.snakemake --cores 8 -s Snakefile all

Summaries:

  • Generate summaries of BreSeq among many samples, see

Pipeline details:

For a step-by-step walkthrough of the pipeline, see pipeline/README.md.

A diagram of the workflow is shown below.

[Figure: Workflow Version 5]
