raqmejtru / SRAlign

a flexible pipeline for short read alignment to a reference

SRAlign

A flexible pipeline for short read alignment to a reference with extensive QC reporting.

Introduction

SRAlign is a Nextflow pipeline for aligning short reads to a reference.

SRAlign is designed to be highly flexible: tools can be added to the pipeline easily, and the pipeline can serve as a starting point for genomic analyses that rely on aligning short reads to a reference.

Pipeline overview

  1. Trim reads
  2. QC of reads
    1. Raw reads FastQC
    2. Trimmed reads FastQC
    3. Summary MultiQC
  3. Align reads
    1. Align to reference genome/transcriptome
    2. Check contamination
  4. Preprocess alignments
    1. Mark duplicates
    2. Compress SAM to BAM
    3. Index BAM
  5. QC of alignments
    1. Samtools stats
    2. Samtools index stats
    3. Percent duplicates
    4. Percent aligned to contamination reference
    5. Summary MultiQC
  6. Library complexity and reproducibility
    1. Preseq library complexity
    2. DeepTools correlation
    3. DeepTools PCA
  7. Full pipeline MultiQC

Quick start

Prerequisites

  1. Any POSIX-compatible system (e.g. Linux or OS X) with internet access

  2. Nextflow version >= 21.04

  3. Docker

    • I recommend Docker Desktop for OS X or Windows users
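
A quick way to check these prerequisites is sketched below. This is only an illustration, not part of SRAlign: the helper name version_ge is hypothetical, the comparison relies on sort -V (version sort, as provided by GNU coreutils), and the script assumes the version number is the third word printed by nextflow -v.

```shell
# Return success if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort).
version_ge() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Check Nextflow. Assumption: the version is the third word of `nextflow -v`.
if command -v nextflow >/dev/null 2>&1; then
    nf_version=$(nextflow -v | awk '{ print $3 }')
    if version_ge "$nf_version" "21.04"; then
        echo "Nextflow $nf_version is recent enough"
    else
        echo "Nextflow $nf_version is older than 21.04"
    fi
else
    echo "Nextflow not found on PATH"
fi

# Check that the Docker client is installed.
if command -v docker >/dev/null 2>&1; then
    docker --version
else
    echo "Docker not found on PATH"
fi
```

The script only reports what it finds. It does not install anything or confirm that the Docker daemon is running.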

Get or update SRAlign

  1. Download or update SRAlign:

    • Downloads the project into $HOME/.nextflow/assets
    • Useful for quickly downloading and easily running a project.
      • Lets you run SRAlign by referring to trev-f/SRAlign in Nextflow commands, without specifying where SRAlign is located on the system.
      • To customize or expand SRAlign, see the documentation on customizing or expanding SRAlign.
    nextflow pull trev-f/SRAlign
  2. Show project info:

    nextflow info trev-f/SRAlign

Test SRAlign

  1. Check that SRAlign works on your system:

    • -profile test uses preconfigured test parameters to run SRAlign in full on a small test dataset stored in a remote GitHub repository.
      • Because these test files are stored in a remote repository, internet access is required to run the test.
      • For more information, see the profiles section of the nextflow config file and trev-f/SRAlign-test.
    nextflow run trev-f/SRAlign -profile test 

Run SRAlign

  1. Prepare the input design csv file.

    • The input design file must be in CSV format and contain no whitespace.
    • Either reads (fastq or fastq.gz) or alignments (bam) are accepted.
      • If reads are supplied, they can be paired or unpaired.
    • Required columns:
      • reads: lib_ID, sample_name, replicate, reads1, reads2 (optional)
      • alignments: lib_ID, sample_name, replicate, bam, tool_IDs
    • See sample inputs in the SRAlign-test repository.
    • A template project repository can be downloaded from the SRAlign-template repository.
  2. Display the help message to see all configurable options for SRAlign:

    • The most important information here is probably the list of available reference genomes.
    nextflow run trev-f/SRAlign --help
  3. Analyze your data with SRAlign:

    nextflow run trev-f/SRAlign -profile docker --input <input.csv> --genome <valid genome key>
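
As an illustration of step 1, the sketch below writes a small design file for paired-end reads and checks it against the two rules stated above (required columns, no whitespace). The library IDs, sample names, and FASTQ paths are hypothetical placeholders, and check_design is an illustrative helper, not something SRAlign provides. See the SRAlign-test repository for real sample inputs.

```shell
# Write a hypothetical paired-end design file. The IDs, sample names, and
# FASTQ paths are placeholders, not files shipped with SRAlign.
cat > design.csv <<'EOF'
lib_ID,sample_name,replicate,reads1,reads2
lib01,wt,1,data/wt_rep1_R1.fastq.gz,data/wt_rep1_R2.fastq.gz
lib02,wt,2,data/wt_rep2_R1.fastq.gz,data/wt_rep2_R2.fastq.gz
lib03,mut,1,data/mut_rep1_R1.fastq.gz,data/mut_rep1_R2.fastq.gz
EOF

# Illustrative sanity check: no whitespace anywhere, and every row has the
# same number of comma-separated fields as the header.
check_design() {
    if grep -q '[[:space:]]' "$1"; then
        echo "$1 contains whitespace" >&2
        return 1
    fi
    awk -F',' 'NR == 1 { n = NF } NF != n { bad = 1 } END { exit bad + 0 }' "$1"
}

check_design design.csv && echo "design.csv looks consistent"
```

The resulting file would then be passed to the pipeline with --input design.csv.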

Tips for running Nextflow and SRAlign

SRAlign is designed to be highly configurable: its default behavior can be changed by supplying any of a number of parameters. Parameters can be supplied in several ways, which follow a specific order of precedence.

  • Display the configurable parameters with the command line help: nextflow run trev-f/SRAlign --help
  • Nextflow arguments always begin with a single dash, e.g. -profile.
  • Pipeline parameters specified at the command line always begin with a double dash, e.g. --input.
    • Parameters specified at the command line always have the highest precedence. They override parameters specified in any config or params files.
    • I recommend specifying required parameters (i.e. --input and --genome) and up to a few others at the command line in this manner. Specifying more than this at the command line gets unwieldy.
  • A custom config or parameters file is a good option for cases where you want to supply more parameters than can comfortably be done at the command line or you want to use the same custom parameters in multiple runs.
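
A minimal sketch of the parameters-file approach follows. It uses Nextflow's -params-file option, which accepts a YAML or JSON file whose keys are the pipeline parameter names without the leading double dash. The file name params.yaml and the value design.csv are placeholders, and only the two parameters named in this README are shown.

```shell
# Write a hypothetical parameters file. Replace the placeholder values with
# your own design file and a genome key listed by --help.
cat > params.yaml <<'EOF'
input: "design.csv"
genome: "<valid genome key>"
EOF

# Print the command that would use this file (not executed here).
echo "nextflow run trev-f/SRAlign -profile docker -params-file params.yaml"
```

Because command line parameters take precedence, adding --genome <valid genome key> to that command would override the genome value stored in params.yaml.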

Additional documentation

Additional documentation can be found in docs.

License: MIT License


Languages

Nextflow 58.9%, Python 27.0%, Groovy 14.1%