katwre / pigx_bsseq

bisulfite sequencing pipeline from fastq to methylation reports

Copyright 2017: Alexander Gosdschan, Katarzyna Wreczycka, Bren Osberg, Ricardo Wurmus. This work is distributed under the terms of the GNU General Public License, version 3 or later. It is free to use for all purposes.


Summary

PiGx is a data processing pipeline for raw fastq read data of bisulfite experiments; it produces reports on aggregate methylation and coverage and can be used to produce information on differential methylation and segmentation. It was first developed by the Akalin group at MDC in Berlin in 2017.

The figure below provides a sketch of the process.

Install

PiGx uses the GNU build system. To install PiGx from source, download the latest release tarball, make sure that all required dependencies are installed, and then run these steps after unpacking the tarball:

./configure --prefix=/some/where
make install

Dependencies

By default the configure script expects tools to be in a directory listed in the PATH environment variable. If the tools are installed in a location that is not on the PATH you can tell the configure script about them with variables. Run ./configure --help for a list of all variables and options.

The following tools must be available:

All of these dependencies must be present in the environment at configuration time.
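Before running ./configure it can help to see which tools are already visible on the PATH. The following is a minimal sketch of such a check, not part of PiGx itself. The tool names in EXAMPLE_TOOLS are illustrative assumptions based on the programs mentioned elsewhere in this README; the authoritative list is whatever ./configure reports.

```python
import shutil


def find_missing_tools(tools, path=None):
    """Return the names in `tools` that cannot be found on the search path.

    `path` defaults to the PATH environment variable, as in shutil.which.
    """
    return [name for name in tools if shutil.which(name, path=path) is None]


# Hypothetical tool list for illustration only; the real list of required
# tools is the one checked by ./configure.
EXAMPLE_TOOLS = ["bismark", "fastqc", "trim_galore"]

if __name__ == "__main__":
    missing = find_missing_tools(EXAMPLE_TOOLS)
    if missing:
        print("Not on PATH:", ", ".join(missing))
    else:
        print("All example tools were found on PATH.")
```

Any tool reported as missing either needs to be installed or passed to the configure script through the corresponding variable listed by ./configure --help.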

Installation of dependencies via Guix

You can install PiGx through Guix (TODO: add details here after release).

The configure script probes your environment for the tools needed by the pipeline. If you would rather not install all packages manually, we recommend using GNU Guix. The following command spawns a sub-shell in which all dependencies are available:

guix environment -l guix.scm

Getting started

To run PiGx on your experimental data, first enter the necessary parameters in the tablesheet file (see the "Input parameters" section), and then run the following from the terminal:

$ pigx_bs [options]

To see all available options, use the --help option:

$ pigx_bs --help

usage: pigx_bs [-h] [-v] [-p PROGRAMS] [-c CONFIGFILE] [-s SNAKEPARAMS]
               tablesheet

PiGx BSseq Pipeline.

PiGx is a data processing pipeline for raw fastq read data of
bisulfite experiments.  It produces methylation and coverage
information and can be used to produce information on differential
methylation and segmentation.

positional arguments:
  tablesheet                                 The tablesheet containing the basic configuration information for
                                             running the pipeline.

optional arguments:
  -h, --help                                 show this help message and exit
  -v, --version                              show program's version number and exit
  -p PROGRAMS, --programs PROGRAMS           A JSON file containing the absolute paths of the required tools.
  -c CONFIGFILE, --configfile CONFIGFILE     The config file used for calling the underlying snakemake process.  By
                                             default the file 'config.json' is dynamically created from tablesheet
                                             and programs file.
  -s SNAKEPARAMS, --snakeparams SNAKEPARAMS  Additional parameters to be passed down to snakemake, e.g.
                                                 --dryrun    do not execute anything
                                                 --forceall  re-run the whole pipeline
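The interface described by the help text can be mirrored with Python's argparse module. The sketch below is a reconstruction from the help output above, not the project's actual source; the version string and the defaults are placeholders. It is mainly useful for seeing how a command line is split into the positional tablesheet and the optional arguments.

```python
import argparse


def make_parser():
    """Build a parser with the same options as shown in the help text."""
    parser = argparse.ArgumentParser(
        prog="pigx_bs", description="PiGx BSseq Pipeline."
    )
    parser.add_argument(
        "tablesheet",
        help="The tablesheet containing the basic configuration information.",
    )
    # Placeholder version string; the real value is not given in this README.
    parser.add_argument(
        "-v", "--version", action="version", version="%(prog)s (version unknown)"
    )
    parser.add_argument(
        "-p", "--programs",
        help="A JSON file containing the absolute paths of the required tools.",
    )
    parser.add_argument(
        "-c", "--configfile",
        help="The config file used for calling the underlying snakemake process.",
    )
    parser.add_argument(
        "-s", "--snakeparams",
        help="Additional parameters to be passed down to snakemake.",
    )
    return parser


# Example: request a dry run for a tablesheet named tablesheet.txt.
args = make_parser().parse_args(["--snakeparams=--dryrun", "tablesheet.txt"])
print(args.tablesheet, args.snakeparams)
```

In this sketch the value is attached with "=" because argparse would otherwise read a value that begins with dashes as a separate option. Whether the same form is needed for pigx_bs itself depends on how it parses its arguments.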

Input parameters

The input parameters specifying the desired behaviour of PiGx should be entered into the tablesheet file. When PiGx is run, the data from this file will be used to automatically generate a configuration file.

Here is an example tablesheet:

[ GENERAL PARAMETERS ]
PATHIN="in/"
PATHOUT="out/"
GENOMEPATH="genome/"
GENOME_VERSION="hg19"
bismark_args=" -N 0 -L 20 "
fastqc_args=""
trim_galore_args=""
bam_methCall_args_mincov="0"
bam_methCall_args_minqual="10"
NICE="19"
numjobs="6"
cluster_run="FALSE"
contact_email="NONE"
bismark_cores="3"
bismark_MEM="19G"
MEM_default="8G"
qname="all"
h_stack="128m"
diffmeth_cores="20"


[ SAMPLES ]
Read1,Read2,SampleID,ReadType,Treatment
PE_1.fq.gz,PE_2.fq.gz,PEsample,WGBS,0
SE_techrep1.fq.gz,,SEsample,WGBS,1
SE_techrep2.fq.gz,,SEsample_v2,WGBS,2

[ DIFFERENTIAL METHYLATION ]
0, 1

The tablesheet contains three sections:

  • general parameters,
  • a table with sample-specific information: the names of the fastq files, unique sample IDs, the type of bisulfite sequencing experiment (RRBS or WGBS; only WGBS is currently available), and the treatment group used for differential methylation detection,
  • the treatment groups to be compared for differential methylation detection.
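To make the layout of the three sections concrete, here is a minimal sketch of a reader for this file format. It is an illustration based only on the example tablesheet above and is not the parser PiGx uses; the function name parse_tablesheet and the returned structure are assumptions.

```python
import csv
import re

# A section header looks like "[ GENERAL PARAMETERS ]".
SECTION_RE = re.compile(r"^\[\s*(.+?)\s*\]$")


def parse_tablesheet(text):
    """Split a tablesheet into (params, samples, comparisons)."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines only separate sections
        match = SECTION_RE.match(line)
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)

    # KEY="value" pairs; quotes are removed, inner whitespace is kept.
    params = {}
    for line in sections.get("GENERAL PARAMETERS", []):
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip().strip('"')

    # Comma-separated table whose first line is the column header.
    samples = list(csv.DictReader(sections.get("SAMPLES", [])))

    # Each line lists the treatment groups to compare, e.g. "0, 1".
    comparisons = [
        [item.strip() for item in line.split(",")]
        for line in sections.get("DIFFERENTIAL METHYLATION", [])
    ]
    return params, samples, comparisons


EXAMPLE = """\
[ GENERAL PARAMETERS ]
PATHIN="in/"
bismark_args=" -N 0 -L 20 "

[ SAMPLES ]
Read1,Read2,SampleID,ReadType,Treatment
PE_1.fq.gz,PE_2.fq.gz,PEsample,WGBS,0
SE_techrep1.fq.gz,,SEsample,WGBS,1

[ DIFFERENTIAL METHYLATION ]
0, 1
"""

params, samples, comparisons = parse_tablesheet(EXAMPLE)
print(params["PATHIN"], len(samples), comparisons)
```

Note that a single-end sample leaves the Read2 column empty, so it is read as an empty string.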

Details about General Parameters

The general parameters section must define the following variables:

Variable name, type: description
PATHIN string: location of all experimental input data files (.fastq[.gz|.bz2])
PATHOUT string: ultimate location of the output data and report files
GENOMEPATH string: location of the reference genome data for alignment
GENOME_VERSION string: a UCSC assembly release name, e.g. "hg19"
bismark_args string: optional arguments supplied to Bismark during alignment, e.g. " -N 0 -L 20 ". See the Bismark User Guide.
fastqc_args string: optional arguments supplied to FastQC during quality control, e.g. "". See the FastQC documentation.
trim_galore_args string: optional arguments supplied to Trim Galore! during read trimming, e.g. "". See the Trim Galore! documentation.
bam_methCall_args_mincov string: minimum read coverage for a base to be included in the methylKit objects; defaults to 10. Any methylated base or region in the text files with coverage below this value is ignored.
bam_methCall_args_minqual string: minimum phred quality score required to call the methylation status of a base, e.g. "10"
cluster_run string: a boolean indicating whether the pipeline should be run on a cluster, e.g. "FALSE"
numjobs string: number of jobs sent to the cluster, e.g. "6"
contact_email string: email address to which information about cluster jobs is sent, e.g. "NONE"
bismark_cores string: number of cores used by Bismark, e.g. "3"
bismark_MEM string: amount of memory used by Bismark, e.g. "19G"
MEM_default string: amount of memory used for all jobs besides Bismark, e.g. "8G"
qname string: queue name (used for cluster jobs), e.g. "all"
h_stack string: stack size limit (used for cluster jobs), e.g. "128m"
diffmeth_cores integer: number of cores used for parallel differential methylation calculations, e.g. "20"
NICE integer: from -20 to 19; higher values make the program execution less demanding on computational resources, e.g. "19"
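The constraints in this table can be checked before starting a run. The sketch below is only an illustration and not part of PiGx: the function name is hypothetical, the set of variables treated as whole numbers is inferred from the example values, and the assumption that cluster_run accepts only "TRUE" or "FALSE" is likewise a guess from the example.

```python
# Variable names taken from the table above.
REQUIRED_VARIABLES = [
    "PATHIN", "PATHOUT", "GENOMEPATH", "GENOME_VERSION",
    "bismark_args", "fastqc_args", "trim_galore_args",
    "bam_methCall_args_mincov", "bam_methCall_args_minqual",
    "cluster_run", "numjobs", "contact_email",
    "bismark_cores", "bismark_MEM", "MEM_default",
    "qname", "h_stack", "diffmeth_cores", "NICE",
]

# Assumption: these values are expected to parse as whole numbers.
INTEGER_VARIABLES = [
    "bam_methCall_args_mincov", "bam_methCall_args_minqual",
    "numjobs", "bismark_cores", "diffmeth_cores", "NICE",
]


def _as_int(value):
    """Return value as an int, or None if it does not parse."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def check_general_parameters(params):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    for name in REQUIRED_VARIABLES:
        if name not in params:
            problems.append(f"missing variable: {name}")
    for name in INTEGER_VARIABLES:
        if name in params and _as_int(params[name]) is None:
            problems.append(f"{name} is not an integer: {params[name]!r}")
    nice = _as_int(params.get("NICE"))
    if nice is not None and not -20 <= nice <= 19:
        problems.append(f"NICE must be between -20 and 19, got {nice}")
    # Assumption: only the literal strings TRUE and FALSE are accepted.
    if "cluster_run" in params and params["cluster_run"] not in ("TRUE", "FALSE"):
        problems.append(f"cluster_run should be TRUE or FALSE: {params['cluster_run']!r}")
    return problems


# Values copied from the example tablesheet.
EXAMPLE_PARAMS = {
    "PATHIN": "in/", "PATHOUT": "out/", "GENOMEPATH": "genome/",
    "GENOME_VERSION": "hg19", "bismark_args": " -N 0 -L 20 ",
    "fastqc_args": "", "trim_galore_args": "",
    "bam_methCall_args_mincov": "0", "bam_methCall_args_minqual": "10",
    "NICE": "19", "numjobs": "6", "cluster_run": "FALSE",
    "contact_email": "NONE", "bismark_cores": "3", "bismark_MEM": "19G",
    "MEM_default": "8G", "qname": "all", "h_stack": "128m",
    "diffmeth_cores": "20",
}

print(check_general_parameters(EXAMPLE_PARAMS))
```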

Make sure that all input files (paired or single end) are present in the folder indicated by PATHIN. All output produced by the pipeline will be written to the folder indicated by PATHOUT, with subdirectories corresponding to the various stages of the process. The directory pointed to by GENOMEPATH has to contain the reference genome being mapped to.
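The first requirement, that every read file named in the sample table exists under PATHIN, is easy to verify up front. The following is a minimal sketch under the assumption that the samples are available as dictionaries keyed by the column names Read1 and Read2; the function name is hypothetical and the check is not part of PiGx.

```python
import tempfile
from pathlib import Path


def missing_input_files(pathin, samples):
    """Return the read file names that are not present in `pathin`."""
    missing = []
    for sample in samples:
        for column in ("Read1", "Read2"):
            name = sample.get(column, "")
            # Single-end samples leave Read2 empty, so skip empty names.
            if name and not (Path(pathin) / name).is_file():
                missing.append(name)
    return missing


# Small demonstration in a throwaway directory: only PE_1.fq.gz exists.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "PE_1.fq.gz").touch()
    demo_samples = [
        {"Read1": "PE_1.fq.gz", "Read2": "PE_2.fq.gz"},
        {"Read1": "SE_techrep1.fq.gz", "Read2": ""},
    ]
    demo_missing = missing_input_files(tmp, demo_samples)

print(demo_missing)
```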
