Pamir: Discovery and Genotyping of Novel Sequence Insertions in Many Sequenced Individuals

Pamir detects and genotypes novel sequence insertions in single or multiple datasets of paired-end WGS (Whole Genome Sequencing) Illumina reads by jointly analyzing one-end anchored (OEA) and orphan reads.

Table of contents

  1. Installation
  2. Running Pamir
  3. Example
  4. Visualization
  5. Publications
  6. Contact and Support

Installation

Installation from Source

Prerequisite. You will need g++ 5.2 or higher to compile the source code.

The first step in installing Pamir is to download the source code from our GitHub repository. After downloading, change the current directory to the source directory pamir and run make and make install in a terminal to create the necessary binary files.

git clone https://github.com/vpc-ccg/pamir.git --recursive
cd pamir
make
make install
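
Before running make, it can help to confirm that the installed compiler meets the g++ 5.2 prerequisite. The following is a minimal sketch, not part of Pamir: the helper names (parse_version, meets_minimum, installed_gxx_version) are made up for illustration, and it assumes the compiler is reachable as g++ on your PATH.

```python
import re
import subprocess

MIN_GXX = (5, 2)  # minimum version stated in the prerequisite above


def parse_version(text):
    """Extract the first dotted version number (e.g. '9.4.0') as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version, minimum=MIN_GXX):
    """Compare versions component-wise; missing components count as 0."""
    width = max(len(version), len(minimum))
    padded = tuple(version) + (0,) * (width - len(version))
    required = tuple(minimum) + (0,) * (width - len(minimum))
    return padded >= required


def installed_gxx_version(compiler="g++"):
    """Ask the compiler for its version; return None if it cannot be run."""
    try:
        result = subprocess.run([compiler, "-dumpversion"],
                                capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_version(result.stdout)


if __name__ == "__main__":
    found = installed_gxx_version()
    if found is None:
        print("g++ was not found on PATH")
    else:
        status = "OK" if meets_minimum(found) else "older than 5.2"
        print("g++", ".".join(map(str, found)), status)
```

Some distributions report only the major version for -dumpversion (for example 9), which is why missing components are padded with zeros before comparing.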

Running Pamir

Prerequisites

Pamir's pipeline requires a number of external programs. You can either manually install them or take advantage of pamir's conda environment.yaml to install all the dependencies except the assembler:

conda env create -f environment.yaml
source activate pamir-deps

Dependency      Version
Python          3.x
samtools        >= 1.9
mrsfast         >= 3.4.0
BLAST           >= 2.9.0+
bedtools        >= 2.26.0
bwa             >= 0.7.17
snakemake       >= 5.3.0
RepeatMasker    >= 4.0.9
minia           >= 3.2.0 *
abyss           >= 2.2.3 *
spades          >= 3.13 *

*Note: You only need to install one of the assemblers.
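
If you install the dependencies manually, a quick check that the programs are visible on your PATH can save a failed pipeline run. The sketch below is an illustration only: the executable names (for example blastn for BLAST, abyss-pe for abyss, and spades.py for spades) are assumptions about how these packages are commonly installed, and the script does not verify version numbers.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED = ["samtools", "mrsfast", "blastn", "bedtools", "bwa",
            "snakemake", "RepeatMasker"]
ASSEMBLERS = ["minia", "abyss-pe", "spades.py"]  # only one of these is needed


def check_dependencies(which=shutil.which):
    """Return (missing required tools, assemblers found on PATH)."""
    missing = [tool for tool in REQUIRED if which(tool) is None]
    assemblers = [tool for tool in ASSEMBLERS if which(tool) is not None]
    return missing, assemblers


if __name__ == "__main__":
    missing, assemblers = check_dependencies()
    if missing:
        print("Missing tools:", ", ".join(missing))
    if not assemblers:
        print("No assembler found; install one of:", ", ".join(ASSEMBLERS))
    if not missing and assemblers:
        print("All required tools found; assembler(s):", ", ".join(assemblers))
```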

Project Configuration

To run pamir, you need to create a project configuration file named config.yaml. This configuration consists of a number of mandatory settings and some optional advanced settings. Below is the list of all the settings that you can set in your project.

config-parameter-name         Type        Description
path                          Mandatory   Full path to the project directory.
raw-data                      Mandatory   Location of the input files (CRAM or BAM) relative to path.
population                    Mandatory   Population/cohort name. The name cannot contain any space characters.
reference                     Mandatory   Full path to the reference genome.
input                         Mandatory   A list of input files per individual. Pamir 2.0 accepts BAM and CRAM files as input.
analysis-base                 Optional    Location of intermediate files relative to path. default: {path}/analysis
results-base                  Optional    Location of final results relative to path. default: {path}/results
assembler                     Optional    External assembler to use (minia, abyss, spades). default: minia
assembler_k                   Optional    k-mer size to use for the external assembler. default: 47
pamir_partitition_per_thread  Optional    Number of internal pamir jobs to be completed per thread. This is an advanced setting; values that are too small or too large can degrade performance. default: 1000
blastdb                       Optional    Full path to a BLAST database used to remove possible contaminants from the data.
centromeres                   Optional    Full path to a BED file containing centromere locations. Calls in these regions are not reported.
align_threads                 Optional    Number of threads to use for alignment jobs. default: 16
assembly_threads              Optional    Number of threads to use for assembly jobs. default: 62
other_threads                 Optional    Number of threads to use for other jobs. default: 16
minia_min_abundance           Optional    minia's internal assembly parameter. default: 5
min_contig_len                Deprecated  Minimum contig length from the external assembler to use. This is now calculated on the fly.
read_length                   Deprecated  Read length of the input reads. This is now calculated on the fly.

The following is an example config.yaml with two individuals.

path:
    /full/path/to/project-directory
raw-data:
    raw-data
reference:
    /full/path/to/the/reference.fa
population:
    my-pop
input:
 "samplename1":
  - A.cram
 "samplename2":
  - B.bam
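
A configuration like the one above can be sanity-checked before starting a long run. The sketch below is not part of Pamir; validate_config is a hypothetical helper that takes the configuration as a Python dictionary (for example, after loading config.yaml with a YAML parser of your choice) and checks only the rules stated in the settings table: the mandatory keys, the population name, the assembler choice, and the BAM/CRAM input extensions.

```python
MANDATORY = ("path", "raw-data", "population", "reference", "input")
ASSEMBLERS = ("minia", "abyss", "spades")


def validate_config(cfg):
    """Return a list of problems; an empty list means none were found."""
    problems = [f"missing mandatory setting: {key}"
                for key in MANDATORY if key not in cfg]
    # The table only forbids spaces; this check rejects any whitespace.
    if any(ch.isspace() for ch in cfg.get("population", "")):
        problems.append("population must not contain whitespace")
    if cfg.get("assembler", "minia") not in ASSEMBLERS:
        problems.append(f"unknown assembler: {cfg['assembler']}")
    for sample, files in cfg.get("input", {}).items():
        for name in files:
            if not name.lower().endswith((".bam", ".cram")):
                problems.append(f"{sample}: {name} is not a BAM or CRAM file")
    return problems


# The same settings as the example config.yaml, written as a dictionary.
example = {
    "path": "/full/path/to/project-directory",
    "raw-data": "raw-data",
    "reference": "/full/path/to/the/reference.fa",
    "population": "my-pop",
    "input": {"samplename1": ["A.cram"], "samplename2": ["B.bam"]},
}

for problem in validate_config(example):
    print(problem)
```

The function does not check that the paths exist or that the files are readable; it only catches structural mistakes in the settings themselves.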

To run pamir with this config file, use the following command.

pamir.sh --configfile /path/to/config.yaml

Since pamir.sh uses snakemake internally, you can pass any additional snakemake parameters to pamir.sh. Here are some examples:

pamir.sh --configfile /path/to/config.yaml -j <number of threads>
pamir.sh --configfile /path/to/config.yaml -np          # dry run
pamir.sh --configfile /path/to/config.yaml --forceall   # rerun all steps regardless of the current stage
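
When pamir.sh is launched from a script, the command line can be assembled programmatically. The following is a small sketch under that assumption; build_pamir_command is a hypothetical helper, and it uses only the options shown in the examples above (-j, -np, --forceall).

```python
import shlex


def build_pamir_command(configfile, jobs=None, dry_run=False,
                        force_all=False, extra=()):
    """Build the pamir.sh argument list; `extra` holds further snakemake options."""
    cmd = ["pamir.sh", "--configfile", str(configfile)]
    if jobs is not None:
        cmd += ["-j", str(jobs)]
    if dry_run:
        cmd.append("-np")
    if force_all:
        cmd.append("--forceall")
    cmd += list(extra)
    return cmd


command = build_pamir_command("/path/to/config.yaml", jobs=16, dry_run=True)
print(shlex.join(command))
# prints: pamir.sh --configfile /path/to/config.yaml -j 16 -np
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems when paths contain spaces.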

Running pamir on High Performance Clusters (HPC)

Pamir can be run on HPC environments using the command below. At the moment, the HPC module supports only the Slurm Workload Manager. cluster.json was developed and optimized for our specific setup and may require edits to work with other workload managers. Supporting other systems is currently outside our development scope, but we are happy to help users who run into job, memory, or queue problems.

Advanced tweaking tips for cluster settings

cluster.json contains the CPU, memory, time, and queue specifications for the individual tasks in the pipeline. Users can decrease or increase these values based on their system's capabilities. We recommend maximizing the resources of minia_all and pamir_assemble_full_new for optimal performance, as these two jobs are responsible for the assembly tasks. The __default__ task should be understood as our express task, which requires only one CPU and little memory.

pamir.sh --configfile config.yaml  -j110 -p --cluster-config cluster.json
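
Raising the resources of the two assembly tasks can be scripted rather than edited by hand. The sketch below is illustrative only: the rule names minia_all and pamir_assemble_full_new come from the tips above, but the field names "cpus" and "mem" and all values are placeholders, not necessarily those used in Pamir's cluster.json. Check the actual file before applying anything like this.

```python
import copy
import json

ASSEMBLY_RULES = ("minia_all", "pamir_assemble_full_new")  # named in the tips above


def with_overrides(cluster, overrides, rules=ASSEMBLY_RULES):
    """Return a copy of the cluster settings with `overrides` merged into each rule."""
    updated = copy.deepcopy(cluster)
    for rule in rules:
        updated.setdefault(rule, {}).update(overrides)
    return updated


# Hypothetical file content; "cpus" and "mem" are placeholder field names.
cluster = json.loads("""
{
  "__default__": {"cpus": 1, "mem": "4G"},
  "minia_all": {"cpus": 16, "mem": "64G"}
}
""")

tuned = with_overrides(cluster, {"cpus": 62, "mem": "250G"})
print(json.dumps(tuned, indent=2))
```

The original dictionary is left untouched, so the tuned settings can be written to a separate file and passed to --cluster-config without overwriting the shipped cluster.json.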

Output Formats

Pamir generates the following directory structure. The detected novel sequence insertions for each individual are reported in a VCF file.

[path]/
├── raw-data                       -> OR [raw-data]
│   ├── A.cram
│   ├── B.bam
├── analysis                       -> OR [analysis-base]
│   └── my-pop
└── results                        -> OR [results-base]
    └── my-pop
        ├── index.html             -> Summary of events
        ├── summary.js             -> Summary required by index.html
        ├── data.js                -> Data required by index.html
        ├── events.repeat.bed      -> annotation of repeats for detected events
        ├── events.fa              -> all the detected events with 1000bp flanking region
        ├── events.fa.fai          -> index of events.fa
        └── ind
            ├── A
            │   ├── events.bam     -> mapping of the reads in the events region
            │   ├── events.bam.bai -> index
            │   ├── events.bed     -> location of events
            │   └── events.vcf     -> genotyped insertion calls
            ├── B
            │   ├── events.bam
            │   ├── events.bam.bai
            │   ├── events.bed
            │   └── events.vcf
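
Given this layout, the per-individual VCF files can be collected with a few lines of scripting. The sketch below is not part of Pamir; calls_per_individual is a hypothetical helper that relies only on the directory names shown in the tree above and counts the non-header lines of each events.vcf.

```python
from pathlib import Path


def count_vcf_records(vcf_path):
    """Count data lines in a VCF file (header lines start with '#')."""
    with open(vcf_path) as handle:
        return sum(1 for line in handle
                   if line.strip() and not line.startswith("#"))


def calls_per_individual(results_dir, population):
    """Map each individual under <results_dir>/<population>/ind to its record count."""
    ind_dir = Path(results_dir) / population / "ind"
    return {vcf.parent.name: count_vcf_records(vcf)
            for vcf in sorted(ind_dir.glob("*/events.vcf"))}
```

For the example configuration, calls_per_individual("/full/path/to/project-directory/results", "my-pop") would return a dictionary keyed by the individual directory names, such as A and B.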

Example

curl -L https://ndownloader.figshare.com/files/22813988 --output example.tar.gz
tar xzvf example.tar.gz
cd example
chmod +x configure.sh
./configure.sh
pamir.sh -j16 --configfile config.yaml

Visualization

index.html provides a quick overview of the detected events and is a friendlier alternative to working with VCF files directly. If IGV is running, you can jump back and forth between events from index.html to investigate them.

Publications

Discovery and genotyping of novel sequence insertions in many sequenced individuals. P. Kavak*, Y-Y. Lin*, I. Numanagić, H. Asghari, T. Güngör, C. Alkan‡, F. Hach‡. Bioinformatics (ISMB-ECCB 2017 issue), 33 (14): i161-i169, 2017.

Contact and Support

Feel free to post any inquiry on the issue page.

License: BSD 3-Clause "New" or "Revised" License

