MO-BCCRC/titan_workflow

Titan Pipeline:

Development information

Date Created: October 30 2014
Last Update: Mar 4, 2016 by dgrewal
Developer: Diljot Grewal <dgrewal@bccrc.ca>
Input: bam
Output: params.txt, .RData, seg, segs.txt, segs.txt.pygenes, titan.txt
Version: 5.3

The TITAN pipeline accepts a list of tumour-normal pairs of BAM files as input and infers the clonal clusters of events along with estimates of cellular prevalence, normal contamination and tumour ploidy. The pipeline follows these steps:

1. Identify germline heterozygous SNP positions in the matched normal BAM file. This step is represented by run_mutationseq_TASK_1 in the workflow.
2. Extract the tumour allele read counts from the tumour BAM file at each of the germline heterozygous SNPs from Step 1 (generates input file #1). This step is represented by run_mutationseq_TASK_1 and convert_museq_vcf2counts_TASK_2 in the workflow.
3. Extract the tumour read depth from the tumour BAM file using the HMMcopy suite. Correct GC content and mappability biases using the HMMcopy R package (generates input file #2). This step is represented by the following tasks in the workflow:
- run_readcounter_TASK_3
- run_readcounter_TASK_4
- calc_correctreads_wig_TASK_5
4. Run TitanCNA, including generating figures for chromosome plots. This step is represented by the following tasks in the workflow:
- run_titan_TASK_6
- plot_titan_TASK_7
- calc_cnsegments_titan_TASK_8
- annot_pygenes_titan_TASK_9

1. Getting Started

The documentation for Kronos can be found at http://kronos.readthedocs.org/en/latest/.

2. The Inputs

The pipeline takes a tab-delimited file as input. The header of the file defines the keys, and each row provides the values for these keys.

An input file for the pipeline should resemble the following:

#sample_id    tumour_id    tumour_library_id    tumour    normal_id    normal_library_id    normal
SA123_A01234_SA123N_A01235    SA123    A01234    /path/to/SA123.bam    SA123N    A01235    /path/to/SA123N.bam
SA223_A01234_SA223N_A01235    SA223    A01234    /path/to/SA223.bam    SA223N    A01235    /path/to/SA223N.bam
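
The input format above can be read with a few lines of Python. The following is a minimal sketch, not part of the pipeline: the function name `read_sample_sheet` and the `EXAMPLE_LINES` data are hypothetical, and it assumes that the header is the first non-blank line and that its leading `#` is not part of the first key.

```python
def read_sample_sheet(lines):
    """Parse tab-delimited lines into a list of dicts keyed by the header."""
    header = None
    samples = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if header is None:
            # Assumption: the first non-blank line is the header and the
            # leading '#' is a comment marker, not part of the key name.
            fields[0] = fields[0].lstrip("#")
            header = fields
            continue
        if len(fields) != len(header):
            raise ValueError("expected %d columns, got %d: %r"
                             % (len(header), len(fields), line))
        samples.append(dict(zip(header, fields)))
    return samples


# Hypothetical example data mirroring the first row shown above.
EXAMPLE_LINES = [
    "\t".join(["#sample_id", "tumour_id", "tumour_library_id", "tumour",
               "normal_id", "normal_library_id", "normal"]),
    "\t".join(["SA123_A01234_SA123N_A01235", "SA123", "A01234",
               "/path/to/SA123.bam", "SA123N", "A01235",
               "/path/to/SA123N.bam"]),
]
```

Calling `read_sample_sheet(EXAMPLE_LINES)` returns one dict per sample, so the tumour BAM path of the first pair is available as `rows[0]["tumour"]`. Rows with a different number of columns than the header raise a `ValueError`, which helps catch files that were saved with spaces instead of tabs.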

3. Setup

The pipeline requires the following:

Software

Package/Program	Version *
python	2.7.x
mutationseq	4.3.7
R	3.1.x or higher

python should have the following packages installed:
- sklearn 0.14.1 (Other versions are not supported)
- IntervalTree
- numpy (tested for version 1.7.1 and highly recommended to link against BLAS)
- scipy (tested for version 0.12.0)
- scikits-learn (tested for version 0.14.1)
- matplotlib (tested for version 1.2.1)
- bamtools (tested for version 2.3.0 but modified slightly to meet our needs. included with mutationseq.)
- boost (version 1.51.0 or higher)
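
One way to compare an environment against the tested versions listed above is to import each package and read its `__version__` attribute. This is a rough sketch under stated assumptions: the helper names are hypothetical, only packages whose import names are well known are included (`sklearn` is the import name of scikit-learn; IntervalTree, bamtools and boost are left out), and a package without a `__version__` attribute is reported as "unknown".

```python
import importlib

# Versions the list above reports as tested, keyed by import name.
TESTED_VERSIONS = {
    "numpy": "1.7.1",
    "scipy": "0.12.0",
    "sklearn": "0.14.1",
    "matplotlib": "1.2.1",
}


def installed_version(module_name):
    """Return the module's version string, 'unknown', or None if missing."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None  # package is not importable by this interpreter
    return getattr(module, "__version__", "unknown")


def compare_versions(tested=TESTED_VERSIONS):
    """Map each import name to a (tested, installed) pair of versions."""
    report = {}
    for name, wanted in tested.items():
        report[name] = (wanted, installed_version(name))
    return report
```

`compare_versions()` returns a dict such as `{"numpy": ("1.7.1", <installed version or None>), ...}`. Run it with the same python executable that the pipeline will use, since another interpreter may see different packages.
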
R should have the following packages installed:

Installing mutationSeq:

Mutationseq relies on the pybam library, which must be compiled before you can run the pipeline. To check whether the library is compatible with your python, follow these steps:

    cd /path/to/pipeline/components/run_mutationseq/component_seed/
    python
    >>> import pybam

An incompatible pybam library should generate an exception similar to the following:

    ImportError: ./pybam.so: undefined symbol: PyUnicodeUCS4_FromEncodedObject

To recompile the pybam library, follow these steps:

    cd /path/to/pipeline/components/run_mutationseq/component_seed/
    rm -rf pybam.so
    rm -rf build/
    make BOOSTPATH=/path/to/boost PYTHON=python

The make command requires python to compile the library and uses the system's default python. Please ensure that the path to your python installation is in the PATH variable. You can check whether your python install is set up properly by running:

    which python

The command should point to the python installation that will be used to run the pipeline. See the Mutationseq documentation for further details.
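
The ImportError shown earlier names a `PyUnicodeUCS4` symbol. This typically means that pybam.so was built against a wide-unicode (UCS4) python but is being imported by a narrow (UCS2) build. The sketch below reports which interpreter is running and its unicode width. The function name `describe_interpreter` is hypothetical and the snippet is a diagnostic aid, not part of the pipeline.

```python
import sys


def describe_interpreter():
    """Report the running python's path, version and unicode build width."""
    # sys.maxunicode is 0xFFFF on narrow (UCS2) Python 2 builds and
    # 0x10FFFF on wide (UCS4) builds.
    width = "UCS4" if sys.maxunicode > 0xFFFF else "UCS2"
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % sys.version_info[:3],
        "unicode_width": width,
    }
```

Run this with both the python used for `make` and the python used to run the pipeline. If the reported executable or unicode width differs between the two, recompiling pybam with the pipeline's python is the likely fix.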

Mutationseq Models:

mutationseq uses different models for the paired and the single modes; both are included with the mutationseq package. The models are pickled with python 2.7.* and sklearn 0.14.1 and should be loaded on a similar setup. Model compatibility can be checked in the python interpreter by running:

    python
    >>> from sklearn.externals import joblib
    >>> _ = joblib.load('/path/to/model.npz')

An incompatible model file will generate an exception similar to the following:

    TypeError: __cinit__() takes exactly 3 positional arguments (8 given)

    AttributeError: 'module' object has no attribute 'BestSplitter'

    ValueError: Buffer dtype mismatch, expected 'SIZE_t' but got 'int'

while an IOError exception would indicate an incorrect path.
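
The check above can be wrapped in a small function that separates a wrong path from an incompatible model. This is a sketch only: `check_model` and its `loader` parameter are hypothetical names. By default it uses the same `sklearn.externals.joblib` import as the interpreter session above, which is available in the sklearn 0.14.1 setup the models were pickled with.

```python
def check_model(path, loader=None):
    """Try to load a model file and classify the outcome as a short string."""
    if loader is None:
        # Same import as in the interpreter session above; it exists in the
        # sklearn 0.14.1 setup this README targets.
        from sklearn.externals import joblib
        loader = joblib.load
    try:
        loader(path)
    except IOError as exc:
        # An IOError indicates an incorrect path.
        return "bad path: %s" % exc
    except (TypeError, AttributeError, ValueError) as exc:
        # These exception types match the incompatibility errors listed above.
        return "incompatible model: %s" % exc
    return "ok"
```

The optional `loader` argument lets the logic be exercised without sklearn installed. In normal use, `check_model('/path/to/model.npz')` is enough.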

Reference files and flags

To run the pipeline, you will need to add the following paths and settings to the setup file:

python: path to the python executable
mutationseq: path to the mutationseq executable
R: path to the R executable
reference: path to the reference genome fasta file
ld_library_path: LD_LIBRARY_PATH to use for python (set to None if the path is already set properly)
pythonpath: path to python's site-packages (set to None if the path is already set properly)
positions_file: path to the positions_file file
map: path to the map file
gc: path to the gc file
gene_sets_gtf: path to the gene_sets_gtf file
interval_file: path to the interval file (included with the pipeline)
r_libs: specify R_LIBS for loading the R packages (set to None if set properly or if packages are installed globally)
genome_type: specify the reference genome type (NCBI or UCSC)
model: path to the mutationseq model file (model_single_v4.0.2.npz file, included with mutationseq)
museq_interval_file: set to None if using the NCBI genome, specify path to the interval file included with the pipeline if running on UCSC aligned bam files
y_threshold: threshold on the required number of calls in y-chromosome to consider it when running TITAN
target_list: path to the target_list file (required if running on exomes)
chromosomes: specify the target chromosomes for TITAN.
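
Some of the rules above can be checked before a run is launched. The sketch below is illustrative and assumes the setup values have already been read into a plain dict; how the setup file itself is parsed is not covered here. `EXPECTED_KEYS`, `is_unset` and `check_setup` are hypothetical names. `target_list` is left out of the expected keys because it is only required for exomes.

```python
# Keys taken from the list above (target_list is omitted: exome runs only).
EXPECTED_KEYS = [
    "python", "mutationseq", "R", "reference", "ld_library_path",
    "pythonpath", "positions_file", "map", "gc", "gene_sets_gtf",
    "interval_file", "r_libs", "genome_type", "model",
    "museq_interval_file", "y_threshold", "chromosomes",
]


def is_unset(value):
    """True for None, an empty string, or the literal string 'None'."""
    return value is None or str(value).strip() in ("", "None")


def check_setup(params):
    """Return a list of human-readable problems found in the setup values."""
    problems = ["missing key: %s" % key
                for key in EXPECTED_KEYS if key not in params]
    genome_type = params.get("genome_type")
    if genome_type not in ("NCBI", "UCSC"):
        problems.append("genome_type must be NCBI or UCSC, got %r"
                        % (genome_type,))
    interval = params.get("museq_interval_file")
    if genome_type == "UCSC" and is_unset(interval):
        problems.append("museq_interval_file must be a path for UCSC BAMs")
    if genome_type == "NCBI" and not is_unset(interval):
        problems.append("museq_interval_file should be None for NCBI")
    return problems
```

An empty list means none of these checks failed. It does not verify that the paths exist or that the files have the right content.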

4. The output

The output files will be saved in:

    /path/to/output/directory/{run_id}/{sample_id}/outputs/

The Titan Pipeline generates the following output files:

{sample_id}_outigv_[0-n].seg.pygenes * : Pygenes annotated IGV compatible segments
titan_plots/ : Each data point in each track represents a germline heterozygous SNP locus in the TITAN analysis. Three tracks are generated for each plot:
- Copy number alterations (log ratio)
- Loss of heterozygosity (allelic ratio)
- Cellular prevalence and clonal clusters

* n depends on interval file

All final results are stored in the outputs/results/ directory.
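
For scripts that collect results, the path layout and file naming described above can be expressed as two small helpers. This is a sketch with hypothetical function names. It only builds strings from the patterns given in this section, and it makes no claim about which subdirectory a given file ends up in.

```python
import posixpath


def sample_output_dir(output_root, run_id, sample_id):
    """Build {output_root}/{run_id}/{sample_id}/outputs as described above."""
    return posixpath.join(output_root, run_id, sample_id, "outputs")


def segment_file_names(sample_id, n):
    """List the {sample_id}_outigv_[0-n].seg.pygenes names for indices 0..n."""
    return ["%s_outigv_%d.seg.pygenes" % (sample_id, i)
            for i in range(n + 1)]
```

For example, `sample_output_dir("/path/to/output/directory", "run1", "SA123")` gives `/path/to/output/directory/run1/SA123/outputs`, where `run1` stands in for an actual run id. The value of `n` depends on the interval file, so it has to be supplied by the caller.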

5. Changelog

v5.3 fixed a bug in calc_optimal_clusters, updated titan parameter names
v5.2 switched from pipeline factory to kronos
v5.0 performance improvements
v4.8 added support for new shahlab cluster
v4.6 pipeline suggests an optimal cluster

For more information

See http://kronos.readthedocs.org/en/latest/ or contact dgrewal@bccrc.ca.
