MO-BCCRC / titan_workflow

KRONOS workflow for TITAN pipeline

Home Page: http://compbio.bccrc.ca/software/titan/

Titan Pipeline:

Development information

Date Created: October 30 2014
Last Update: Mar 4, 2016 by dgrewal
Developer: Diljot Grewal <dgrewal@bccrc.ca>
Input: bam
Output: params.txt, .RData, seg, segs.txt, segs.txt.pygenes, titan.txt
Version: 5.3

The TITAN pipeline accepts a list of tumour-normal pairs of BAM files as input and infers the clonal clusters of events along with estimates of cellular prevalence, normal contamination and tumour ploidy. The pipeline follows these steps:

  • Identify germline heterozygous SNP positions in the matched normal BAM file. This step is represented by run_mutationseq_TASK_1 in the workflow.

  • Extract the tumour allele read counts from the tumour BAM file at each of the germline heterozygous SNPs from Step 1 (generates input file #1). This step is represented by run_mutationseq_TASK_1 and convert_museq_vcf2counts_TASK_2 in the workflow.

  • Extract the tumour read depth from the tumour BAM file using the HMMcopy suite. Correct GC content and mappability biases using the HMMcopy R package (generates input file #2). This step is represented by the following tasks in the workflow:

    • run_readcounter_TASK_3,

    • run_readcounter_TASK_4,

    • calc_correctreads_wig_TASK_5

  • Run TitanCNA, including generating figures for chromosome plots. This step is represented by the following tasks in the workflow:

    • run_titan_TASK_6,

    • plot_titan_TASK_7,

    • calc_cnsegments_titan_TASK_8,

    • annot_pygenes_titan_TASK_9
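
The step-to-task mapping above can be written down as a small lookup table. This is only an illustrative sketch: the task names come from the list above, while the short step labels, the `STEPS` table and the `steps_for_task` helper are hypothetical names and not part of the pipeline.

```python
from collections import OrderedDict

# Task names are taken from the list above. The step labels, STEPS and
# steps_for_task are illustrative only, not pipeline code.
STEPS = OrderedDict([
    ("identify germline heterozygous SNPs",
     ["run_mutationseq_TASK_1"]),
    ("extract tumour allele counts",
     ["run_mutationseq_TASK_1", "convert_museq_vcf2counts_TASK_2"]),
    ("extract and correct tumour read depth",
     ["run_readcounter_TASK_3", "run_readcounter_TASK_4",
      "calc_correctreads_wig_TASK_5"]),
    ("run TitanCNA and generate plots",
     ["run_titan_TASK_6", "plot_titan_TASK_7",
      "calc_cnsegments_titan_TASK_8", "annot_pygenes_titan_TASK_9"]),
])


def steps_for_task(task):
    """Return the 1-based step numbers in which a task appears."""
    return [number
            for number, tasks in enumerate(STEPS.values(), start=1)
            if task in tasks]
```

For example, `steps_for_task("run_mutationseq_TASK_1")` reports both Step 1 and Step 2, matching the description above.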

1. Getting Started

The documentation for Kronos can be found at http://kronos.readthedocs.org/en/latest/.

2. The Inputs

The pipeline takes a tab-delimited file as input. The header of the file defines the keys, and each row supplies one value per key.

An input file for the pipeline should resemble the following:

#sample_id    tumour_id    tumour_library_id    tumour    normal_id    normal_library_id    normal
SA123_A01234_SA123N_A01235    SA123    A01234    /path/to/SA123.bam    SA123N    A01235    /path/to/SA123N.bam
SA223_A01234_SA223N_A01235    SA223    A01234    /path/to/SA223.bam    SA223N    A01235    /path/to/SA223N.bam
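
To sanity-check an input file before launching a run, it can be read back into one dictionary per row. The sketch below is not part of the pipeline; `read_samples` is a hypothetical helper, and it assumes the columns are separated by real tab characters and that the leading `#` on the header line is not part of the first key.

```python
import csv


def read_samples(lines):
    """Parse tab-delimited input lines; return one dict per sample row."""
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    # Assumption: the '#' prefix marks the header and is not part of the key.
    keys = [header[0].lstrip("#")] + header[1:]
    samples = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        if len(row) != len(keys):
            raise ValueError("expected %d columns, got %d: %r"
                             % (len(keys), len(row), row))
        samples.append(dict(zip(keys, row)))
    return samples


example = (
    "#sample_id\ttumour_id\ttumour_library_id\ttumour\t"
    "normal_id\tnormal_library_id\tnormal\n"
    "SA123_A01234_SA123N_A01235\tSA123\tA01234\t/path/to/SA123.bam\t"
    "SA123N\tA01235\t/path/to/SA123N.bam\n"
)
samples = read_samples(example.splitlines())
print(samples[0]["tumour"])  # /path/to/SA123.bam
```

For a file on disk, an open file handle can be passed in place of the list of lines.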

3. Setup

The pipeline requires the following:

Software

Package/Program    Version *
python             2.7.x
mutationseq        4.3.7
R                  3.1.x or higher

Installing mutationSeq:

Mutationseq relies on the pybam library, which must be compiled before you can run the pipeline. To check whether the library is compatible with your python, run the following:

    cd /path/to/pipeline/components/run_mutationseq/component_seed/
    python
    >>> import pybam

An incompatible pybam library should generate an exception similar to the following:

    ImportError: ./pybam.so: undefined symbol: PyUnicodeUCS4_FromEncodedObject
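
On Python 2, an undefined `PyUnicodeUCS4_...` or `PyUnicodeUCS2_...` symbol typically means the library was compiled against an interpreter with a different unicode width than the one loading it. The sketch below shows one way to check which kind of build is in use; `unicode_build` is a hypothetical helper, not part of mutationseq or the pipeline.

```python
import sys


def unicode_build(maxunicode=None):
    """Describe the interpreter's unicode width from sys.maxunicode."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # 0xFFFF marks a narrow (UCS2) build, 0x10FFFF a wide (UCS4) build.
    return "UCS2" if maxunicode == 0xFFFF else "UCS4"


print(sys.executable)   # the interpreter actually in use
print(unicode_build())  # compare with the UCSn in the missing symbol name
```

If the reported width differs from the one named in the ImportError, recompiling pybam with the interpreter that will run the pipeline is the likely fix.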

To recompile the pybam library, run the following:

    cd /path/to/pipeline/components/run_mutationseq/component_seed/
    rm -rf pybam.so
    rm -rf build/
    make BOOSTPATH=/path/to/boost PYTHON=python 

The make command requires python to compile the library and uses the system's default python. Please ensure that the path to your python installation is included in the PATH variable. You can check that your python install is set up properly by running:

    which python

The command should point to the python installation that will be used to run the pipeline. Mutationseq documentation can be found here

Mutationseq Models:

mutationseq uses different models for the paired and single modes; both are included with the mutationseq package. The models are pickled with python 2.7.* and sklearn 0.14.1 and should be loaded on a similar setup. Model compatibility can be checked in the python interpreter by running:

    python
    >>> from sklearn.externals import joblib
    >>> _ = joblib.load('/path/to/model.npz')

An incompatible model file will generate an exception similar to one of the following:

    TypeError: __cinit__() takes exactly 3 positional arguments (8 given)

    AttributeError: 'module' object has no attribute 'BestSplitter'

    ValueError: Buffer dtype mismatch, expected 'SIZE_t' but got 'int'

An IOError exception, by contrast, indicates an incorrect path.
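
The check above can be wrapped in a small function that separates a wrong path from an incompatible model. This is a minimal sketch under the exception types listed above; `diagnose_model_load` is a hypothetical helper, and the loader is passed in as an argument so that the function does not depend on any particular sklearn version.

```python
def diagnose_model_load(path, load):
    """Try load(path) and classify the outcome.

    `load` is any callable that takes a path, for example joblib.load.
    Returns "ok", "bad_path" or "incompatible".
    """
    try:
        load(path)
    except IOError:
        # The file could not be opened: check the path.
        return "bad_path"
    except (TypeError, AttributeError, ValueError):
        # Exception types listed above for incompatible model files.
        return "incompatible"
    return "ok"
```

With the setup described above it could be called as `diagnose_model_load('/path/to/model.npz', joblib.load)`.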

Reference files and flags

To run the TITAN pipeline, add the paths to the following data in the setup file:

  • python: path to the python executable
  • mutationseq: path to the mutationseq executable
  • R: path to the R executable
  • reference: path to the reference genome fasta file
  • ld_library_path: specify LD_LIBRARY_PATH for python (set to None if the path is already set properly)
  • pythonpath: specify the path to python's site-packages (set to None if the path is already set properly)
  • positions_file: path to the positions_file file
  • map: path to the map file
  • gc: path to the gc file
  • gene_sets_gtf: path to the gene_sets_gtf file
  • interval_file: path to the interval file (included with the pipeline)
  • r_libs: specify R_LIBS for loading the R packages (set to None if set properly or if packages are installed globally)
  • genome_type: specify the reference genome type (NCBI or UCSC)
  • model: path to the mutationseq model file (model_single_v4.0.2.npz file, included with mutationseq)
  • museq_interval_file: set to None if using the NCBI genome, specify path to the interval file included with the pipeline if running on UCSC aligned bam files
  • y_threshold: minimum number of calls required on the Y chromosome for it to be considered when running TITAN
  • target_list: path to the target_list file (required if running on exomes)
  • chromosomes: specify the target chromosomes for TITAN.
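
A few of the rules in the list above can be checked mechanically before a run. The sketch below assumes the setup values have already been read into a Python dict keyed by the names above; `REQUIRED_KEYS` and `check_setup` are hypothetical names, and `target_list` is left out of the required keys because it is only needed for exomes.

```python
REQUIRED_KEYS = [
    "python", "mutationseq", "R", "reference", "ld_library_path",
    "pythonpath", "positions_file", "map", "gc", "gene_sets_gtf",
    "interval_file", "r_libs", "genome_type", "model",
    "museq_interval_file", "y_threshold", "chromosomes",
]


def check_setup(setup):
    """Return a list of problems found in a dict of setup values."""
    problems = ["missing key: %s" % key
                for key in REQUIRED_KEYS if key not in setup]
    genome = setup.get("genome_type")
    if genome not in ("NCBI", "UCSC"):
        problems.append("genome_type must be NCBI or UCSC")
    # museq_interval_file is None for NCBI and a path for UCSC.
    interval = setup.get("museq_interval_file")
    unset = interval in (None, "None")
    if genome == "NCBI" and not unset:
        problems.append("museq_interval_file should be None for NCBI")
    if genome == "UCSC" and unset:
        problems.append("museq_interval_file is required for UCSC")
    return problems
```

An empty list means none of these particular checks failed; it does not verify that the paths themselves exist.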

4. The output

The output files will be saved in:

    /path/to/output/directory/{run_id}/{sample_id}/outputs/

The Titan Pipeline generates the following output files:

  • {sample_id}_outigv_[0-n].seg.pygenes * : Pygenes annotated IGV compatible segments
  • titan_plots/ : Each data point in each track represents a germline heterozygous SNP locus in the TITAN analysis. Three tracks are generated for each plot:
    • Copy number alterations (log ratio)
    • Loss of heterozygosity (allelic ratio)
    • Cellular prevalence and clonal clusters

* n depends on the interval file

All final results are stored in the outputs/results/ directory.
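
The path and file name patterns above can be handled with a couple of small helpers. This is a sketch only: `output_dir` and `pygenes_files` are hypothetical names, and the code assumes that the `[0-n]` part of the file name is a plain integer index.

```python
import os
import re


def output_dir(base, run_id, sample_id):
    """Build {base}/{run_id}/{sample_id}/outputs from the pattern above."""
    return os.path.join(base, run_id, sample_id, "outputs")


def pygenes_files(filenames, sample_id):
    """Pick out {sample_id}_outigv_<n>.seg.pygenes names, sorted by n."""
    # Assumption: <n> is a non-negative integer index.
    pattern = re.compile(
        r"^%s_outigv_(\d+)\.seg\.pygenes$" % re.escape(sample_id))
    matches = []
    for name in filenames:
        found = pattern.match(name)
        if found:
            matches.append((int(found.group(1)), name))
    return [name for _, name in sorted(matches)]
```

Passing the result of `os.listdir` on the relevant directory to `pygenes_files` returns the annotated segment files for one sample in numeric order.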

5. Changelog

  • v5.3 fixed a bug in calc_optimal_clusters, updated titan parameter names
  • v5.2 switched from pipeline factory to kronos
  • v5.0 performance improvements
  • v4.8 added support for new shahlab cluster
  • v4.6 pipeline suggests an optimal cluster

For more information

See http://kronos.readthedocs.org/en/latest/ or contact dgrewal@bccrc.ca.
