wisekh6 / JaBbA

MIP based joint inference of copy number and rearrangement state in cancer whole genome sequence data.

JaBbA (Junction Balance Analysis)

Inferring balanced cancer genome graphs with mixed-integer programming analysis of read depth and junction patterns in WGS data.

Installation (R package)

  1. Install IBM ILOG CPLEX Studio. The software is proprietary, but can be obtained for free under IBM's academic initiative.

  2. Set the CPLEX_DIR environment variable to your CPLEX Studio installation directory:

export CPLEX_DIR=/path/to/your/copy/of/CPLEX_Studio/

NOTE: if CPLEX_DIR is set correctly then $CPLEX_DIR/cplex/include and $CPLEX_DIR/cplex/lib should both exist.

  3. Install JaBbA (run inside R):
devtools::install_github('mskilab/JaBbA')
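
If the installation fails, one thing worth checking is whether CPLEX_DIR satisfies the note above. The following is a minimal sketch of such a check; `check_cplex_dir` is a hypothetical helper written for this illustration and is not part of JaBbA or CPLEX.

```shell
# Hypothetical helper (not part of JaBbA): verify that a CPLEX Studio
# directory contains the two subdirectories named in the note above.
# Uses the first argument if given, otherwise the CPLEX_DIR variable.
check_cplex_dir() {
    dir="${1:-$CPLEX_DIR}"
    if [ -z "$dir" ]; then
        echo "CPLEX_DIR is not set" >&2
        return 1
    fi
    status=0
    for sub in cplex/include cplex/lib; do
        if [ -d "$dir/$sub" ]; then
            echo "found:   $dir/$sub"
        else
            echo "missing: $dir/$sub" >&2
            status=1
        fi
    done
    return "$status"
}
```

Calling `check_cplex_dir` with no arguments checks the exported CPLEX_DIR; a non-zero exit status means at least one of the two directories is missing.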

Installation (jba executable)

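  1. After installing the R package, clone the JaBbA repository (its test data is used in the test run below) and add the installed package's extdata directory, which contains the jba script, to PATH: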
$ export PATH=${PATH}:$(Rscript -e 'cat(paste0(installed.packages()["JaBbA", "LibPath"], "/JaBbA/extdata/"))')
$ jba ## to see the help message
  2. Test run the jba executable on the provided data:
$ jba JaBbA/inst/extdata/junctions.vcf JaBbA/inst/extdata/coverage.txt 

 _____         ___    _      _____ 
(___  )       (  _`\ ( )    (  _  )
    | |   _ _ | (_) )| |_   | (_) |
 _  | | /'_` )|  _ <'| '_`\ |  _  |
( )_| |( (_| || (_) )| |_) )| | | |
`\___/'`\__,_)(____/'(_,__/'(_) (_)

(Junction     Balance     Analysis)

JaBbA 2018-02-13 21:29:50: Located junction file JaBbA/inst/extdata/junctions.vcf
JaBbA 2018-02-13 21:29:50: Located coverage file JaBbA/inst/extdata/coverage.txt
JaBbA 2018-02-13 21:29:50: Loading packages ...
JaBbA 2018-02-13 21:30:00: Starting analysis in ./jba_out
JaBbA 2018-02-13 21:32:13: Done .. job output in: ./jba_out
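
If the `jba` command is not found, the PATH export from step 1 probably did not take effect in the current shell. A small sketch of a check follows; `find_jba` is a hypothetical helper name, not something shipped with JaBbA.

```shell
# Hypothetical helper (not part of JaBbA): report where the jba script
# is found on PATH, or fail with a message if it is not reachable.
find_jba() {
    if command -v jba >/dev/null 2>&1; then
        echo "jba found at: $(command -v jba)"
    else
        echo "jba not found on PATH; re-run the export from step 1" >&2
        return 1
    fi
}
```

Note that the export only affects the current shell session; adding it to your shell startup file is one way to make it persistent.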

Usage (jba executable)

Usage: jba [options] JUNCTIONS COVERAGE
	JUNCTIONS can be BND style vcf, bedpe, rds of GRangesList
	COVERAGE is a .wig, .bw, .bedgraph, .bed, .rds of a GRanges, or .tsv / .csv / .txt file that is coercible to a GRanges
	use --field=FIELD argument to specify which column to use if specific meta field of a multi-column table

Options:
	-s SEG, --seg=SEG
		Path to .rds file of GRanges object of intervals corresponding to initial segmentation (required)

	-c COVERAGE, --coverage=COVERAGE
		Path to .rds, file of GRanges object of fine scale genomic coverage / abundance as tiled intervals (100 - 5000 bp) along genome (required)

	-j JUNCTIONS, --junctions=JUNCTIONS
		Path to rearrangement file, which can be VCF breakend format, dRanger tab delim output,  or an rds of GRangesList of signed locus pairs pointing AWAY from junction (required)

	--j.supp=J.SUPP
		Supplementary junctions which will be used in subsequent iterations, same format as '--junctions'

	-i TFIELD, --tfield=TFIELD
		Name of meta data field of ra GRanges or data frame that specifies tiers of junctions, where tier 1 is forced to be included

	-b NSEG, --nseg=NSEG
		Path to .rds file of GRanges object of intervals corresponding to normal tissue copy number, needs to have $cn field

	-d HETS, --hets=HETS
		Path to tab delimited hets file output of pileup with fields seqnames, start, end, alt.count.t, ref.count.t, alt.count.n, ref.count.n

	-l LIBDIR, --libdir=LIBDIR
		Directory containing karyoMIP.R file (eg default GIT.HOME/isva)

	-o OUTDIR, --outdir=OUTDIR
		Directory to dump output into

	-n NAME, --name=NAME
		Sample / Individual name

	-f FIELD, --field=FIELD
		Name of meta data field or column to use for abundance / coverage signal from abundance / coverage signal file

	-k SLACK, --slack=SLACK
		Slack penalty to apply per loose end copy

	-z SUBSAMPLE, --subsample=SUBSAMPLE
		Numeric value between 0 and 1 specifying whether to subsample coverage for intra segment variance estimation

	-t TILIM, --tilim=TILIM
		Time limit for JaBbA MIP

	-p PLOIDY, --ploidy=PLOIDY
		Ploidy guess

	-q PURITY, --purity=PURITY
		Purity guess

	--cores=CORES
		Number of cores for JaBbA MIP

	-m ITERATE, --iterate=ITERATE
		How many times to iterate through tiers

	-w WINDOW, --window=WINDOW
		Window to dumpster dive for junctions around loose ends

	-e EDGENUDGE, --edgenudge=EDGENUDGE
		Edge nudge for optimization, to be multiplied by edge specific confidence score if provided

	--ppmethod=PPMETHOD
		choose from sequenza, ppurple, or ppgrid to estimate purity ploidy

	--indel
		if TRUE will force the small isolated junctions in tier 2 to have non-zero copy numbers

	--allin
		if TRUE will put all tiers in the first round of iteration

	--boolean
		if TRUE will use Boolean loose end penalty

	--epgap=EPGAP
		threshold for calling convergence

	-x STRICT, --strict=STRICT
		if TRUE will restrict input junctions to only the subset overlapping seg

	--gurobi=GUROBI
		if TRUE will use gurobi (gurobi R package must be installed) and if FALSE will use CPLEX (cplex must be installed prior to library installation)

	-v, --verbose
		verbose output

	-y, --nudgebalanced
		Manually nudge balanced junctions into the model.

	--maxna=MAXNA
		Any node with more NA than this fraction will be ignored

	--blacklist=BLACKLIST
		Path to .rds, BED, TXT, containing the blacklisted regions of the reference genome

	--geno
		Whether to consider the `GENO` field in the input junctions VCF, use this flag if your SV VCF is generated by SvABA multisample run

	-h, --help
		Show this help message and exit
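
As an illustration of how these options combine, here is a sketch of a wrapper that assembles a jba command line in the `jba [options] JUNCTIONS COVERAGE` order shown above. The wrapper name `jba_example` and every option value in it (the `ratio` field name, the output directory, the sample name, the core count) are placeholders chosen for this example, not recommended settings. By default it only prints the command; it runs jba only when RUN=1 is set.

```shell
# Hypothetical wrapper (not part of JaBbA). All option values are
# placeholders; adjust them to your own data.
jba_example() {
    junctions="$1"   # e.g. a BND-style VCF
    coverage="$2"    # e.g. a coverage table coercible to a GRanges
    set -- jba \
        --field=ratio \
        --outdir=./jba_out \
        --name=sample1 \
        --cores=4 \
        --verbose \
        "$junctions" "$coverage"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"          # actually run jba
    else
        echo "$*"     # dry run: print the command line only
    fi
}
```

For example, `jba_example JaBbA/inst/extdata/junctions.vcf JaBbA/inst/extdata/coverage.txt` prints the full command for inspection before anything is executed.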

Output (R package)

  1. jabba.simple[.rds|.png|.cnv.vcf|.gg.rds]

    Main results: the optimized and simplified rearrangement graph. The four formats are an R list object, a PNG image of the graph generated by gTrack, a VCF of the copy number variations, and a gGraph object constructed with gGnome (see the gGnome documentation to learn more). In the list output, field "segstats" is the GRanges object of the nodes (including loose ends), field "adj" is the adjacency matrix, field "edges" is the edge table, and field "gtrack" is the gTrack object used to generate the plot in the PNG file.

  2. karyograph.rds.ppfit.png

    This plot shows the distribution of the raw segmental mean of the coverage signal, with red dashed vertical lines indicating the grid of integer copy number states. When the grid aligns well with the peaks in the underlying histogram, the purity/ploidy estimation was relatively successful.

  3. jabba.seg.txt

    SEG format file of the final segmental copy numbers, compatible with IGV, ABSOLUTE, GISTIC, and many other tools.

  4. opt.report.rds

    This file contains an R data.table object of the convergence statistics of all the sub-problems (identified by "cl" column). The column "convergence" indicates the state of the final solution:

    • 1: converged quickly, within short time limit (input tilim/10) to the stringent epgap (input epgap/1000)
    • 2: converged roughly, within short time limit to the relaxed epgap (input epgap)
    • 3: converged after a second round, within long time limit to the relaxed epgap
    • 4: failed to converge after a second round; still above the relaxed epgap even after the long time limit

    For a detailed explanation of tilim and epgap, please read our manuscript and the CPLEX documentation.
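
The convergence codes and the derived limits above can be sketched as two small shell helpers. Both function names are hypothetical and written only for this illustration; they restate the list above and do not read opt.report.rds themselves. The tilim and epgap values passed in are whatever you supplied to jba.

```shell
# Hypothetical helpers (not part of JaBbA).

# Translate a "convergence" code into the description given above.
describe_convergence() {
    case "$1" in
        1) echo "converged quickly: short time limit, stringent epgap" ;;
        2) echo "converged roughly: short time limit, relaxed epgap" ;;
        3) echo "converged after a second round: long time limit, relaxed epgap" ;;
        4) echo "not converged: still above the relaxed epgap after the long time limit" ;;
        *) echo "unknown convergence code: $1" >&2; return 1 ;;
    esac
}

# Print the short time limit (tilim / 10) and the stringent epgap
# (epgap / 1000) derived from the input tilim and epgap.
short_limits() {
    awk -v tilim="$1" -v epgap="$2" 'BEGIN {
        printf "short_tilim=%g stringent_epgap=%g\n", tilim / 10, epgap / 1000
    }'
}
```

For instance, with an illustrative tilim of 1000 and epgap of 0.5, `short_limits 1000 0.5` prints `short_tilim=100 stringent_epgap=0.0005`.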

Attributions

Marcin Imielinski - Assistant Professor, Weill Cornell Medicine; Core Member, New York Genome Center.

Xiaotong Yao - Graduate Research Assistant, Weill Cornell Medicine, New York Genome Center.

License: MIT License

