Parascopy

Parascopy is designed for robust and accurate estimation of paralog-specific copy number for duplicated genes using whole-genome sequencing.

Created by Timofey Prodanov (timofey.prodanov[at]gmail.com) and Vikas Bansal (vibansal[at]health.ucsd.edu) at the University of California San Diego.

Citing Parascopy

The paper is currently in progress; please check back later.

Installation

You can install Parascopy using conda:

conda config --add channels bioconda
conda config --add channels conda-forge
conda install -c bioconda parascopy

Alternatively, you can install it manually using the following commands:

git clone https://github.com/tprodanov/parascopy.git
cd parascopy
python3 setup.py install

To skip dependency installation, you can run:

python3 setup.py develop --no-deps

Additionally, you can specify the installation path using --prefix <path>.

Some parascopy commands require external tools to be installed.

You do not need to install these tools if you installed parascopy through conda.

General usage

Parascopy is built around a homology table, a database of duplications in the genome.

To construct a homology table, run:

parascopy pretable -f genome.fa -o pretable.bed.gz
parascopy table -i pretable.bed.gz -f genome.fa -o table.bed.gz

Note that the reference genome should be indexed with both samtools faidx and bwa index before you run these commands. Alternatively, you can download a precomputed homology table.
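The indexing requirement can be checked before building the table. The following is a minimal sketch, not part of parascopy itself: the helper names missing_indexes and index_reference are hypothetical, and the sketch assumes the default file names written by samtools faidx (.fai) and bwa index (.amb, .ann, .bwt, .pac, .sa).

```shell
# Print every index file that is still missing for a reference FASTA.
# Assumes the default output names of `samtools faidx` and `bwa index`.
missing_indexes() {
    ref="$1"
    for ext in fai amb ann bwt pac sa; do
        [ -e "${ref}.${ext}" ] || echo "${ref}.${ext}"
    done
}

# Build only the indexes that do not exist yet (hypothetical helper).
index_reference() {
    ref="$1"
    [ -e "${ref}.fai" ] || samtools faidx "$ref"
    [ -e "${ref}.bwt" ] || bwa index "$ref"
}
```

Calling missing_indexes genome.fa prints nothing once both indexes are in place; index_reference genome.fa creates whichever one is absent.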

To estimate aggregate and paralog-specific copy number (agCN and psCN) across multiple samples, run:

# Calculate background read depth.
parascopy depth -I input.list -g hg38 -f genome.fa -o depth
# Estimate agCN and psCN for multiple samples.
parascopy cn -I input.list -t table.bed.gz -f genome.fa -R regions.bed -d depth -o out1
# Estimate agCN and psCN using model parameters from a previous run.
parascopy depth -I input2.list -g hg38 -f genome.fa -o depth2
parascopy cn-using out1/model -I input2.list -t table.bed.gz -f genome.fa -d depth2 -o out2

See parascopy help or parascopy <command> --help for more information.
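The depth and cn steps above can be chained so that copy number estimation only starts if the depth calculation succeeds. This is a rough sketch under assumptions: run_parascopy_cn is a hypothetical wrapper, the "<output>_depth" directory naming is an illustrative choice, and the file names (genome.fa, table.bed.gz, regions.bed) simply mirror the commands above. Only flags shown in this README are used.

```shell
# Hypothetical wrapper: background read depth, then agCN/psCN estimation.
# $1 = list of input files, $2 = output directory.
run_parascopy_cn() {
    list="$1"
    out="$2"
    # `&&` stops the chain if the depth step fails.
    parascopy depth -I "$list" -g hg38 -f genome.fa -o "${out}_depth" &&
    parascopy cn -I "$list" -t table.bed.gz -f genome.fa -R regions.bed \
        -d "${out}_depth" -o "$out"
}
```

For example, run_parascopy_cn input.list out1 would write the depth results to out1_depth and the copy number results to out1.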

Output files

See output file format here.

Precomputed data

The following precomputed data are available:

Precomputed homology tables: hg19 (25 Mb) and hg38 (40 Mb).
Precomputed model parameters for five superpopulations: hg38 (11 Mb).

Known issues

If the aggregate copy number jumps significantly within a short region (especially for disease-associated genes, such as SMN1), the alignment file may be missing reads for some duplicated loci. You can try to map the unaligned reads, or remap all reads using a different aligner. To extract unaligned reads, use samtools view input.bam "*" (this does not extract unmapped reads that have a mapped mate).
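The two extraction variants can be sketched as follows. The function names are hypothetical. The first wraps the command from the paragraph above (the "*" region query needs an indexed alignment file). The second is an alternative not taken from the parascopy documentation: it uses the standard samtools filter -f 4 to select every read flagged as unmapped, which also covers unmapped reads whose mate is mapped.

```shell
# Reads without alignment coordinates, as in the command above.
# $1 = input alignment file (indexed), $2 = output BAM.
extract_unplaced() {
    samtools view -b "$1" "*" > "$2"
}

# All reads with the "unmapped" SAM flag (0x4), including those
# placed next to a mapped mate. Scans the whole file.
extract_all_unmapped() {
    samtools view -b -f 4 "$1" > "$2"
}
```

For example, extract_all_unmapped input.bam unmapped.bam writes the unmapped reads to unmapped.bam, which can then be converted to FASTQ and remapped.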

Issues

Please submit issues here or send them to timofey.prodanov[at]gmail.com.

About

Paralog-specific copy number estimation for duplicated genes using WGS.

MIT License