Scripts for read-depth CNV annotation
git clone https://github.com/ksenia-krasheninnikova/cnv_scripts.git
Requires:
- kentUtils
- mrfast, mrsfast, mrcanavar http://mrcanavar.sourceforge.net/manual.html
- samtools
- bedtools
Genome Masking
- RepeatMasker
- Tandem Repeats Finder
- WindowMasker [deprecated]
- Partition scaffolds and contigs into kmers of 36 bp (with adjacent kmers overlapping by 5 bp) and map them to the assembly using mrsFAST to account for multi-mappings, e.g.:
split_assembly_to_substrings reference.fa 36 5 | sort | uniq | awk '{print "@kmer"NR"\n"$0}'> reference.kmers_36_5.fa
mrsfast --index reference.fa
mrsfast --threads 64 --search reference.fa --seq reference.kmers_36_5.fa -o reference.kmers_36_5.sam
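The partitioning step can be sketched in Python as follows. This is only an illustration, not the code of split_assembly_to_substrings: the function names are hypothetical, "overlapping by 5 bp" is assumed to mean a step of 36 - 5 = 31 bases between kmer starts, and kmers containing N are assumed to be skipped.

```python
def split_to_kmers(seq, k=36, overlap=5):
    """Yield kmers of length k whose neighbours share `overlap` bases."""
    if not 0 <= overlap < k:
        raise ValueError("overlap must satisfy 0 <= overlap < k")
    step = k - overlap  # assumption: 36 - 5 = 31 bases between kmer starts
    for start in range(0, len(seq) - k + 1, step):
        kmer = seq[start:start + k].upper()
        # Assumption: kmers touching gaps or hard-masked bases are dropped.
        if "N" not in kmer:
            yield kmer


def unique_kmers(sequences, k=36, overlap=5):
    """Return the sorted, de-duplicated kmers of all sequences (like sort | uniq)."""
    return sorted({kmer for seq in sequences for kmer in split_to_kmers(seq, k, overlap)})
```

With these defaults an 80 bp sequence gives two kmers, starting at offsets 0 and 31, and the last 5 bases of the first equal the first 5 bases of the second.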
find overrepresented kmers (mapped more than twice)
grep -v -e "@SQ" -e "@HD" reference.kmers_36_5.sam | cut -f10 | sort | uniq -c | sed 's/ \+ //g' | awk '{if ($1 > 2) print;}'> reference.kmers_36_5.lst
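For readers who prefer to do the counting without a shell pipeline, here is a minimal Python sketch of the same idea. The function name and the `min_count` parameter are hypothetical. As in the command above, it counts the SEQ column (field 10) as written in the SAM file and keeps sequences seen more than twice.

```python
from collections import Counter


def overrepresented_kmers(sam_lines, min_count=3):
    """Return {sequence: hits} for sequences aligned at least `min_count` times."""
    counts = Counter()
    for line in sam_lines:
        fields = line.rstrip("\n").split("\t")
        # Header lines start with a three-character tag such as @HD, @SQ or @PG.
        if fields[0].startswith("@") and len(fields[0]) == 3:
            continue
        if len(fields) < 11:
            continue  # not a complete alignment record
        # Field 10 (index 9) is SEQ. SAM stores it reverse-complemented for
        # reverse-strand hits, so both orientations are counted separately here.
        counts[fields[9]] += 1
    return {seq: n for seq, n in counts.items() if n >= min_count}
```

Passing an open file handle as `sam_lines` streams the SAM file line by line, so only the counts are held in memory.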
mask overrepresented kmers
FIXME: invokes zombie-processes and kills I/O
find_in_sam_to_bed reference.kmers_36_5.sam reference.kmers_36_5.lst > reference.kmers_36_5.bed
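The conversion from SAM hits to BED intervals can be sketched as a single streaming pass. This is not the implementation of find_in_sam_to_bed; the function name is hypothetical, and the sketch assumes ungapped alignments, so that each hit spans exactly len(SEQ) reference bases.

```python
def sam_hits_to_bed(sam_lines, kmers):
    """Yield (chrom, start, end) BED intervals for alignments whose SEQ is in `kmers`."""
    wanted = set(kmers)  # set lookup keeps the pass over the SAM file linear
    for line in sam_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip header lines (@HD, @SQ, @PG, ...) and incomplete records.
        if len(fields) < 11 or (fields[0].startswith("@") and len(fields[0]) == 3):
            continue
        rname, pos, seq = fields[2], int(fields[3]), fields[9]
        if seq in wanted and rname != "*":
            start = pos - 1  # SAM POS is 1-based; BED start is 0-based
            # Assumption: ungapped alignment, so the hit covers len(seq) bases.
            yield (rname, start, start + len(seq))
```

Reading the kmer list into a set once and streaming the SAM file avoids starting a separate search per kmer.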
mask from bed
maskFastaFromBed -fi reference.fa -bed kmers_36_5/all.bed -fo reference.kmers_36_5_overrepresented_kmers.fa
- Importantly, because reads do not map to positions that overlap masked regions of the reference assembly, read depth is lower at the edges of these regions, which could lead to underestimating the copy number in the subsequent step. To avoid this, the 36 bp flanking any masked region or gap are masked as well and are thus not included within the defined windows.
mask_padding reference.kmers_36_5_overrepresented_kmers.fa > reference.kmers_36_5_overrepresented_kmers_padding_36bp.fa
mv reference.kmers_36_5_overrepresented_kmers_padding_36bp.fa reference.final.fa
samtools faidx reference.final.fa
mrfast --index reference.final.fa
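The padding of masked regions described above can be illustrated with a short Python sketch. It is not the code of mask_padding: the function name is hypothetical, and masked bases and gaps are assumed to be runs of uppercase `N`.

```python
import re


def pad_masked_regions(seq, pad=36, mask_char="N"):
    """Extend every run of `mask_char` by `pad` bases on both sides."""
    chars = list(seq)
    # Runs are located in the original sequence, so padding never cascades.
    for match in re.finditer(f"{re.escape(mask_char)}+", seq):
        left = max(0, match.start() - pad)
        right = min(len(seq), match.end() + pad)
        chars[left:match.start()] = mask_char * (match.start() - left)
        chars[match.end():right] = mask_char * (right - match.end())
    return "".join(chars)
```

For example, `pad_masked_regions("ACGTNNACGT", pad=2)` returns `"ACNNNNNNGT"`; the sequence length is unchanged, so coordinates stay valid.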
- Run the prep mode of mrcanavar. First get the coordinates of the assembly gaps, e.g. (agp2bed.py is available from https://gtamazian.com/2016/06/23/converting-an-agp-file-to-the-bed-format/):
agp2bed.py hg38.agp hg38.gaps.bed
OR
get_gaps.py 306-KK-0012.fasta > gaps.bed
mrcanavar --prep -fasta reference.final.fa -gaps hg38.gaps.bed -conf reference.conf
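To show what the gap extraction involves, here is a minimal Python sketch that turns the gap lines of an AGP file into BED intervals. It is not the linked agp2bed.py script; the function name is hypothetical. It relies on the AGP convention that gap lines have component type `N` or `U` in column 5 and that object coordinates are 1-based and inclusive.

```python
def agp_gaps_to_bed(agp_lines):
    """Yield (object, start, end) BED intervals for the gap lines of an AGP file."""
    for line in agp_lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # malformed line
        obj, beg, end, component_type = fields[0], int(fields[1]), int(fields[2]), fields[4]
        # N = gap of specified size, U = gap of unknown size.
        if component_type in ("N", "U"):
            # AGP is 1-based inclusive; BED is 0-based half-open.
            yield (obj, beg - 1, end)
```

A gap line covering positions 101 to 150 of `chr1` becomes the BED interval `chr1 100 150`.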
Process Individuals
run_individuals_fastq_mapping pe_1.fastq.gz pe_2.fastq.gz reference.final.fa id working_dir destination_dir reference.conf --threads 10
see run_Mallick.sh
References:
[1] Alkan et al. Personalized Copy-Number and Segmental Duplication Maps using Next-Generation Sequencing, Nature Genetics, 2009