Very high panel sequencing depth issue: different depth output from mosdepth, bedtools coverage, and sambamba
ipstone opened this issue · comments
Hey Brent and everyone, thank you for all the awesome tools.
We have some high-coverage panel sequencing data, but checking the depth of the regions with mosdepth, bedtools, and sambamba gives quite a range of results (obtained by running the commands below through a Snakemake file).
The tools are run with their default settings. What might cause such a huge difference in the depth calculations?
Thanks in advance!
sambamba:
"sambamba depth region -L bed/study_genes.bed {input} > coverage/study_sambamba/interval_coverage/{wildcards.sample}_interval_coverage.txt"
# chrom chromStart chromEnd F3 readCount meanCoverage sampleName
1 36349022 36349047 NM_001317122_cds_0_0_chr1_36349023_f 502 400 Sample-10
1 36354027 36354211 NM_001317122_cds_1_0_chr1_36354028_f 958 372.63 Sample-10
1 36358157 36358278 NM_001317122_cds_2_0_chr1_36358158_f 593 309.859 Sample-10
1 36358697 36358879 NM_001317122_cds_3_0_chr1_36358698_f 777 323.709 Sample-10
1 36359274 36359411 NM_001317122_cds_4_0_chr1_36359275_f 834 374.27 Sample-10
1 36359637 36359772 NM_001317122_cds_5_0_chr1_36359638_f 686 315.548 Sample-10
1 36359915 36360003 NM_001317122_cds_6_0_chr1_36359916_f 669 357.909
...
bedtools:
"bedtools coverage -mean -a bed/study_genes.bed -b {input} > coverage/study/interval_coverage/{wildcards.sample}_interval_coverage.txt"
chr start end gene coverage sample
1 36349022 36349047 NM_001317122_cds_0_0_chr1_36349023_f 43840.9609375 Sample-10
1 36354027 36354211 NM_001317122_cds_1_0_chr1_36354028_f 43905.3320312 Sample-10
1 36358157 36358278 NM_001317122_cds_2_0_chr1_36358158_f 26675.0253906 Sample-10
1 36358697 36358879 NM_001317122_cds_3_0_chr1_36358698_f 32416.6210938 Sample-10
1 36359274 36359411 NM_001317122_cds_4_0_chr1_36359275_f 54923.9648438 Sample-10
1 36359637 36359772 NM_001317122_cds_5_0_chr1_36359638_f 35807.59375 Sample-10
1 36359915 36360003 NM_001317122_cds_6_0_chr1_36359916_f 29420.7265625 Sample-10
mosdepth:
mosdepth -n --by bed/study_genes.bed coverage/study_mosdepth/interval_coverage/{wildcards.sample}-interval {input}
gzip -dc coverage/study_mosdepth/interval_coverage/{wildcards.sample}-interval.regions.bed.gz > {output.interval_coverage}
chr start end gene coverage sample
1 36349022 36349047 NM_001317122_cds_0_0_chr1_36349023_f 289.08 Sample-10
1 36349022 36349047 NM_012199_cds_0_0_chr1_36349023_f 289.08 Sample-10
1 36354027 36354211 NM_001317122_cds_1_0_chr1_36354028_f 287.42 Sample-10
1 36354027 36354211 NM_012199_cds_1_0_chr1_36354028_f 287.42 Sample-10
1 36358157 36358278 NM_001317122_cds_2_0_chr1_36358158_f 277.69 Sample-10
1 36358157 36358278 NM_012199_cds_2_0_chr1_36358158_f 277.69 Sample-10
1 36358173 36358278 NM_001317123_cds_2_0_chr1_36358174_f 278.95 Sample-10
1 36358697 36358879 NM_001317122_cds_3_0_chr1_36358698_f 283.55 Sample-10
...
Hi Isaac, if you look through the issues, there are a lot of questions like this. I think that mosdepth does a good job of giving a sane answer. Reasons why the tools can differ:
- mosdepth does not look at base quality, so it counts every aligned base as covered even if its quality is very low.
- mosdepth has different defaults for mapping quality; I think by default it includes all reads.
- mosdepth does not double-count overlapping pairs. If r1 and r2 from a fragment overlap, it counts the overlapped bases once, not twice. You can turn this correction off with --fast-mode.
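To make the overlapping-pairs point concrete, here is a minimal toy sketch in Python. It is not mosdepth's implementation; the function name `mean_depth` and the coordinates are made up for illustration. It only shows how counting per read versus per fragment changes the mean depth over a region when the two mates overlap.

```python
from collections import Counter


def mean_depth(fragments, region, merge_mates):
    """Mean depth over a half-open region [start, end).

    fragments: list of (r1_start, r1_end, r2_start, r2_end), half-open.
    merge_mates=True counts each position once per fragment;
    merge_mates=False counts it once per read, so overlaps count twice.
    """
    start, end = region
    depth = Counter()
    for r1s, r1e, r2s, r2e in fragments:
        if merge_mates:
            # Union of the two mates: overlapped bases are counted once.
            for pos in set(range(r1s, r1e)) | set(range(r2s, r2e)):
                depth[pos] += 1
        else:
            # Each mate is counted independently.
            for pos in range(r1s, r1e):
                depth[pos] += 1
            for pos in range(r2s, r2e):
                depth[pos] += 1
    return sum(depth[p] for p in range(start, end)) / (end - start)


# Hypothetical fragment: two 50 bp mates overlapping on [120, 150).
fragments = [(100, 150, 120, 170)]
region = (100, 170)

per_fragment = mean_depth(fragments, region, merge_mates=True)
per_read = mean_depth(fragments, region, merge_mates=False)
print(per_fragment, per_read)
```

With short inserts, as is common in panel data, most pairs overlap, so the per-read number can be noticeably higher than the per-fragment one.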
I suggest trying mosdepth with different mapping-quality values that make sense to you, and trying --fast-mode, to see how much difference each makes. I'm not sure how bedtools is getting so much higher coverage, but I suspect you'll get mosdepth and sambamba to nearly agree with --fast-mode.
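One way to run that comparison systematically is to generate a small grid of mosdepth command lines. The sketch below only builds and prints the commands. The helper name `mosdepth_commands`, the BAM filename, the output prefix, and the chosen mapping-quality values are all assumptions for illustration; the `-n`, `--by`, `--mapq`, and `--fast-mode` flags are the ones mosdepth documents.

```python
import shlex
from itertools import product


def mosdepth_commands(bam, bed, prefix, mapq_values=(0, 1, 20), fast_modes=(False, True)):
    """Build one mosdepth argv list per (mapq, fast-mode) combination."""
    commands = []
    for mapq, fast in product(mapq_values, fast_modes):
        # Encode the settings in the output prefix so results stay separate.
        out = f"{prefix}.mapq{mapq}.{'fast' if fast else 'default'}"
        cmd = ["mosdepth", "-n", "--by", bed, "--mapq", str(mapq)]
        if fast:
            cmd.append("--fast-mode")
        cmd += [out, bam]
        commands.append(cmd)
    return commands


# Hypothetical paths; substitute your own BAM and output prefix.
for cmd in mosdepth_commands("Sample-10.bam", "bed/study_genes.bed", "coverage/Sample-10"):
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell or a Snakemake rule; comparing the resulting `*.regions.bed.gz` files side by side shows which setting accounts for most of the gap.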