38 / d4-format

The D4 Quantitative Data Format

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

create fails on some cram/bams

petersudmant opened this issue · comments

I've noticed that some crams/bams seem to result in empty d4 files. I'm not quite sure why. An example are the HGDP cram files (e.g. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGDP/data/Yi/HGDP01180/alignment/HGDP01180.alt_bwamem_GRCh38DH.20181023.Yi.cram).

I can even subset these to smaller bams (say including just a small portion of chr1) and creation still fails. Let me know if you'd like me to post one of these small bams.

Thanks for reporting the issue. If you can provide a smaller bam that would be great! Thanks!

Here's an example bam: https://drive.google.com/drive/folders/1qKVTi4xnFrB9a3GXqGuWeXKaficFGMSR?usp=sharing

This was created by taking the aforementioned cram (see previous post) and running

samtools view -b -T GRCh38_full_analysis_set_plus_decoy_hla.fa HGDP01180.altbwamem_GRCh38DH.20181023.Yi.cram chr1:1-5000000 >example.bam
samtools index example.bam
d4tools create example.bam example.d4

the latest release d4-format-d4tools-v0.3.7 was used

Hi, just checked the BAM file, it seems the alignments mapping qualities are blow the default minimal mapping quality threshold.

You can add -q 0 to the command arguments you run d4.

$ d4tools stat example.d4 | grep -v '0$'
$ d4tools create example.bam -q0
$ d4tools stat example.d4 | grep -v '0$'
chr1    0       248956422       0.8444072031208739

Note if not specified the default min mapping quality is set to 60.

Ahhhhh -> the max quality in all these crams is 20 because it was capped. Thanks much. That was driving me nuts.

Would be helpful if the --help noted that default was set

This same thing has bit me a couple times. My default assumption about how a tool like this should behave is that it would do no filtering unless asked. q60 is a lot more stringent than I would expect for a default besides. Compare to other tools like Deeptools bamCoverage which do as few manipulations to the input data by default, allowing the user to specify additional stringency.

If the default isn't adjusted (which might be too breaking at this point, I'm not sure), please do make this very clear in the help docs since it is not obvious in the current state:

-q, --mapping-qual <mapping-qual>
            The minimal mapping quality (Only valid with CRAM/BAM inputs)

I agree. I know it would be breaking, but I think it makes more sense to default to zero (filter nothing) and make it clear in the release. @38 - what do you think?

Still seems to be a bug here, and something of a tricky one. Here's a tiny bam example:

https://drive.google.com/drive/folders/1qKVTi4xnFrB9a3GXqGuWeXKaficFGMSR?usp=sharing

This bam includes reads of various mapqs. Now, make a d4 file from this bam using -q0 or -q60. The results show that only the reads prior to the first q60 read are filtered out when using -q60 (I think). The d4 files are essentially identical. Now instead filter the bam w/ samtools for only -q60 reads and make a d4 file. The results are wildly different. See below. The bug is tricky to figure out in the code...

#start w/ test.0.bam test.0.bai

d4tools={path_to_d4_tools}
samtools view -b -q60 test.all.bam > test.q60.bam
samtools index test.q60.bam

$d4tools create -z -q 60 test.all.bam test.all.q60.d4
$d4tools create -z -q 0 test.all.bam test.all.q0.d4
$d4tools create -z -q 0 test.q60.bam test.q60.q0.d4

$d4tools  view test.all.q60.d4 chr1 >view.all.q60
$d4tools  view test.all.q0.d4 chr1 >view.all.q0
$d4tools  view test.q60.q0.d4 chr1 >view.60.q0

diff view.all.q60 view.all.q0
diff view.all.q0 view.60.q0

Hi @petersudmant, thanks for reporting this bug. Just pushed a fix to the master branch, it would be really helpful if you can confirm the fix actually works.

will do a bit later today! I noticed that && =). But, it actually seems to work fine w/ & interestingly. I'll let u know!

Given the critical nature of this fix, I think we need a new 0.4.0 release for the CLI and pyd4.

My tests suggest this fix worked (though, I have only run on a couple of examples). Thanks so much! I'll run some more testing over the next 36 hours. To be sure, I concur w/ @arq5x - this is a massive change in the outputs of the d4 files. Thanks so much for being so responsive w/ these fixes!