Occasionally large differences in coverage for same gene region in same sample
Jakob37 opened this issue
Hello dear d4tools devs. Thank you for a very useful tool!
We are looking into using the d4 format as a basis for coverage calculations for patient samples, through the tool Chanjo2 (https://github.com/Clinical-Genomics/chanjo2) currently maintained by @northwestwitch.
Testing this out, I stumbled upon some strange numbers when investigating real samples. Some seemed to have very high overall coverage numbers, ~1000x, where 30-50x is expected.
I investigated this for one sample where I received 800x coverage. Running the genes directly through d4tools revealed some genes with huge coverage:
$ d4tools stat --region intervals.bed <sample>_coverage.d4 | sort -k4,4 -n -r
19 49635292 49640143 30268.71325499897
10 119651380 119677819 12509.91527667461
11 19182030 19210571 6439.562664237413
3 14125015 14143680 3130.0193410125903
1 77888513 77943895 2294.566086454083
15 38252836 38357249 1283.3227088580925
15 66386837 66491656 1249.3048111506503
12 112418351 112509918 665.9010123734533
18 36297713 36780220 405.1971349638451
...
Looking manually in IGV it looked more like 30-40x coverage.
Running only that specific region, I get more reasonable coverage. But this is really strange, isn't it? That we get hugely different coverage for the same region depending on whether one or several regions are run.
$ d4tools stat --region gene_19.bed <sample>_coverage.d4
19 49635292 49640143 39.8051948051948
I looked closer at this and found that when running a bed file with this region and others, it tended to report ~30,000x coverage, but when running it by itself, it got ~40x.
The d4 file was generated using the default command:
d4tools create --threads 10 <bam> <d4>
Is this a bug or am I missing something here?
I'd be thankful for any input. Let me know if I can help in any way by providing more information. This is the final blocking question mark before we start using d4tools in practice, so it is quite urgent for us.
I'll continue investigating this. Unfortunately I cannot share the datasets I have used so far, but I will check whether the issue is still there when running a GIAB sample.
More context: I am running d4tools version 0.3.9 through a Singularity container, but got identical results when running it through Chanjo2.
Generating the d4 file in sparse mode using the same BAM file I get a similar pattern.
$ d4tools create --sparse --threads 5 <bam> d4/sparse.d4
$ d4tools stat --region bed/intervals.bed d4/sparse.d4 > out.tsv
$ cat out.tsv | sort -k4,4 -n -r | head
19 49635292 49640143 39125.30612244898
10 119651380 119677819 16313.91516320587
11 19182030 19210571 8398.013699590063
3 14125015 14143680 4069.0358424859364
1 77888513 77943895 3010.7529522227437
15 38252836 38357249 1686.8336701368603
15 66386837 66491656 1627.4914948625726
12 112418351 112509918 868.0358207651228
18 36297713 36780220 532.3012909657269
...
A single gene gives a lower number, which also differs from the non-sparse run:
$ d4tools stat --region bed/gene_19.bed d4/sparse.d4 | head
19 49635292 49640143 48.781488352916924
OK, I think I understand some more now, after poking around in the code. It looks like it occasionally accumulates all reads from an upstream region down to the end of the target region.
I.e. I am working with these regions:
$ cat data/intervals_19_only.bed
19 45769709 45782552
19 4090321 4124122
19 49635292 49640143
19 55151767 55157773
The given value for the 49635292-49640143 range is actually the nucleotides collected over 45769709-49640143, divided by the number of nucleotides between 49635292 and 49640143.
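A quick back-of-the-envelope check of this (my own arithmetic, not d4 code): rescaling the inflated 30268.7x reported for 19:49635292-49640143, as if the underlying sum actually covered 45769709-49640143, lands close to the ~39.8x seen when the region is queried alone:

```rust
/// Implied per-base coverage if the inflated mean really came from a
/// sum over [accum_start, target_end) that was divided by the target
/// length only. (Back-of-the-envelope check, not d4's actual code.)
fn implied_coverage(accum_start: f64, target_start: f64, target_end: f64, inflated_mean: f64) -> f64 {
    // Total bases implied by the inflated mean over the target region...
    let total_bases = inflated_mean * (target_end - target_start);
    // ...spread back over the whole suspected accumulation span.
    total_bases / (target_end - accum_start)
}

fn main() {
    // Numbers from the report above: inflated mean for 19:49635292-49640143,
    // with accumulation suspected to start at 45769709.
    let c = implied_coverage(45_769_709.0, 49_635_292.0, 49_640_143.0, 30_268.713_254_998_97);
    println!("implied per-base coverage: {c:.1}x"); // ~37.9x, near the 39.8x single-region value
}
```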
Seems like the self.sum value in the TaskPartition impl (in means.rs) isn't cleared between runs, and it is also somehow accumulating over the gap between the regions.
Looks like it might have to do with how ranges are assigned inside track.rs, here:
d4-format/d4/src/d4file/track.rs
Lines 72 to 75 in 945b36b
It seems the this_end variable can be assigned the end of the previous interval, leading to accumulative counting starting from there rather than from the beginning of the most recent interval.
After updating it as follows, starting the accumulation from the beginning of the range instead, I get sane values:

```rust
if this_end != last_end {
    // Accumulate from the start of the current range, not from the
    // end of the previous interval.
    let this_begin = top.get_range().0;
    func(this_begin, this_end, &mut active_heap);
    last_end = this_end;
}
```
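To show the failure mode in isolation, here is a toy sketch (my own simplified model, not d4's actual code) of a cursor-based interval mean that sums from last_end instead of each interval's own start:

```rust
/// Mean depth per query interval over a per-base depth array.
/// With `buggy = true`, summation starts at the previous interval's end
/// (the `last_end` cursor) instead of the current interval's start.
fn interval_means(depth: &[u32], intervals: &[(usize, usize)], buggy: bool) -> Vec<f64> {
    let mut last_end = 0usize;
    let mut means = Vec::new();
    for &(begin, end) in intervals {
        let from = if buggy { last_end } else { begin };
        let sum: u32 = depth[from..end].iter().sum();
        // The divisor is always the interval's own length, so any extra
        // bases swept in from upstream inflate the reported mean.
        means.push(sum as f64 / (end - begin) as f64);
        last_end = end;
    }
    means
}

fn main() {
    let depth = vec![10u32; 1_000]; // uniform 10x coverage
    let intervals = [(0, 100), (900, 1_000)];
    println!("{:?}", interval_means(&depth, &intervals, false)); // [10.0, 10.0]
    println!("{:?}", interval_means(&depth, &intervals, true)); // [10.0, 90.0] -- inflated
}
```

The second interval's mean is inflated by exactly the bases between the two intervals, matching the pattern observed above.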
Comparing runs with and without this change, I get the same coverage for some intervals, while others differ. I don't understand yet what is going on there 🤔 Guessing the current code might be like this for a reason though. Would like some input on this from the devs if possible 🙏 @38 @arq5x
A final note for this week. I could reproduce the pattern using a GIAB sample, making me think it has more to do with the pattern in the region file than with the content of the d4 file.
My bed-file:
19 45769709 45782552
19 4090321 4124122
19 49635292 49640143
19 55151767 55157773
Running all together. Note the third coverage value of 30000+.
$ d4 stat --region bed/intervals_only_19.bed hg002_coverage.d4
19 4090321 4124122 47.764267329368955
19 45769709 45782552 45.254146227516934
19 49635292 49640143 34655.42465471037
19 55151767 55157773 37.25041625041625
Running them one by one. Note that three values are identical to the previous run.
$ time for interval in bed/intervals_only_19/*.bed; do d4 stat --region ${interval} hg002_coverage.d4; done
19 4090321 4124122 47.764267329368955
19 45769709 45782552 45.254146227516934
19 49635292 49640143 46.97567511853226
19 55151767 55157773 37.25041625041625
@Jakob37 , thanks for the clear report. I'll have a look at this next week, but based on what you've found, could you do the same change after the while loop on lines 84-87 (now 85-88 in your code)?
Sounds great, thanks @brentp ! I am away from computer now, will continue to look at this and your suggestion when back on Tuesday
I tried adding the change also to the second func call after the while loop. But when running my test examples, that part of the code is never triggered, i.e. active_heap is always empty when reaching that point, so it does not make any difference in my case.
Let me know if I can provide some other info to help you out here.
Looking into why this issue only happens sometimes.
Looks like the data is partitioned and the calculations are performed on each partition separately (presumably for threading). When ranges are processed in different partitions, there is no issue, but when multiple ranges are processed in the same partition, the issue mentioned above is triggered.
Here, the first range ends just below 40,000,000. The coverage result of the second range looks sane.
19 30000000 39999999 49.76921967692197
19 49624990 49625000 39.2
19 49625010 49625020 74.5
Shifting it by just one base to an even 40,000,000, it looks insane:
19 30000000 40000000 49.7692204
19 49624990 49625000 47230219.2
19 49625010 49625020 74.5
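One plausible mechanism for this boundary sensitivity (purely my speculation; the 10 Mbp partition width here is an assumed number, not taken from the d4 source) is that a query's partition is derived from its end coordinate with a fixed partition width, so extending the end by a single base moves the query into the next partition, the same one holding the 49.6 Mb regions:

```rust
/// Hypothetical partition assignment: fixed-width bins indexed by
/// coordinate. `width` is an assumed value for illustration only.
fn partition_of(pos: u64, width: u64) -> u64 {
    pos / width
}

fn main() {
    let width = 10_000_000; // assumed partition width
    println!("{}", partition_of(39_999_999, width)); // 3
    println!("{}", partition_of(40_000_000, width)); // 4
    println!("{}", partition_of(49_624_990, width)); // 4 -- same partition as above
}
```

Under this assumption, the two sane runs keep the ranges in separate partitions, while the shifted run co-locates them and triggers the last_end accumulation bug.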
@brentp have you been able to take a look at this?
I am considering putting together a PR with the minimal changes I made, but would be great to have someone with more contextual understanding of this code base look at this.
This should be fixed by #79 and included in version 0.3.10.