38 / d4-format

The D4 Quantitative Data Format

Odd results from stat command

arq5x opened this issue

Consider the following sparse D4 file created from an ENCODE bigwig file:

wget https://www.encodeproject.org/files/ENCFF405ZDL/@@download/ENCFF405ZDL.bigWig
time d4utils create -S ENCFF405ZDL.bigWig ENCFF405ZDL.bigwig.d4

Now, create 10 random 100 bp intervals in BED format on which to compute statistics:

bedtools random -n 10 -l 100 -g human.hg38.genome | sort -k1,1 -k2,2n | cut -f 1-3 > test.bed
cat test.bed
chr1	31636177	31636277
chr13	41648360	41648460
chr13	90235345	90235445
chr14	106470082	106470182
chr15	54171880	54171980
chr16	1337049	1337149
chr20	60706629	60706729
chr4	19356795	19356895
chr6_GL000252v2_alt	3355374	3355474
chrX	151995155	151995255

Now, run stat on those regions:

d4tools stat ENCFF405ZDL.bigwig.d4 --stat mean --region test.bed
chr1	31636177	31636277	42949651.4
chr13	41648360	41648460	42949660.07
chr13	90235345	90235445	42949672.68
chr14	106470082	106470182	0
chr15	54171880	54171980	42949631.28
chr16	1337049	1337149	42949658.92
chr20	60706629	60706729	42949630.46
chr4	19356795	19356895	42949550.23
chr6_GL000252v2_alt	3355374	3355474	0
chrX	151995155	151995255	42949671.58

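It looks like there is some sort of overflow or underflow issue with several of the reported mean values: they are all close to 42949672, even though most of these regions should have little or no signal. For example, let's look at the exact depths for one of those 100 bp regions using view: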

d4tools view ENCFF405ZDL.bigwig.d4 chr1:31636177-31636277
chr1	31636176	31636276	0
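
One observation that may help narrow this down: 2^32 / 100 = 42949672.96, and every nonzero mean reported above sits just below that value. The sketch below multiplies each reported mean by the 100 bp region length to recover the implied sum, then reinterprets that sum as a signed 32-bit integer. It is only a consistency check on the numbers printed above, not a confirmed diagnosis of the D4 code, and the helper names (`implied_sum`, `as_signed_32`) are made up for illustration.

```python
# Nonzero means reported by `d4tools stat` on the sparse file (copied from the output above).
reported_means = [
    42949651.4, 42949660.07, 42949672.68, 42949631.28,
    42949658.92, 42949630.46, 42949550.23, 42949671.58,
]

REGION_LEN = 100   # every interval in test.bed is 100 bp long
U32 = 2 ** 32      # 4294967296


def implied_sum(mean, length=REGION_LEN):
    """Recover the integer sum that would produce this mean over `length` bases."""
    return round(mean * length)


def as_signed_32(value):
    """Reinterpret an unsigned 32-bit value as a two's-complement signed integer."""
    return value - U32 if value >= 2 ** 31 else value


for mean in reported_means:
    total = implied_sum(mean)
    print(f"{mean}\tsum={total}\tas_i32={as_signed_32(total)}")
```

Every implied sum falls slightly below 2^32, so each one maps to a small negative number when read as a signed 32-bit value. That pattern would be expected if a negative running sum were stored in, or cast to, an unsigned 32-bit integer somewhere in the sparse code path, but that is an assumption on my part.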

The problem disappears when the D4 file is created in dense mode (that is, without -S):

d4utils create ENCFF405ZDL.bigWig ENCFF405ZDL.bigwig.d4

d4tools stat ENCFF405ZDL.bigwig.d4 --stat mean --region test.bed
chr1	31636177	31636277	0
chr13	41648360	41648460	0.17
chr13	90235345	90235445	0
chr14	106470082	106470182	0
chr15	54171880	54171980	0
chr16	1337049	1337149	0
chr20	60706629	60706729	0
chr4	19356795	19356895	0
chr6_GL000252v2_alt	3355374	3355474	0
chrX	151995155	151995255	0
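
To make the sparse/dense comparison repeatable, here is a minimal sketch that parses two `stat` outputs and lists the regions where the means disagree. It assumes the four-column, whitespace-separated layout shown above (chrom, start, end, mean); `parse_stat` and `disagreements` are hypothetical helper names, and only three of the ten rows are embedded to keep it short.

```python
def parse_stat(text):
    """Parse `stat` output lines (chrom, start, end, mean) into a dict keyed by region."""
    rows = {}
    for line in text.strip().splitlines():
        chrom, start, end, mean = line.split()
        rows[(chrom, int(start), int(end))] = float(mean)
    return rows


def disagreements(sparse, dense, tol=1e-6):
    """Return (region, sparse_mean, dense_mean) for regions whose means differ by more than tol."""
    return [
        (region, sparse[region], dense[region])
        for region in sparse
        if region in dense and abs(sparse[region] - dense[region]) > tol
    ]


# Three rows copied from the sparse and dense outputs above.
sparse_out = """
chr1 31636177 31636277 42949651.4
chr13 41648360 41648460 42949660.07
chr14 106470082 106470182 0
"""
dense_out = """
chr1 31636177 31636277 0
chr13 41648360 41648460 0.17
chr14 106470082 106470182 0
"""

for region, s, d in disagreements(parse_stat(sparse_out), parse_stat(dense_out)):
    print(region, "sparse:", s, "dense:", d)
```

In practice the two strings would be replaced by the captured output of the two `d4tools stat` runs.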