[enhancement] stat for a region/list of regions
darked89 opened this issue
Hello,
Would it be possible to implement simple median coverage stats for a selected region?
While one can get the data using, e.g., d4tools view input.bam 22:39349000-39349100
some scripting is still required to turn that output into a summary statistic.
Having such a feature would make looping over putative ChIP-Seq peaks much easier.
Best wishes,
Darek Kedra
Hi Darek,
This is possible with the stat command:
d4tools stat -s median -r your.bed your.d4
If you want just one region, you can use UNIX echo -e to create a BED file "on the fly":
d4tools stat -s median -r <(echo -e "chr22\t39349000\t39349100") your.d4
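For looping over many peaks from Python, something along these lines might work. It is only a minimal sketch: it assumes d4tools is on PATH and reuses the stat flags from the command above. The helper names regions_to_bed and d4_stat are made up for illustration and are not part of d4tools or pyd4.

```python
import subprocess
import tempfile


def regions_to_bed(regions):
    """Format (chrom, start, end) tuples as tab-separated BED lines."""
    return "".join(f"{chrom}\t{start}\t{end}\n" for chrom, start, end in regions)


def d4_stat(d4_path, regions, stat="median"):
    """Run `d4tools stat -s <stat> -r <bed> <d4>` and return its stdout.

    Hypothetical helper: assumes d4tools is installed and on PATH.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".bed") as bed:
        bed.write(regions_to_bed(regions))
        bed.flush()  # make sure the BED file is fully written before d4tools reads it
        cmd = ["d4tools", "stat", "-s", stat, "-r", bed.name, d4_path]
        return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


# Example region taken from the question above.
peaks = [("chr22", 39349000, 39349100)]
print(regions_to_bed(peaks), end="")
```

Calling d4_stat("your.d4", peaks) would then return whatever d4tools stat prints for those regions, one line per region.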
Hello,
Sorry for the delay. I have found some non-intuitive output from d4tools and pyd4; see below.
In short: how come the mean coverage for a 100 bp interval is reported as 0 when it contains 39 non-zero values?
Many thanks for your help
Darek Kedra
- d4tools
- stat
d4tools stat -s mean -r <(echo -e "chr1\t762650\t762750") test.d4
chr1 762650 762750 0
- view
d4tools view test.d4 chr1:762650-762750
chr1 762650 762693 1
chr1 762693 762697 0
chr1 762697 762748 1
chr1 762748 762750 0
- pyd4
interval: ('chr1', 751950, 752050)
mean_val_bin: [0.0]
bin_values = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
bin_vals_sum = 39
The snippet of Python code:
import pyd4

# chip_fh is an already opened pyd4.D4File; chrom, bin_start and bin_end define the interval
chrom_interval = (chrom, bin_start, bin_end)
mean_val_bin = chip_fh.mean([chrom_interval])
print(f"mean_val_bin: {mean_val_bin}")
# enumerate_values yields (chrom, pos, value) tuples; keep only the coverage value
bin_values = [x[2] for x in pyd4.enumerate_values(chip_fh, chrom, bin_start, bin_end)]
bin_sum = sum(bin_values)
print(bin_values)
print(bin_sum)
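As a cross-check that depends on neither d4tools stat nor the pyd4 mean call, the expected mean can be recomputed by hand from the intervals printed by d4tools view above. This is a minimal sketch in plain Python. It assumes the intervals are half-open (BED-style), which fits their widths adding up to the 100 bp window. The helper name expand_intervals is made up for illustration.

```python
from statistics import median


def expand_intervals(intervals):
    """Expand half-open (start, end, value) intervals into per-base values."""
    values = []
    for start, end, value in intervals:
        values.extend([value] * (end - start))
    return values


# Rows copied from the `d4tools view test.d4 chr1:762650-762750` output above.
view_rows = [
    (762650, 762693, 1),
    (762693, 762697, 0),
    (762697, 762748, 1),
    (762748, 762750, 0),
]

per_base = expand_intervals(view_rows)
covered = sum(per_base)              # number of bases with coverage 1
mean_cov = covered / len(per_base)   # per-base mean over the window
median_cov = median(per_base)

print(len(per_base), covered, mean_cov, median_cov)
```

For these four rows the window has 100 bases, 94 of them with coverage 1, so a per-base mean of 0 is not what the view output would suggest.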
Same region from the original BAM:
samtools view ../data_timepoint_2/12_2_C_04873AAD_AACAGGTT-CTTGGTAT_R1_001_all.sorted.rmdup.bam chr1:762650-762750
VH00658:3:AAALLTMHV:1:2404:63279:6774 16 chr1 762643 1 50M * 0 0 GACAGGGGCGACCTCAGTGACGGAACCGGACACAGACGCAGATCTGGCAG CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC AS:i:100 XS:i:100 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU
VH00658:3:AAALLTMHV:1:1611:27282:18531 0 chr1 762698 1 50M * 0 0 CGACAGGCTTCGGAGCATTTCCGGGCGTCGCGGGACTCCCCGCCGACAGG CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC AS:i:100 XS:i:100 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU
@38 can you look into why the mean above is 0 and not 0.5, and explain the differences in behavior?
Hello,
In case it is needed, I can share the d4 file or a minimal example giving that result, or a mini BAM file.
- d4_tools version: D4 Utilities Program 0.3.4
- installed:
conda install -c bioconda d4tools
Hope it may help,
Darek Kedra