38 / d4-format

The D4 Quantitative Data Format

[enhancement] stat for a region/list of regions

darked89 opened this issue

Hello,

Would it be possible to implement simple median coverage stats for a selected region?
While one can get the data using e.g. d4tools view input.bam 22:39349000-39349100,
some scripting step is still required.

Having such a feature would make looping over putative ChIP-Seq peaks much easier.

Best wishes,

Darek Kedra

Hi Darek,

This is possible with the stat command:

d4tools stat -s median -r your.bed your.d4

If you want just one region, you can use UNIX echo -e to create a BED file "on the fly":

d4tools stat -s median -r <(echo -e "chr22\t39349000\t39349100") your.d4
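For looping over many regions from a script, the same idea can be sketched in Python. This is only an illustration: the helper names region_to_bed_line and stat_command are hypothetical, the coordinates are copied through unchanged (as in the echo example above), and the command is only assembled from the flags quoted in this reply, not executed.

```python
import re


def region_to_bed_line(region: str) -> str:
    """Turn 'chrom:start-end' into a tab-separated BED line.

    Assumption: coordinates are copied as-is, mirroring the echo example.
    """
    match = re.fullmatch(r"([^:]+):(\d+)-(\d+)", region)
    if match is None:
        raise ValueError(f"not a chrom:start-end region: {region!r}")
    chrom, start, end = match.groups()
    return f"{chrom}\t{start}\t{end}"


def stat_command(bed_path: str, d4_path: str, stat: str = "median") -> list:
    """Build the argument list for the d4tools stat call; does not run it."""
    return ["d4tools", "stat", "-s", stat, "-r", bed_path, d4_path]


print(region_to_bed_line("chr22:39349000-39349100"))
print(stat_command("your.bed", "your.d4"))
```

Writing one BED line per peak to a single file and calling the stat command once is likely simpler than invoking it once per region.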

Hello,

Sorry for the delay. I have found some unintuitive output from d4tools and pyd4, see below.
In short: how come the mean coverage value for a 100 bp interval is reported as 0 when there are 39 non-zero values?

Many thanks for your help

Darek Kedra

  • d4tools
    • stat
d4tools stat -s mean -r <(echo -e "chr1\t762650\t762750") test.d4
chr1    762650  762750  0
    • view
d4tools view test.d4 chr1:762650-762750
chr1    762650  762693  1
chr1    762693  762697  0
chr1    762697  762748  1
chr1    762748  762750  0
  • pyd4
interval: ('chr1', 751950, 752050)
mean_val_bin: [0.0]  
bin_values = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  
bin_vals_sum = 39  

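The snippet of Python code:

import pyd4

# Assumed setup (not part of the original snippet): file name and interval
# are taken from the output above.
chip_fh = pyd4.D4File("test.d4")
chrom, bin_start, bin_end = "chr1", 751950, 752050

chrom_interval = (chrom, bin_start, bin_end)

# Mean depth over the interval as computed by pyd4
mean_val_bin = chip_fh.mean([chrom_interval])
print(f"mean_val_bin: {mean_val_bin}")

# enumerate_values yields (chrom, pos, value); keep only the per-base value
bin_values = [x[2] for x in pyd4.enumerate_values(chip_fh, chrom, bin_start, bin_end)]
bin_sum = sum(bin_values)

print(bin_values)
print(bin_sum)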

Same region from the original BAM:

samtools view ../data_timepoint_2/12_2_C_04873AAD_AACAGGTT-CTTGGTAT_R1_001_all.sorted.rmdup.bam  chr1:762650-762750 

VH00658:3:AAALLTMHV:1:2404:63279:6774   16      chr1    762643  1       50M     *       0       0       GACAGGGGCGACCTCAGTGACGGAACCGGACACAGACGCAGATCTGGCAG      CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC      AS:i:100        XS:i:100        XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YT:Z:UU
VH00658:3:AAALLTMHV:1:1611:27282:18531  0       chr1    762698  1       50M     *       0       0       CGACAGGCTTCGGAGCATTTCCGGGCGTCGCGGGACTCCCCGCCGACAGG      CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC      AS:i:100        XS:i:100        XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YT:Z:UU

@38 can you look into why the mean above is 0 and not 0.5, and explain the difference in behavior?
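As a sanity check on what a non-zero mean should look like, the numbers quoted in the report can be worked through by hand. The sketch below is plain Python with no pyd4 dependency; the helper name weighted_mean is hypothetical. It computes the length-weighted mean of the four d4tools view rows for chr1:762650-762750, and the plain mean implied by the pyd4 values (sum 39 over 100 positions). Note that the pyd4 output quoted above is for a different interval (chr1:751950-752050), so the two figures are not expected to match each other, but neither is 0.

```python
# Rows copied from the `d4tools view` output quoted in the report.
view_rows = """\
chr1    762650  762693  1
chr1    762693  762697  0
chr1    762697  762748  1
chr1    762748  762750  0
"""


def weighted_mean(rows: str) -> float:
    """Length-weighted mean depth over run-length rows: chrom start end value."""
    total = 0   # sum of depth over all bases
    length = 0  # number of bases covered by the rows
    for line in rows.splitlines():
        _chrom, start, end, value = line.split()
        span = int(end) - int(start)
        total += span * int(value)
        length += span
    return total / length


print(weighted_mean(view_rows))  # 94 bases at depth 1 out of 100
print(39 / 100)                  # mean implied by the pyd4 per-base values
```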

Hello,

In case it is needed I can share the d4 file/minimal example giving that result. Or a mini-bam file.

  • d4_tools version: D4 Utilities Program 0.3.4
  • installed: conda install -c bioconda d4tools

Hope it may help,

Darek Kedra

Hi @darked89, I believe this is a duplicate of issue #54 and is already fixed in the latest release.

I believe d4tools on bioconda has already been updated; please check whether upgrading to the latest version resolves this issue.