clwgg / nQuire

A statistical framework for ploidy estimation using NGS short-read data

Could you tell me the format of the binary file that nQuire create makes?

Sandman2127 opened this issue

Hello clwgg,
Really cool tool you've designed here. This isn't an issue so much as an extension question:

I want to extend this program so that it can also use data from a VCF file. The reason is that I have de novo GBS calls from UNEAK, which naturally don't have BAMs because no alignment was ever performed. But I see no reason we couldn't use the read counts for each allele at heterozygous sites from a de novo VCF instead of a BAM. I could build in all the filters you have as well.
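As a rough illustration of the idea above, the following Python sketch pulls per-allele read counts out of heterozygous calls in a VCF. It assumes the VCF carries `GT` and `AD` (allelic depth) entries in its FORMAT column, which not every caller emits. The function name `het_allele_counts`, the `sample_index` parameter, and the example records are all made up for illustration. Sorting the two counts in ascending order is only a guess based on the `nQuire view` lines quoted in this post, and none of nQuire's own filters are reproduced here.

```python
def het_allele_counts(vcf_lines, sample_index=0):
    """Yield (total, low, high) read counts for heterozygous diploid calls.

    Assumes FORMAT contains GT and AD; records without them are skipped.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 + sample_index:
            continue  # no sample column at the requested index
        keys = fields[8].split(":")
        values = fields[9 + sample_index].split(":")
        sample = dict(zip(keys, values))
        genotype = sample.get("GT", "").replace("|", "/").split("/")
        # keep only called, two-allele, heterozygous genotypes
        if len(genotype) != 2 or "." in genotype or genotype[0] == genotype[1]:
            continue
        try:
            depths = [int(x) for x in sample["AD"].split(",")]
            first = depths[int(genotype[0])]
            second = depths[int(genotype[1])]
        except (KeyError, ValueError, IndexError):
            continue  # AD missing or malformed for this call
        low, high = sorted((first, second))
        yield (low + high, low, high)


# Made-up records, for illustration only.
example = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
    "TP1\t10\t.\tA\tG\t.\tPASS\t.\tGT:AD:DP\t0/1:8,5:13",
    "TP2\t22\t.\tC\tT\t.\tPASS\t.\tGT:AD:DP\t1/1:0,12:12",
    "TP3\t31\t.\tG\tA\t.\tPASS\t.\tGT:AD:DP\t./.:0,0:0",
]

for total, low, high in het_allele_counts(example):
    print(f"{total}\t{low}\t{high}")
```

Only the first record is heterozygous, so only it is printed, as total depth followed by the two allele counts.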

When I was testing I saw that by running
nQuire view test.bin

I was able to find this format:
13 5 8
12 3 9
14 4 10
22 4 18
10 4 6
...

My issue is that I cannot tell whether you've rearranged the output to look like this, or whether the data is truly stored in that form, just in binary. I've been trying, to no avail, to convert my .bin file back to standard UTF-8 text using Python. So far all I can get is numbers that seem very out of order and don't end at the correct newlines:

[2, 0, 4, 0, 6, 0, 19, 2, 0, 3, 0, 16, 0, 19, 2, 0, 8, 0, 11, 0, 17, 2, 0, 3, 0, 14, 0, 17, 2, 0, 4, 0, 13, 0, 17, 2, 0, 8, 0, 9, 0, 16, 2, 0, 4, 0, 12, 0, 13, 2, 0, 2, 0, 11, 0, 13, 2, 0, 4, 0, 9, 0, 11, 2, 0, 5, 0, 6, 0, 10]
[2, 0, 3, 0, 7, 0, 10]
[2, 0, 4, 0, 6, 0, 11, 2, 0, 5, 0, 6, 0, 11, 2, 0, 2, 0, 9, 0, 10]
[2, 0, 3, 0, 7, 0, 10]
[2, 0, 2, 0, 8, 0, 18, 2, 0, 8, 0, 10]

Perhaps it is stored in a hash.
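One possible reading of the byte lists above, offered as an inference from these numbers and not checked against dump_utils.c: the data may be a sequence of fixed-width records, each holding a 2-byte big-endian total, a 1-byte count of values (always 2 here), and then that many 2-byte big-endian read counts. Under that assumption, `0, 19, 2, 0, 3, 0, 16` decodes to `19 3 16`, and 3 + 16 = 19. It would also explain the odd line breaks: the byte value 10 is a legitimate count, so reading the file line by line splits records wherever a 10 occurs. The Python sketch below decodes a slice of the first list starting at a record boundary; `decode_records` is a hypothetical helper, and any header the real files may carry is not handled.

```python
import struct


def decode_records(buf):
    """Decode records under the ASSUMED layout:
    uint16 total (big-endian), uint8 n, then n x uint16 counts (big-endian).
    """
    records = []
    offset = 0
    while offset + 3 <= len(buf):
        total, n = struct.unpack_from(">HB", buf, offset)
        offset += 3
        if offset + 2 * n > len(buf):
            break  # truncated trailing record
        counts = struct.unpack_from(f">{n}H", buf, offset)
        offset += 2 * n
        records.append((total, *counts))
    return records


# Bytes 5..25 of the first list above, i.e. starting at a record boundary.
sample = bytes([0, 19, 2, 0, 3, 0, 16,
                0, 19, 2, 0, 8, 0, 11,
                0, 17, 2, 0, 3, 0, 14])

for record in decode_records(sample):
    print("\t".join(str(value) for value in record))
```

Reading the whole file with `open(path, "rb").read()` and decoding it in one pass, instead of iterating over lines, avoids the spurious splits.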

I'm not against writing this in C as well, though I'm not nearly as experienced there. Any ideas how I might get a tab-delimited format (like the output of nQuire view) into the bin format required for your downstream analysis? I could easily write those out for each sample and analyze them as you intend. I just have no idea what they need to look like in your binary format.
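A minimal sketch of the text-to-binary direction, in Python. The record layout used here (2-byte big-endian total, 1-byte number of counts, then 2-byte big-endian counts) is an assumption read off the byte lists quoted in this post, not something confirmed from nQuire's source. `pack_records` is a hypothetical helper, and any header or trailer a real .bin file may need is not written, so the output should be checked with `nQuire view` before being used downstream.

```python
import struct


def pack_records(lines):
    """Pack whitespace-delimited 'total count count ...' lines into bytes.

    ASSUMED layout per record: uint16 total, uint8 n, n x uint16 counts,
    all big-endian with no padding.
    """
    out = bytearray()
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        total, *counts = (int(part) for part in parts)
        out += struct.pack(f">HB{len(counts)}H", total, len(counts), *counts)
    return bytes(out)


# The `nQuire view` lines quoted earlier in this post.
view_lines = ["13 5 8", "12 3 9", "14 4 10", "22 4 18", "10 4 6"]

packed = pack_records(view_lines)
print(list(packed[:7]))  # first record as byte values
print(len(packed))       # 5 records x 7 bytes each
```

Writing the result is then a matter of `open("sample.bin", "wb").write(packed)`, where `sample.bin` is a placeholder name.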

If I can get this to work I'll happily share it with everyone!

I've solved my problem. We can now convert heterozygous SNPs from UNEAK, TASSEL 5 or Stacks into a format acceptable to nQuire. I'll post the solution soon.

Hi! I'm glad you were able to figure this out! As you probably noticed, the .bin file is just a flat binary dump of numbers with a specified structure; the read/write routines for it are defined in dump_utils.h and dump_utils.c. The most important part here is that each number has a specified bit-width, which you will need to reproduce when interacting with .bin files.
I used to have a prototype for creating a .bin file from a VCF in C (also using htslib) that I meant to share here, but I'm afraid it's on a computer that I no longer have access to... I'm happy to have a look at your solution, though, if that would be helpful.
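To illustrate why the bit-width matters, this short Python sketch packs the same value with several integer widths and byte orders. It does not claim which of these nQuire uses; the authoritative widths are whatever dump_utils.c reads and writes.

```python
import struct

value = 13

# The same number produces different bytes depending on width and byte order.
formats = {
    "uint8": "B",
    "uint16 big-endian": ">H",
    "uint16 little-endian": "<H",
    "uint32 big-endian": ">I",
}
encodings = {label: list(struct.pack(fmt, value)) for label, fmt in formats.items()}

for label, encoded in encodings.items():
    print(f"{label}: {encoded}")
```

A reader that expects one width and receives another will consume the wrong number of bytes per value, and every later value in the file will be misaligned.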