38 / d4-format

The D4 Quantitative Data Format

D4 file format magic number

jmarshall opened this issue · comments

d4/src/d4file/mod.rs defines FILE_MAGIC_NUM as b"d4\xdd\xdd" and the four characters do indeed appear in that order in .d4 files:

$ od -tx1z mpileup.1.d4 | head -1
0000000 64 34 dd dd 00 00 00 00 00 00 00 00 00 00 00 00  >d4..............<

However, the Supplementary Notes in the paper describe the file header as

Offset  Name                       Type    Value
0       File Magic Number          [u8;4]  "\xdd\xddd4"
4       Format Version             [u8;4]  [0,0,0,0]
8       Frame File Root Directory  Primary Size = 512

Regardless of which way around you read the magic number value (to allow for endianness differences in exposition), it is inconsistent with the value actually used.

I suspect this is best considered a typo in the Supplementary Notes.
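To make the discrepancy concrete, here is a minimal sketch of a magic-number check. The constant value is the one quoted above from d4/src/d4file/mod.rs; the helper name `looks_like_d4` is hypothetical and is not part of the d4 crate or htslib.

```rust
// Magic bytes as quoted in this issue from d4/src/d4file/mod.rs.
const FILE_MAGIC_NUM: [u8; 4] = *b"d4\xdd\xdd";

// Hypothetical helper (not from the d4 crate): returns true if `header`
// begins with the four magic bytes that real .d4 files start with.
fn looks_like_d4(header: &[u8]) -> bool {
    header.starts_with(&FILE_MAGIC_NUM)
}
```

Given the first bytes of the `od` dump above (64 34 dd dd ...), this returns true. Given the byte order printed in the Supplementary Notes (dd dd 64 34), it returns false.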


I am also interested in how you envision the version field being used in future. I am looking at adding D4 to htslib's file format detection routines, and at the moment have htsfile printing out

$ htsfile mpileup.1.d4
mpileup.1.d4:	D4 version 0.0 genomic region data

by interpreting the 4 “Format Version” bytes as [u16_le;2] major.minor. However it may be best for now not to attempt to decode the version bytes and just print out mpileup.1.d4: D4 genomic region data.
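For illustration only, the tentative [u16_le;2] reading described above could be sketched as follows. This interpretation is an assumption made for display purposes, not something the format defines, and the function name `tentative_version` is hypothetical.

```rust
// Hypothetical helper: reads header bytes 4..8 as two little-endian u16
// values (major, minor). This layout is an assumed interpretation of the
// "Format Version" field, not one defined by the D4 format.
fn tentative_version(header: &[u8]) -> Option<(u16, u16)> {
    // Return None if the header is too short to contain the version field.
    let v = header.get(4..8)?;
    let major = u16::from_le_bytes([v[0], v[1]]);
    let minor = u16::from_le_bytes([v[2], v[3]]);
    Some((major, minor))
}
```

For the header shown in the `od` dump, where the four version bytes are all zero, this yields `Some((0, 0))`, which corresponds to the "version 0.0" printed by htsfile.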

> I suspect this is best considered a typo in the Supplementary Notes.

Yes, you are right it should be the value defined in the source code file.

> by interpreting the 4 “Format Version” bytes as [u16_le;2] major.minor. However it may be best for now not to attempt to decode the version bytes and just print out mpileup.1.d4: D4 genomic region data.

Yes, that's the case. The version bytes are reserved for future use, to distinguish any breaking changes we may want to make to the format. So for now I think we can ignore them until there is a real need to use them.

@jmarshall let us know of any other issues you see with respect to htslib support for D4. I have been asked to file an issue for htsget to propose support for interval and quantitative interval formats. I plan to do that this weekend and would welcome your thoughts there.