D4 file format magic number
jmarshall opened this issue
`d4/src/d4file/mod.rs` defines `FILE_MAGIC_NUM` as `b"d4\xdd\xdd"`, and the four characters do indeed appear in that order in .d4 files:
$ od -tx1z mpileup.1.d4 | head -1
0000000 64 34 dd dd 00 00 00 00 00 00 00 00 00 00 00 00 >d4..............<
However, the Supplementary Notes in the paper describe the file header as:
Offset | Name | Type | Value |
---|---|---|---|
0 | File Magic Number | [u8;4] | "\xdd\xddd4" |
4 | Format Version | [u8;4] | [0,0,0,0] |
8 | Frame File Root | Directory | Primary Size = 512 |
Regardless of which way around you read the magic number value (to allow for endianness differences in exposition), it is inconsistent with the value actually used.
I suspect this is best considered a typo in the Supplementary Notes.
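To make the mismatch concrete, here is a minimal sketch of a magic-number check. The constant mirrors the `FILE_MAGIC_NUM` value quoted above from the source; the helper name `looks_like_d4` is hypothetical and is not part of the d4 crate or htslib.

```rust
/// Magic number as quoted from d4/src/d4file/mod.rs (FILE_MAGIC_NUM).
const FILE_MAGIC_NUM: &[u8] = b"d4\xdd\xdd";

/// Hypothetical helper: returns true if `header` begins with the D4 magic
/// number. Inputs shorter than four bytes never match.
fn looks_like_d4(header: &[u8]) -> bool {
    header.starts_with(FILE_MAGIC_NUM)
}
```

Fed the first bytes of the `od` dump above (`64 34 dd dd ...`), this returns true. Fed the value printed in the Supplementary Notes, read in either direction (`"\xdd\xddd4"` or `"4d\xdd\xdd"`), it returns false.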
I am also interested in how you envision the version field being used in future. I am looking at adding D4 to htslib's file format detection routines, and at the moment have `htsfile` printing out
$ htsfile mpileup.1.d4
mpileup.1.d4: D4 version 0.0 genomic region data
by interpreting the 4 “Format Version” bytes as `[u16_le;2]` major.minor. However, it may be best for now not to attempt to decode the version bytes and just print out `mpileup.1.d4: D4 genomic region data`.
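For reference, the provisional `[u16_le;2]` reading could be sketched as follows. This is only an illustration of that interpretation, which the format does not define; the function names `decode_version` and `describe` are hypothetical and do not correspond to anything in htslib or the d4 crate.

```rust
/// Hypothetical helper: read the four "Format Version" bytes at offset 4 as
/// two little-endian u16 values, (major, minor). Returns None if the header
/// is shorter than 8 bytes.
fn decode_version(header: &[u8]) -> Option<(u16, u16)> {
    let v = header.get(4..8)?;
    let major = u16::from_le_bytes([v[0], v[1]]);
    let minor = u16::from_le_bytes([v[2], v[3]]);
    Some((major, minor))
}

/// Hypothetical helper: build a description line, with or without the
/// decoded version, in the style of the htsfile output shown above.
fn describe(header: &[u8], show_version: bool) -> String {
    match (show_version, decode_version(header)) {
        (true, Some((major, minor))) => {
            format!("D4 version {}.{} genomic region data", major, minor)
        }
        _ => String::from("D4 genomic region data"),
    }
}
```

With the all-zero version bytes from the dump above, `describe(header, true)` gives `D4 version 0.0 genomic region data`, and `describe(header, false)` gives the plainer `D4 genomic region data`.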
> I suspect this is best considered a typo in the Supplementary Notes.
Yes, you are right; it should be the value defined in the source code file.
> by interpreting the 4 “Format Version” bytes as `[u16_le;2]` major.minor. However it may be best for now not to attempt to decode the version bytes and just print out `mpileup.1.d4: D4 genomic region data`.
Yes, that's the case. The version bytes are reserved for future use, to distinguish any breaking changes we may want to make to the format. So for now I think we can ignore them until there is a real need to use them.
@jmarshall let us know of any other issues you see with respect to htslib support for D4. I have been asked to file an issue for htsget to propose support for interval and quantitative interval formats. I plan to do that this weekend and would welcome your thoughts there.