yeslogic / fathom

🚧 (Alpha stage software) A declarative data definition language for formally specifying binary data formats. 🚧

Challenges arising from the OpenType `glyf` table

mikeday opened this issue

The OpenType glyf table contains a run-length encoded array of flags that determines the types of the heterogeneous arrays that follow it. Describing this structure poses a number of challenges that also appear in other binary formats.

Sequences of variably sized items

Consider these two format constructors:

Array length format : Format
Repeat length format : Format

Both of them represent the specified format repeated length times, but Array has the additional constraint that the format must be fixed size so that the offset of every element and the size of the entire array is known in advance without needing to read its content.

Since Array is a strict subset of Repeat it does not add any power to the language, but being able to assert that a value is intended to have array semantics and have the correctness of that assertion checked by the compiler is valuable for authors and readers alike.

(Both format constructors may well use the same parsed representation type).
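
One way to picture the distinction is to give every format an optional static size and have Array refuse element formats that lack one. The Python sketch below is illustrative only: `Prim`, `RepeatUntilEnd`, `size`, and `array` are hypothetical names, not Fathom's API, and a real implementation would perform this check in the type checker rather than at run time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Prim:
    """A primitive format with a known byte width (e.g. U8, U16Be)."""
    width: int

    def size(self) -> Optional[int]:
        return self.width


@dataclass(frozen=True)
class Repeat:
    """`length` copies of `elem`; elements may be variably sized."""
    length: int
    elem: object

    def size(self) -> Optional[int]:
        inner = self.elem.size()
        return None if inner is None else self.length * inner


@dataclass(frozen=True)
class RepeatUntilEnd:
    """Hypothetical stand-in for any format whose size depends on the data."""
    elem: object

    def size(self) -> Optional[int]:
        return None


def array(length: int, elem) -> Repeat:
    """Like Repeat, but rejects element formats without a static size."""
    if elem.size() is None:
        raise TypeError("Array requires a fixed-size element format")
    # Same representation as Repeat; only the checked constraint differs.
    return Repeat(length, elem)


U8, U16BE = Prim(1), Prim(2)
flags = array(4, U16BE)
print(flags.size())  # 8
```

Because `array` returns an ordinary `Repeat`, both constructors share one parsed representation here, and the offset of element `i` is simply `i * elem.size()`.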

Sequences of unspecified length (#352)

It is not unusual for binary formats to contain repeating sequences whose length is unspecified but can be determined by inspecting the data they contain, for example "an array of bytes that sums to 100". These can be described with a RepeatUntil or RepeatWhile format constructor which finds the smallest length that satisfies the specified condition.
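
As a rough illustration of the "sums to 100" case, the Python sketch below reads one element at a time and stops at the first length for which the condition holds. The names `read_u8` and `repeat_until` are made up for this example and do not correspond to anything in Fathom.

```python
import io
from typing import Callable, List


def read_u8(stream: io.BytesIO) -> int:
    """Read a single unsigned byte, failing at end of data."""
    byte = stream.read(1)
    if not byte:
        raise EOFError("unexpected end of data")
    return byte[0]


def repeat_until(stream, read_elem, done: Callable[[List[int]], bool]) -> List[int]:
    """Read elements until `done(items)` holds, giving the smallest such length."""
    items: List[int] = []
    while not done(items):
        items.append(read_elem(stream))
    return items


data = io.BytesIO(bytes([40, 35, 25, 7, 7]))
run = repeat_until(data, read_u8, lambda xs: sum(xs) == 100)
print(run)          # [40, 35, 25]
print(data.tell())  # 3
```

If the condition is never met, this sketch runs off the end of the data and raises `EOFError`, which is one possible way to report a malformed input.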

Some binary formats contain repeating sequences whose length cannot be determined by inspecting the data they contain, in which case they either run to the end of the current file section (which is at least known in advance) or until they hit some other data structure which is reached via an offset. The Apple morx table regularly uses this design pattern of offsets to arrays of unknown length that are indexed from other arrays; the length could conceivably be determined by analysis of the other data but it is expected that implementations will simply index into them as required and only fail if some indices fall completely outside the current file section.
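
The "index as required, fail only outside the section" behaviour might look something like the following Python sketch. It assumes an array of big-endian 16-bit values purely for illustration; `UnboundedU16Array` is a hypothetical name and the layout is not taken from the morx specification.

```python
import struct


class UnboundedU16Array:
    """View of big-endian u16 values starting at `offset`, with no stored length."""

    def __init__(self, section: bytes, offset: int):
        self.section = section
        self.offset = offset

    def __getitem__(self, index: int) -> int:
        start = self.offset + 2 * index
        # The only bound we know is the end of the enclosing section.
        if index < 0 or start + 2 > len(self.section):
            raise IndexError("index falls outside the current section")
        return struct.unpack_from(">H", self.section, start)[0]


section = bytes([0, 0, 0, 1, 0, 2, 0, 3])
table = UnboundedU16Array(section, offset=2)
print(table[0], table[2])  # 1 3
```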

Expanding run-length encoded data

Binary formats may use run-length encoding to compress arrays that feature repeated elements into (data, count) pairs. However, the resulting packed array requires expansion before implementations can index into it. This requires a "dup" primitive to duplicate the data by the required count and a "flatten" or "concat" primitive to flatten the resulting nested sequence.
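
A minimal Python sketch of the two primitives, assuming the packed array has already been parsed into (data, count) pairs; `dup` and `flatten` are illustrative helper names, not existing Fathom primitives.

```python
from itertools import chain


def dup(value, count):
    """Duplicate `value` `count` times."""
    return [value] * count


def flatten(nested):
    """Concatenate a sequence of sequences into one flat list."""
    return list(chain.from_iterable(nested))


packed = [("A", 2), ("B", 1), ("C", 3)]
expanded = flatten(dup(value, count) for value, count in packed)
print(expanded)  # ['A', 'A', 'B', 'C', 'C', 'C']
```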

Another approach is to map the packed array to an array of ranges with a loop primitive capable of threading through the last index from the previous value. For example the packed array [A*2, B*1, C*3] which would expand to [A,A,B,C,C,C] could instead be mapped to [{A,0,2}, {B,2,1}, {C,3,3}]. This array of ranges can then be searched in a manner similar to the Class and Coverage tables in OpenType Layout used for mapping ranges of glyphs to integer indices.
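
The range mapping could be sketched in Python as below, with the running start index threaded through the loop and a binary search for lookup. `to_ranges` and `lookup` are hypothetical helpers, and each range is stored as (value, start, count) to match the example.

```python
from bisect import bisect_right


def to_ranges(packed):
    """Turn (value, count) pairs into (value, start, count) ranges."""
    ranges = []
    start = 0  # threaded through from the previous range
    for value, count in packed:
        ranges.append((value, start, count))
        start += count
    return ranges


def lookup(ranges, index):
    """Find the value covering `index` without expanding the array."""
    starts = [start for _, start, _ in ranges]
    pos = bisect_right(starts, index) - 1
    if pos < 0:
        raise IndexError("index out of range")
    value, start, count = ranges[pos]
    if not (start <= index < start + count):
        raise IndexError("index out of range")
    return value


ranges = to_ranges([("A", 2), ("B", 1), ("C", 3)])
print(ranges)                                  # [('A', 0, 2), ('B', 2, 1), ('C', 3, 3)]
print([lookup(ranges, i) for i in range(6)])   # ['A', 'A', 'B', 'C', 'C', 'C']
```

A real implementation would keep the start offsets alongside the ranges rather than rebuilding the `starts` list on every lookup; it is recomputed here only to keep the sketch short.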

(And sometimes you may be streaming the compressed array in parallel with some other array, in which case it may be expanded but only one element at a time without needing to persist in its expanded form; this could just be lazy evaluation though).
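
For the streaming case, a Python generator gives a rough idea of one-element-at-a-time expansion. The flag values and the `coords` array are invented for the example.

```python
def expand_lazily(packed):
    """Yield each value `count` times without building the expanded list."""
    for value, count in packed:
        for _ in range(count):
            yield value


packed_flags = [("on", 2), ("off", 1)]
coords = [10, 20, 30]

# zip pulls one expanded flag per coordinate; nothing is persisted in between.
paired = list(zip(expand_lazily(packed_flags), coords))
print(paired)  # [('on', 10), ('on', 20), ('off', 30)]
```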

Heterogeneous sequences

Consider this format:

def ouch = {
    length: U16Be,
    types: Array length U8,
    values: Repeat length ???
}

Each element of the types array is 1, 2, or 4, specifying whether the corresponding element in the values array is a U8, U16Be, or U32Be. But how do we type this? We could introduce a more powerful constructor for repeating formats where the element format is a function of the index. However, we still want every element to have the same parsed representation type, so we immediately hit the issue of being unable to constrain formats in this way (#354).

Can a dependent format be lifted to a non-dependent format in every case? For example any function from integer to format can be lifted to a record that includes the argument as a field, making the representation type homogeneous again.
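
To make the lifting idea concrete, here is a hedged Python sketch that parses the `ouch` layout and stores every value as a record carrying its own type tag, so that the values list is homogeneous. `Tagged`, `FORMATS`, and `parse_ouch` are names invented for this example; Python cannot check the typing question, so this only shows the shape of the representation.

```python
import struct
from typing import List, NamedTuple


class Tagged(NamedTuple):
    """A value lifted into a record that also records its type tag."""
    tag: int
    value: int


# Type tag -> struct format for U8, U16Be, U32Be.
FORMATS = {1: ">B", 2: ">H", 4: ">I"}


def parse_ouch(data: bytes) -> List[Tagged]:
    (length,) = struct.unpack_from(">H", data, 0)
    types = list(data[2:2 + length])
    if len(types) != length:
        raise ValueError("truncated types array")
    offset = 2 + length
    values = []
    for tag in types:
        fmt = FORMATS.get(tag)
        if fmt is None:
            raise ValueError(f"unknown type tag {tag}")
        (value,) = struct.unpack_from(fmt, data, offset)
        values.append(Tagged(tag, value))
        offset += struct.calcsize(fmt)
    return values


sample = bytes([0, 3,  1, 2, 4,  0x7F,  0x01, 0x00,  0x00, 0x01, 0x00, 0x00])
print(parse_ouch(sample))
# [Tagged(tag=1, value=127), Tagged(tag=2, value=256), Tagged(tag=4, value=65536)]
```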

Yeah I definitely agree that the name of the array formats in fathom is currently a bit of a misnomer. Having a fixed size version could definitely be useful!

In the Idris experiments I came up with this approach:

||| Construct a format based on a type tag
value : Nat -> Format
value 1 = u8
value 2 = u16Be
value 4 = u32Be
value _ = fail


||| An annoying example from: https://github.com/yeslogic/fathom/issues/394
ouch : Format
ouch = do
  len <- u16Be
  types <- repeat len u8
  values <- tuple (map value types)
  --        ^^^^^ heterogeneous sequence of formats
  pure ()

Where tuple is defined as:

tuple : {len : Nat} -> Vect len Format -> Format

Repr (tuple tys) = HVect (map Repr tys)

This uses the HVect type as the in-memory representation. I'm not sure how well this scales however. Still pondering.

And yeah, the issues with indexing types by the length are something to consider. An alternative approach could be to define length-limited sequences as a refinement on sequences of unknown length. This is what Lean does for their vector type, for instance. The issue there is managing the proof... which could mean looking into refinement types, alas.