richardlehane / siegfried

signature-based file format identification

Home Page:http://www.itforarchivists.com/siegfried

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

LoC+PROMOM identifier ids indexes aren't being built as one might expect, affecting the basis field results

ross-spencer opened this issue · comments

When an LoC identifier is built with PRONOM the sources are added as follows:

  1. All LoC sigs and ids are generated.
  2. The PRONOM identifiers we want for LoC records that have a PUID are then generated and attached.

Signatures and IDs are then returned to the identifier when it's built.

For the byte matcher this is done here.

OGG is a good example format here as the LoC identifier expects to have one LoC record with a byte-pattern and one PRONOM sequence that should match against it as well.

If we test against the OGG skeleton the ideal result is as follows:

$ ./sf ogg/fmt-203-signature-id-504.ogg
---
siegfried   : 1.8.0
scandate    : 2020-06-07T21:40:23-04:00
signature   : default.sig
created     : 2020-06-07T21:39:38-04:00
identifiers : 
  - name    : 'loc'
    details : 'fddXML.zip (2016-12-13, DROID_SignatureFile_V96.xml, container-signature-20200121.xml)'
---
filename : 'ogg/fmt-203-signature-id-504.ogg'
filesize : 62
modified : 2020-06-07T15:57:34-04:00
errors   : 
matches  :
  - ns      : 'loc'
    id      : 'fdd000026'
    format  : 'Ogg File Format'
    full    : 'Ogg File Format'
    mime    : 'application/ogg'
    basis   : 'extension match ogg; byte match at 0, 6 (signature 2/2)'
    warning : 

The Place() function inside the identifier looks up the IDs index, and returns position and total number of signatures for the matching pattern.

Again, the ideal for these indexes (I believe) should be as follows:

Indexes IDs (base.Place() [fdd000019 fdd000019 fdd000022 fdd000022 fdd000022 fdd000022 fdd000022 fdd000026 fdd000026 fdd000027 fdd000027 fdd000031 fdd000031]

Note that all of the identifiers run contiguously for each other within the slice, i.e. 0026 entries are next to each other, 0027 entries follow each other, etc.

Actual:

What we're seeing currently is:

Indexes IDs (base.Place() [fdd000019 fdd000022 fdd000022 fdd000022 fdd000026 fdd000027 fdd000031 fdd000019 fdd000027 fdd000022 fdd000022 fdd000031 fdd000026]

Note that 0026 entries are 4th and last in the slice. Siegfried will stop looping through the indexes to calculate the position and total number before it finds the 2nd-n identifier it is supposed to find.

The primary impact here seems to be visual. The binary pattern being used and the result returned is still accurate, but instead of:

  • basis : 'extension match ogg; byte match at 0, 6 (signature 2/2)'

Will be:

  • basis : 'extension match ogg; byte match at 0, 6'

Where we don't see the (signature 2/2) value we'd like to know which pattern matched specifically so that we can audit the results in more detail.

---
siegfried   : 1.8.0
scandate    : 2020-06-07T21:38:21-04:00
signature   : default.sig
created     : 2020-06-07T21:35:35-04:00
identifiers : 
  - name    : 'loc'
    details : 'fddXML.zip (2016-12-13, DROID_SignatureFile_V96.xml, container-signature-20200121.xml)'
---
filename : 'ogg/fmt-203-signature-id-504.ogg'
filesize : 62
modified : 2020-06-07T15:57:34-04:00
errors   : 
matches  :
  - ns      : 'loc'
    id      : 'fdd000026'
    format  : 'Ogg File Format'
    full    : 'Ogg File Format'
    mime    : 'application/ogg'
    basis   : 'extension match ogg; byte match at 0, 6'
    warning : 

I've some sample files here to help making recreate this a little easier.

OGG skeleton fmt-203-signature-id-504.zip

Restricted FDD set which is enough to recreate the issue without being the whole set restricted-set-fddXML.zip

OGG only FDD record ogg_fddXML.zip

nice pickup Ross!

It might be nice to fix in a way that all implementations of Identifers could benefit? E.g. by adding a sort feature to parseable.go in the internal/identifier package.

E.g. you could make a sorted implementation of the Parseable interface:
https://gist.github.com/richardlehane/8a8a1fec99f4d2879ed12373901b507f

If you then add a line to the ApplyConfig(p Parseable) func:
p = sorted{p}
you'd fix it?

Nice. I'm pretty confident that works as well. I've submitted your fix on the vanilla develop branch along with a proposed unit test for it. Hopefully that's not looking too bad.

Fixed with 1.9.0