dat-ecosystem-archive / docs

Documentation resources for dat and the surrounding ecosystem [ DEPRECATED - see https://github.com/hypercore-protocol/new-website/tree/master/guides for similar functionality. More info on active projects and modules at https://dat-ecosystem.org/ ]

Home Page:https://dat-ecosystem-archive.github.io/docs/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Please clarify hyperdrive metadata "children"/"trie" encoding

bnewbold opened this issue · comments

I've been working on a from-scratch implementation of dat/hyperstuff/sleep following the whitepaper/spec. One place i've run in to trouble is understanding directory structures are supposed to be encoded in the metadata register (the "Filename Resolution" section of the whitepaper).

A first small question about indexing conventions: in the whitepaper examples are given pointing to metadata entry index [0], but by convention this is always the Header/Index message (containing the public key of the paired content register). I assume this means entry numbers should be translated on lookup (aka, add one to the "hyperdrive index" to get the "hypercore/SLEEP index"), which is the convention i'll use below. I'll note that the dat log output does not do this translation though (starts with 1 index).

To clarify questions around the children/trie encoding, here's an example directory tree:

.
├── Animalia
│   └── Chordata
│       └── Mammalia
│           └── Carnivora
│               ├── Caniformia
│               │   └── Mustelidae
│               │       └── Lutrinae
│               │           └── Enhydra
│               │               └── E_lutris.txt
│               └── Feliformia
│                   └── Felidae
│                       └── Felis
│                           └── F_silvestris.txt
├── datapackage.json
├── Fungi
│   └── Basidiomycota
│       └── Agaricormycetes
│           └── Cantharellales
│               └── Cantharellaceae
│                   └── Cantharellus
│                       ├── C_appalachiensis.txt
│                       └── C_cibarius.txt
└── README.md

Assume these were added in order "README.md", "E_lutris.txt", "F_silvestris.txt", "C_cibarius.txt", "C_appelachiensis.txt". What should the entries look like?

0: name=/README.md
    children=[]
    stat=<protobuf>
1: name=/Animalia/Chordata/Mammalia/Carnivora/Caniformia/Mustelidae/Lutrinae/Enhydra/E_lutris.txt
    children=[[0]]
    stat=<protobuf>
2: name=/Animalia/Chordata/Mammalia/Carnivora/Feliformia/Felidae/Felis/F_silvestris.txt
    children=[[0]]
    stat=<protobuf>
3: name=/Fungi/Basidiomycota/Agaricormycetes/Cantharellales/Cantharellaceae/Cantharellus/C_cibarius.txt
    children=[[0]]
    stat=<protobuf>
4: name=/Fungi/Basidiomycota/Agaricormycetes/Cantharellales/Cantharellaceae/Cantharellus/C_appalachiensis.txt
    children=[[0], [], [], [], [], [], [3]]
    stat=<protobuf>

This clearly can't be correct, because there's no way to go from [4] to find any of the Animalia entries. Should [4]'s children look like [[0,1,2], [], [], [], [], [], [3]]? The paper says:

The children field represents a shorthand way of declaring which other files at every level of the directory hierarchy exist alongside the file being added at that revision.

Does that mean that every single file in the whole drive/archive is pointed to from every appended entry? Or is that the case if there are, eg, 100k files in the root directory, in which case an appended entry would be at least ~100 KB, even with the compressed encoding?

Later "F_silvestris.txt" is edited and updated, then "datapackage.json" gets added. What does it's node look like? Given this [6] node, how do you look up the most recent version of "F_silvestris.txt" (node [5]), without getting confused with [2]? Also, what about deletions?

The current state of the whitepaper looks like it has an early version of multi-writer protocol stuff added, but isn't complete. An older revision seemed more complete and is what I followed in trying to be compatible with stable dat (i'm running 13.8.1).

I took a look at the implementation in https://github.com/mafintosh/append-tree/blob/master/index.js, but there are no comments and i'm not familiar enough with javascript for it to be enlightening.

I did some reverse engineering, and now speculate that:

  • the children sub-arrays for a given entry includes entries for all files and subdirectories for every parent directory of the path; they do not necessarily contain entries for all files in the drive.
  • sub (child) directories at a given path are included, with the newest file at any depth of the directory (which could include this entry itself). So, for example, every new file at any depth will have as it's first subarray the list of all files in the root directory, and one file for each subdirectory of the root directory. this clarifies how recursive lookup works.
  • there is a preceding header varint, which currently has just a single bit, representing "endsWithSeq". If this flag is set, the current entry index is presumed to be appended to every sub-array. This flag seems to always be set, even for deletions.
  • unlike in the whitepaper examples, filenames themselves count as path entries. So, eg, "/README.md" has two path subarrays (one for "/", one for "README.md").
  • the convention is to refer to entries by their register index. The first register entry is the encoded content register public key, so the first drive/directory entry has index 1.

In the current protocol, endsWithSeq is always set for additions, and never set for deletions.