RFC: New registry format and append-only log file

ericmj opened this issue 8 years ago

Eric Meadows-Jönsson commented 8 years ago

https://gist.github.com/ericmj/36bf9e64c0566dfad277

The new registry format file is 33% of the size of the ets file, and 60% of its size when gzipped (with a registry built from hex.pm on 2016-03-12).

/cc @hexpm/contributors @hexpm/rebar3

Fred Hebert · Answer 1 · Mon Mar 14 2016 20:08:50 GMT+0800 (China Standard Time)

Things I noticed after a quick run through:

arrays have an implicit limit of 255 elements, although their total length is undefined
strings also have an implicit limit of 255 bytes due to their length format.
the package id is a 32-bit unsigned int; is this expected to be unique, and unique across indexes or local to each? Are we okay with the limit of 4 million or so packages?
String encoding is not defined
Is there any reason that DEPENDENCY and RELEASE sections are not implicitly tied to a package definition and are otherwise related by the order they're mapped in a file? I'm wondering here, since we can skip entries we do not understand, whether this risks failing or helps future-proofing; a new type of unknown field, for example, may or may not be seen as interrupting the prior type of package
the append-only log file does have the implication that it (or consumers) should not expect it to be compacted. Is this intended or is compaction a feature?
Mix releases and dependencies cannot be removed. You can REMOVE INSTALL but there is no ADD INSTALL command
the append-only log file may see its entries skipped in future implementations, are we okay with that or should all entries be understood for it to work?
No specification on how to behave if a section is understood but not its types; this is a problem that can happen either in future releases (say new string encodings are added) or when data is corrupted. Should the entire section be skipped, or only the unknown element?
In the fetch request, someone asking for a byte offset that points to the STRING of a package name may yield ways to corrupt/hijack a feed locally by, for example, having a STRING that maps to a REMOVE RELEASE command. Either the spec should mention requiring the byte offset to match record boundaries, or use the standard other-range-unit: <token> header in the RFC to use record offsets
Do note that this request above could be accidental if I make a byte offset request after a log has been rebuilt. The verification for the rebuild needs to also be done on the server and the client must send the expected build they use to avoid the conflict; an error for a bad request would make sense there.
Using content-length:1 to mean "things are up to date" and then asking people to ignore the body is subverting things. Why not just have a content-length of 0?

Eric Meadows-Jönsson · Answer 2 · Mon Mar 14 2016 21:21:48 GMT+0800 (China Standard Time)

Thanks for your comments @ferd, I will respond to each inline.

arrays have an implicit limit of 255 elements, although their total length is undefined
strings also have an implicit limit of 255 bytes due to their length format.

I will clarify this.

the package id is a 32-bit unsigned int; is this expected to be unique, and unique across indexes or local to each? Are we okay with the limit of 4 million or so packages?

They are expected to be unique for the repository hosting the registry - I will clarify this in the spec. The limit will be 4 billion packages which I think will be fine.
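To make the 4 billion figure concrete, here is a minimal sketch of the range of a 32-bit unsigned ID. The function name `encode_package_id` and the big-endian byte order are assumptions for illustration; the gist may define the byte order differently.

```python
import struct

# Largest value a 32-bit unsigned integer can hold: 4,294,967,295.
MAX_PACKAGE_ID = 2**32 - 1


def encode_package_id(package_id: int) -> bytes:
    """Pack a package ID into 4 bytes (big-endian is an assumption)."""
    if not 0 <= package_id <= MAX_PACKAGE_ID:
        raise ValueError("package id does not fit in 32 unsigned bits")
    return struct.pack(">I", package_id)
```

So the ID space holds 2^32 distinct values, roughly 4.29 billion, per repository.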

String encoding is not defined

I will clarify that the encoding should be UTF-8, and will create a new fixed-length binary type for the checksums.
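As an illustration of where the 255-byte limit comes from, here is a minimal sketch of a length-prefixed UTF-8 string. It assumes a single length byte in front of the data, which is what the 255 limit implies; the function names are hypothetical and not taken from the gist.

```python
def encode_string(text: str) -> bytes:
    """Encode text as one length byte followed by its UTF-8 bytes."""
    data = text.encode("utf-8")
    if len(data) > 255:
        # The limit applies to encoded bytes, not to characters.
        raise ValueError("string longer than 255 bytes after UTF-8 encoding")
    return bytes([len(data)]) + data


def decode_string(buf: bytes, offset: int = 0) -> tuple[str, int]:
    """Decode a string at offset; return it with the offset just past it."""
    length = buf[offset]
    start = offset + 1
    end = start + length
    if end > len(buf):
        raise ValueError("truncated string")
    return buf[start:end].decode("utf-8"), end
```

Note that with a multi-byte encoding the limit is reached sooner than 255 characters: 128 copies of "é" already take 256 bytes.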

Is there any reason that DEPENDENCY and RELEASE sections are not implicitly tied to a package definition and are otherwise related by the order they're mapped in a file? I'm wondering here, since we can skip entries we do not understand, whether this risks failing or helps future-proofing; a new type of unknown field, for example, may or may not be seen as interrupting the prior type of package

Like you, I have been thinking about making the RELEASE section part of the PACKAGE section and DEPENDENCY part of RELEASE. It simplifies the spec, but it has the downside that we need to change the section size type from INT16 to INT32, because packages with many releases may overflow the INT16. I believe we can maintain backwards compatibility because new fields should not change section ordering, and we can introduce new section types that can be interspersed between PACKAGE, DEPENDENCY and RELEASE without breaking their relations. I will generate a registry with your proposed change and see if it has any major impact on file size. If there is only a small impact we should definitely make the change.

the append-only log file does have the implication that it (or consumers) should not expect it to be compacted. Is this intended or is compaction a feature?

It can be compacted; that's what the rebuild counter header is for. If the server's counter and your local counter do not match, the client knows the log file has been rebuilt from scratch (which could be because it was compacted). Should I clarify this?
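A minimal sketch of that client-side decision, reading the rebuild counter as "a mismatch means the log was rebuilt". The function name and the returned action labels are hypothetical; the spec only defines the counter itself.

```python
def plan_fetch(local_counter: int, server_counter: int, local_size: int) -> tuple[str, int]:
    """Decide how to update the local log copy.

    Returns an (action, offset) pair, where offset is the byte position
    the next fetch should start from.
    """
    if local_counter != server_counter:
        # The log was rebuilt (possibly compacted): local bytes are stale.
        return ("refetch", 0)
    # Same build: only bytes appended after our local copy are needed.
    return ("append", local_size)
```

For example, `plan_fetch(3, 4, 1024)` discards the local copy and starts over, while `plan_fetch(4, 4, 1024)` continues from byte 1024.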

Mix releases and dependencies cannot be removed. You can REMOVE INSTALL but there is no ADD INSTALL command

That's a bug, REMOVE INSTALL should be REMOVE MIX CLIENT RELEASE.

the append-only log file may see its entries skipped in future implementations, are we okay with that or should all entries be understood for it to work?

We should be okay with skipping log file sections, with the important note that all sections should still be checksummed. I will clarify this.
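A minimal sketch of skipping unknown sections while still checksumming every byte. The section layout (1-byte type, 2-byte big-endian payload size), the set of known type numbers, and the use of SHA-256 are all assumptions for illustration; none of them is taken from the gist.

```python
import hashlib
import struct

# Assumed header: 1-byte section type, 2-byte big-endian payload size.
HEADER = struct.Struct(">BH")
# Hypothetical type numbers this client understands.
KNOWN_TYPES = {1, 2, 3}


def read_sections(log: bytes) -> tuple[list[tuple[int, bytes]], str]:
    """Return the understood sections and a digest over ALL sections."""
    digest = hashlib.sha256()
    understood = []
    offset = 0
    while offset < len(log):
        if offset + HEADER.size > len(log):
            raise ValueError("truncated section header")
        section_type, size = HEADER.unpack_from(log, offset)
        end = offset + HEADER.size + size
        if end > len(log):
            raise ValueError("truncated section payload")
        # Unknown sections still feed the checksum before being skipped.
        digest.update(log[offset:end])
        if section_type in KNOWN_TYPES:
            understood.append((section_type, log[offset + HEADER.size:end]))
        offset = end
    return understood, digest.hexdigest()
```

Because the size prefix tells the reader how far to jump, an older client can pass over a section type it has never seen and still verify the whole log.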

No specification on how to behave if a section is understood but not its types; this is a problem that can happen either in future releases (say new string encodings are added) or when data is corrupted. Should the entire section be skipped, or only the unknown element?

We can add a checksum header to ensure data integrity; the signature header already has this check, sort of. New types can only be added to new fields, and new fields can only be added to the end of a section. This means new types should not break backwards compatibility.

In the fetch request, someone asking for a byte offset that points to the STRING of a package name may yield ways to corrupt/hijack a feed locally by, for example, having a STRING that maps to a REMOVE RELEASE command. Either the spec should mention requiring the byte offset to match record boundaries, or use the standard other-range-unit: <token> header in the RFC to use record offsets

I will clarify that byte offsets should only be on section boundaries. There should be no reason why a client would use a byte offset that is not on a section boundary.
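A minimal sketch of checking that a resume offset falls on a section boundary. As an assumption for illustration, each section is taken to start with a 1-byte type and a 2-byte big-endian payload size; the helper names are hypothetical.

```python
import struct

# Assumed header: 1-byte section type, 2-byte big-endian payload size.
HEADER = struct.Struct(">BH")


def section_boundaries(log: bytes) -> set[int]:
    """Collect every offset at which a section starts or the log ends."""
    boundaries = {0}
    offset = 0
    while offset < len(log):
        if offset + HEADER.size > len(log):
            raise ValueError("truncated section header")
        _section_type, size = HEADER.unpack_from(log, offset)
        offset += HEADER.size + size
        if offset > len(log):
            raise ValueError("truncated section payload")
        boundaries.add(offset)
    return boundaries


def is_valid_resume_offset(log: bytes, offset: int) -> bool:
    """True only if offset sits exactly between two sections."""
    return offset in section_boundaries(log)
```

A client that only ever resumes from the end of its last fully parsed section satisfies this check by construction, which is why an offset in the middle of a STRING should never arise in normal use.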

Do note that this request above could be accidental if I make a byte offset request after a log has been rebuilt. The verification for the rebuild needs to also be done on the server and the client must send the expected build they use to avoid the conflict; an error for a bad request would make sense there.

You should check the rebuild counter header before starting to interpret the bytes. I will clarify this.

Using content-length:1 to mean "things are up to date" and then asking people to ignore the body is subverting things. Why not just have a content-length of 0?

I explain this in the footnote. The reason is that servers seldom return a content-length of 0 on a range request. Instead of responding with a 416 Range Not Satisfiable and a content-length of 0, they respond with 200 OK and the full body, which is exactly what we don't want. To work around this issue, we fetch one extra byte to ensure the range is always satisfiable. Also note that the specification should work on as many HTTP servers as possible. Fastly, which we use as our CDN, has this behaviour, for example.

See the note in the RFC https://tools.ietf.org/html/rfc7233#section-4.4.
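A minimal sketch of the one-extra-byte workaround from the client's side. It assumes the client asks for a range starting at its last local byte, so a 1-byte response means "up to date"; the function names are hypothetical, and no network call is made here.

```python
def range_header(local_size: int) -> dict[str, str]:
    """Build a Range header that re-requests the last byte we already have."""
    if local_size < 1:
        raise ValueError("nothing stored locally; fetch the whole file instead")
    # Byte offsets are zero-based, so the last local byte is local_size - 1.
    return {"Range": f"bytes={local_size - 1}-"}


def new_bytes(status: int, body: bytes) -> bytes:
    """Strip the overlapping byte from a 206 response body."""
    if status != 206:
        raise ValueError("server did not honour the Range header")
    # The first byte duplicates our last local byte; the rest is new data.
    return body[1:]
```

With this scheme the range is always satisfiable: a body of exactly one byte yields no new data, and anything longer is appended to the local log after dropping the first byte.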

Eric Meadows-Jönsson · Answer 3 · Mon Mar 14 2016 22:01:38 GMT+0800 (China Standard Time)

Updated the gist based on @ferd's comments.

Eric Meadows-Jönsson · Answer 4 · Wed Jun 15 2016 18:38:54 GMT+0800 (China Standard Time)

Closing this in favor of another, simpler proposal in the coming days that splits the registry into multiple files.