SWI-Prolog / contrib-protobufs

An interface to Google Protocol Buffers (protobuf)

Are you interested in a Proto parser written in Prolog?

wvxvw opened this issue

This isn't a bug report. Just as an exercise, I wrote a Protobuf definition parser in Prolog: https://github.com/wvxvw/protobuf-prolog-parser . It's not perfect (I haven't tested it extensively) and could certainly have more features, such as better error reporting. But if you are interested, I can work on it to add missing features / fix bugs.

I know little about the protobuf infrastructure. I've sent a mail to the creator (Jeff), pointing him to this issue.

Hi:
My name is Jeff. I am the author of the protobufs library for swipl.

First, I think that Google's protobufs is going to be an important technology for the Industrial Internet of Things (IIoT). It is the default serialization method for Sparkplug MQTT payloads. The Sparkplug .proto file is attached. It doesn't parse with your tool. I don't know why. :-(

What is it that you want at the end of the day?

  1. a tool that does what 'protoc' does? That is, a program that writes programs, or
  2. a tool that parses files written in the protobuf language into a hierarchy of terms that can be interpreted elsewhere. Sounds like this is where you're headed, at least for now.

I've been writing parsers for a long time. I like doing it, especially in Prolog. And I like to think that I've gotten pretty good at it. I've learned a few things over the years that perhaps you might benefit from:

  1. Parsers are complicated, and it's easy to get lost in the details.
  2. Parsers are hard to debug, especially when ambiguous grammars are involved.
  3. Start with a grammar specification. If you can't find one, then write one. I use RFC 2234 ABNF.
  4. Include the entire ABNF grammar as commentary in your Prolog source.
  5. For every terminal and production in the ABNF, write a Prolog equivalent.
  6. In no time, you'll have a parser that's traceable to the ABNF.
  7. Unit test as much as possible.
  8. Find or make a 'golden' message. Test against Google's protoc compiler.
  9. Write a 'make check' rule in your Makefile.

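As a tiny illustration of points 3 to 5 above (the example is mine, not taken from Jeff's sources): keep the ABNF production as commentary right next to its Prolog translation, one DCG rule per production, so the parser stays traceable to the grammar.

```prolog
% ABNF (RFC 2234 style), kept as commentary next to its Prolog equivalent:
%
%   ident = ALPHA *( ALPHA / DIGIT / "_" )
%
ident([C|Cs]) --> alpha(C), ident_rest(Cs).

ident_rest([C|Cs]) --> ident_char(C), !, ident_rest(Cs).
ident_rest([])     --> [].

ident_char(C)   --> alpha(C).
ident_char(C)   --> digit(C).
ident_char(0'_) --> [0'_].

alpha(C) --> [C], { code_type(C, alpha) }.
digit(C) --> [C], { code_type(C, digit) }.
```

With this, `phrase(ident(Codes), `my_field1`)` succeeds and binds Codes to the whole identifier, and each rule can be checked against its ABNF line by eye.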
I've never used library(dcg/basics) before, so my source may not be 'contemporary', but it works and it's portable.

For example, I've attached the source for a parser for the ANSI X12 EDI grammar. To test: 'make -f x12.mk check'

sparkplug_b.proto.txt
x12_test.tar.gz

Hi @JeffR7456

I wrote this parser as an exercise mostly. My initial goal was writing a better Python back-end for Protobuf (there are some features missing from the official Protobuf generator, but extending it is really painful).

The fact that something doesn't parse isn't very surprising: I've done only very rudimentary testing so far, but I'll try to understand where I went wrong. The thing is -- and this will answer some of your other questions -- the Protobuf grammar is very poorly defined. I link to the official description in the comments at the top of my parser. That grammar itself has a lot of bugs in it. Some things contradict the existing protoc program; others allow things people probably never write, but who knows if they are valid... At the end of the day, whoever wrote that grammar didn't put much thought into it.

I also found some ANTLR and yacc grammars for Protobuf, which disagree in details about what Protobuf actually is :)

So, to be more specific:

a tool that parses files written in the protobuf language into a hierarchy of terms that can be interpreted elsewhere. Sounds like this is where you're headed, at least for now.

Yep, that would be right. The tool I'm working on for my company's internal needs has to parse *.proto files and figure out some of their details without having to run generators, etc.

Why didn't your example parse? Well, that's because it's Protobuf v2, and my parser is for Protobuf v3. I haven't written the rules for the older format yet. So that's even less surprising.

It's hard for me to promise any particular date. After all, I'm working on this mostly in my free time at the office, but I'll try to write more tests and perhaps improve the parser so that it gives more meaningful error messages when it fails.

Somewhat on a tangent: I'm less of a parser person; I'm more into network protocols / encodings. And... I hope that Protobuf dies a painful death :) It's a horrible idea, but, you know, it's my word against Google's, so what do I know. My particular problems with it are its fragility under slight modifications and its lack of metadata. In practical terms this means that, for example, our company, in order to send Protobuf messages around, had to settle for a single Protobuf message type with two fields: one for metadata and another for a byte array holding the actual encoded message. But that inner message has, again, metadata and a byte array with the actual message (the story repeats 3 times, actually). Testers / QA / Automation are pulling their hair out when they deal with it :)
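The envelope-in-an-envelope arrangement described above looks roughly like this in .proto terms (the names are invented for illustration; this is not the company's actual schema):

```proto
// Hypothetical sketch of the wrapping pattern described above.
syntax = "proto3";

message Envelope {
  map<string, string> metadata = 1; // version / routing info
  bytes payload = 2;                // yet another serialized Envelope,
                                    // until the innermost real message
}
```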

Hi:

You'll get no argument from me on the notion that the .proto grammar itself is poorly defined. I've attached an ABNF grammar that I wrote in 2009. It's for protobufs v2.1.0 (syntax version 2). I don't recall precisely where it came from--reverse engineered, I think. It's mostly complete.

You wrote your grammar for syntax version 3. I believe that version 3 is (or should be) backward compatible with version 2. I may be mistaken, but no matter. I tried a version 3 .proto. That didn't parse either.

The really cool thing about open-source development is that really smart people tend to self-select on the things that interest them. No one will ask when it will be done. But when you say that it is done, it better be a pretty good try.

BTW: the RPC aspect of protobufs is not interesting to me. There are too many security questions to make it worthwhile. I'm interested in portable data interchange with a wide variety of systems and languages.

On the tangent:

As a fellow encoding/protocols guy, I respectfully disagree. I believe that the protobuf wire stream could very well be the last interchange protocol. It's simple enough, open, portable, flexible, compact, regular, lossless, and deals effectively with size and endianness. Everything that you need at the wire-stream level. My library won't win any speed contests, but so far I haven't found anything reasonable that it cannot do.

If you make improvements on the ABNF grammar, please let me know, and feel free to include it in your source.

protobuf.abnf.txt

Back in the day, I actually implemented Protobuf binary encoding for ActionScript (the language Macromedia and Mozilla came up with about 20 years ago, subsequently inherited by Adobe and discontinued recently). I also implemented BSON, Thrift, AMF, MessagePack, MP4 and a few other formats, and I have some ideas about how Erlang manages its wire transfer, although I don't know the details.

What all of these have in common: some tricks for caching, encoding variable-length integers, etc. Some can do it slightly faster; some can produce a slightly more compact binary payload. But this doesn't really matter in the grand scheme of things. Typically, the gains from a fraction of a percent saved on the size of the binary payload will be trumped by bad MTU settings on a router you have no access to.

The important difference, however, is how these formats behave in the context of interop, diverging versions, forward and backward compatibility, ease of debugging, and ease of ensuring correctness. Protobuf-, Thrift-, MessagePack- and MP4-style formats are all terrible on all of those counts. They work well in closed systems, where programmers are guaranteed by their infrastructure that they will never encounter incorrectly encoded messages or messages of the wrong version, and that they will always have a human-readable source they can map the message to. Google's SRE spends tremendous effort to ensure that is the case. Google also has most of its products developed in-house, from in-house developed components. That's why it works well for them.

A company that cannot afford the same level of SRE as Google, that cannot afford to develop all components in-house, that cannot afford to talk exclusively to the programs it designed itself, will suffer from using these formats. I have two examples to support my claims: a company developing a distributed file system for VMware datacenters, and a company doing some pre- and a lot of post-trade services for the world's major banks. Both use Protobuf with horrible consequences, which I attribute to the lack of metadata.

For all the bad coding that came from Macromedia and Adobe, the AMF format they designed was surprisingly good, namely because it was self-describing. The same is true of Erlang messaging. BSON was in a way similar to AMF, but I no longer remember the details. Projects and entire frameworks that used AMF came into existence and disappeared without a trace, but even today you can still find an AMF library, and you will be able to parse the output of a program written 10 years ago and make sense of it. Messages encoded in Protobuf often become obsolete before they can even reach the program they are intended for. Once their description is lost, their meaning is lost forever.

I've found time to try your example .proto. Well, according to this: https://developers.google.com/protocol-buffers/docs/reference/proto3-spec#normal_field optional is no longer supported in field definitions. Another problem is with defining extensions; they are also no longer supported. You said you have a Proto v3 version of this message -- can you post it somewhere, please, so that I could check it? Thanks!

PS. I also found a problem with the oneof definition, where I forgot separators. Now your example parses.

I've created a pull request #3 as a first step towards more "natural" handling of protobufs in Prolog.

My idea is to simply use the protoc --descriptor_set_out data (which itself is in protobuf form) and use that to construct a protobuf compiler for Prolog; that avoids having to deal with parsers for the protobuf language and with making a language-specific plugin. I've also written a "segmenter" for protobufs that does more-or-less what protoc --decode_raw does -- the reason for this is that, AFAICT, the current implementation doesn't allow free ordering of fields (which the protobuf spec requires). (Also, I had trouble understanding how the current implementation works; I'm still working my way through the documentation and examples.)
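The descriptor-set approach sketched above can be driven like this (file names here are placeholders):

```shell
# Ask protoc to emit a FileDescriptorSet -- a protobuf-encoded description
# of example.proto and everything it imports -- instead of generated code.
protoc --include_imports --descriptor_set_out=example.desc example.proto
```

A Prolog tool can then decode example.desc with the existing wire-format support, sidestepping a .proto-language parser entirely.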

And I've started a discussion on a re-implementation of library(protobufs): https://swi-prolog.discourse.group/t/does-anyone-care-about-protobufs/

Added a protoc plugin for swipl: Commit 3424149

There are still a few rough edges (mainly to do with .proto files that import other .proto files); I intend to fix these, but protoc --swipl_out=... is usable now.