tomas-abrahamsson / gpb

A Google Protobuf implementation for Erlang

How can I recognize several messages in a binary stream?

loguntsov opened this issue · comments

Hello.
Thank you for your library and your time.

I'm using this library with my proto definitions to generate Erlang code to encode/decode messages. But I can't find the logic for generating a stream of messages and parsing that stream from a binary. I would expect such logic to return all the packets parsed from the binary (as a list), plus the rest of the binary, to be accumulated with the next binary packet. But I can't find anything like that.

Could you point me to the right way to get this from your library?

Thank you.

Hi, you need some form of delimiter outside the message, so that you know where the message ends. A common approach is to have a length indicator before the message itself.

Over the wire, a message is just a series of fields, and the protobuf semantics is that if a field occurs multiple times in the binary being decoded, the field value gets merged (sub-messages), appended (repeated fields) or overwritten (optional or required scalars). So the protobuf wire format itself does not have anything that indicates the end of a message; you need to add that yourself. For a file it could be the end of the file; for a byte stream, it could be a length before the message, so you know how long it is.
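
For illustration, here is a minimal sketch of such varint-length framing (the functions frame/1, split_frames/1 and take_varint/3 are made up for this example; they are not part of gpb's API). The sender prefixes each encoded message with its length; the receiver splits an accumulated buffer into complete frames plus the unconsumed rest, which is the {Packets, Rest} shape asked about above:

%% Sketch only, not gpb API: prefix an encoded message with its
%% length as a base-128 varint.
frame(MsgBin) ->
    <<(varint(byte_size(MsgBin)))/binary, MsgBin/binary>>.

varint(N) when N < 128 -> <<N>>;
varint(N) -> <<1:1, (N band 127):7, (varint(N bsr 7))/binary>>.

%% Split a buffer into {CompleteFrames, Rest}; the caller keeps Rest
%% and prepends it to the next chunk read from the stream.
split_frames(Bin) -> split_frames(Bin, []).

split_frames(Bin, Acc) ->
    case take_varint(Bin, 0, 0) of
        {ok, Len, Rest} when byte_size(Rest) >= Len ->
            <<Frame:Len/binary, Rest2/binary>> = Rest,
            split_frames(Rest2, [Frame | Acc]);
        _ -> % incomplete length prefix or incomplete message body
            {lists:reverse(Acc), Bin}
    end.

take_varint(<<0:1, N:7, Rest/binary>>, Shift, Acc) ->
    {ok, Acc bor (N bsl Shift), Rest};
take_varint(<<1:1, N:7, Rest/binary>>, Shift, Acc) ->
    take_varint(Rest, Shift + 7, Acc bor (N bsl Shift));
take_varint(<<>>, _Shift, _Acc) ->
    incomplete.

Each complete frame can then be decoded on its own with the generated decode_msg/2.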

Hi, I'll close this. Feel free to re-open it or open another issue if there are any more questions.

Hello.

Sorry for the late answer.

I think if you look into the code generated by GPB, you can see that it parses all the messages even when they follow each other without any delimiters or sub-packets, but only the last parsed packet is made available outside.

Do you see this? Could you make an interface that returns the list of parsed packets and the rest of the binary? I think it would be a more universal interface.
Thank you.

Hi, no probs. I'm not sure I follow exactly which piece of generated code you are referring to. Could you give an example or a link?

Some pointers to info that may or may not be useful for you:

  • If, on the wire level, you have the format <varint-encoded length of message><message octets>, then you might be interested in gpb:decode_varint/1, which returns {Int,Rest::binary()}, and gpb:decode_packet/3, which uses gpb:decode_varint/1 and has an example of how to use it for a TCP connection.
  • If, on the wire level, you instead have the framing format <4 octets length of message><message>, then you could use, for example, the {packet,4} option to gen_tcp (see the sketch below).
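
To illustrate the second bullet, here is a minimal sketch (the module name x, the message name 'Msg' and Host/Port are assumptions for this example) where gen_tcp itself handles the 4-octet length framing:

%% Sketch only: with {packet,4}, gen_tcp prepends a 4-octet length on
%% send, and delivers exactly one whole message per recv.
{ok, Sock} = gen_tcp:connect(Host, Port, [binary, {packet, 4}, {active, false}]),
ok = gen_tcp:send(Sock, x:encode_msg(#{f => 1}, 'Msg')),
{ok, MsgBin} = gen_tcp:recv(Sock, 0),
Msg = x:decode_msg(MsgBin, 'Msg').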

~ ~ ~

About your idea of changing the API of the generated code to return {Decoded,Rest}: maybe, but (a) it would be quite a bit more complicated to generate such code, and (b) I don't think it would help. I'll elaborate:

(a) It would be more complicated because, on decoding, for each field to decode, the code would need to catch and wrap decoding failures due to an insufficient number of available octets, and this would probably hurt performance quite a bit.

(b) It would not really help, because the problem of knowing when there are no more fields to decode is just moved one step: the programmer would need to know, when there are no more fields, whether or not to call decode again with more data. For that, the programmer would look for some kind of framing, a length field, an end-of-file indication or similar, and we are back where we started.

Illustration: if we have a simple proto, message Msg { optional int32 f = 1 }, then depending on how much data you have, you can keep decoding more and more, given more valid input, due to the semantics of field merging I wrote briefly about earlier. So you normally want some kind of framing, a length or an end-marker or similar, anyway.

47> {ok, x, B} = gpb_compile:string(x, "message Msg { optional int32 f = 1 }", [binary, maps]).
48> code:load_binary(x, "x.erl", B).
49> x:encode_msg(#{f => 1}, 'Msg').
<<8,1>>

50> x:decode_msg(<<8,1>>, 'Msg').
#{f => 1}
51> x:decode_msg(<<8,1, 8,2>>, 'Msg').
#{f => 2}
52> x:decode_msg(<<8,1, 8,2, 8,3>>, 'Msg').
#{f => 3}

What happens here is that each decoded field gets merged into the message; for scalars, this means getting overwritten.

Hello @tomas-abrahamsson,

So, for example, I have this proto file:

syntax = "proto2";

message test {
    optional string uuid  = 1;
}

and your library generates this code: https://gist.github.com/loguntsov/d39d21bc3af09a32372a6e68782f949d

and there is the main loop function which parses the message:

dfp_read_field_def_test(<<10, Rest/binary>>, Z1, Z2, F@_1, TrUserData) -> d_field_test_uuid(Rest, Z1, Z2, F@_1, TrUserData);
dfp_read_field_def_test(<<>>, 0, 0, F@_1, _) -> #test{uuid = F@_1};
dfp_read_field_def_test(Other, Z1, Z2, F@_1, TrUserData) -> dg_read_field_def_test(Other, Z1, Z2, F@_1, TrUserData).

It takes the binary and tries to apply d_field_test_uuid.
d_field_test_uuid gets back to this function with the Rest binary.

The third clause skips some binary data, but gets back to dfp_read_field_def_test again. So it parses each message from the binary stream, but all of them will be forgotten, and only the last parsed message will be the result of the parsing.

Actually, I could call it a bug, but I would prefer to call it a feature :)

If you have many messages, then only the first and second clauses do the work, and only the last message will be the result, because there is no accumulation.

the third clause skips some binary data, ... so it parses each message from the binary stream, but all of them will be forgotten, and only the last parsed message will be the result of the parsing.

Do you mean it is parsing each field from the binary stream? In what way do you mean things are forgotten?

The quoted piece of code is the fast-path decoder, for quickly recognizing when the wire encoding is on the minimal form (which should be the common case). If it fails to recognize a field (the third clause), it proceeds to unpack the field in a more general (but slightly slower) way. If the field is unknown, it is skipped; otherwise it is stored.

You could also try some messages with more fields, with sub-messages and with repeated fields, to study further how fields get merged. Merging of strings works like for integers, which means overwrite. One field is not necessarily the same as one message. Sub-messages are already length-delimited, but the top-level message is not; it is just a stream of fields to be merged (ie overwritten, recursively merged, or added to a sequence). This is how protobuf is defined.
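
For example, in the same style as the shell session above (the module name y and the message M2 are made up for this example; 16 is the tag octet for field number 2 with wire type varint), a repeated field appends rather than overwrites:

53> {ok, y, B2} = gpb_compile:string(y, "message M2 { repeated int32 r = 2 }", [binary, maps]).
54> code:load_binary(y, "y.erl", B2).
55> y:decode_msg(<<16,1>>, 'M2').
#{r => [1]}
56> y:decode_msg(<<16,1, 16,2>>, 'M2').
#{r => [1,2]}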

I meant that all the messages that were parsed will be forgotten, because the binary contains other messages of the same type.
And this function returns only the last message from the whole binary.
Actually, I could make some modifications to the generated code myself. If I do it (based on the current generated code), could you change the API and extend your generator to implement my approach? Then nobody would need message sizes or other separators.

If I do it (based on the current generated code), could you change the API and extend your generator to implement my approach?

Depends. Your changes would of course need to be optional. Existing users expect the current API. There are of course other considerations as well.

And this function returns only the last message from the whole binary.

The last seen value for the field is used, as defined here: https://developers.google.com/protocol-buffers/docs/encoding#optional:

Normally, an encoded message would never have more than one instance of a non-repeated field. However, parsers are expected to handle the case in which they do. For numeric types and strings, if the same field appears multiple times, the parser accepts the last value it sees. [...]