golang / protobuf

Go support for Google's protocol buffers

protodesc: feature "message_encoding" (in Protobuf editions) triggers proto2 group validation checks

jhump opened this issue

I'm using latest on main (v1.33.1-0.20240319125436-3039476726e4). I think v1.33.0 may exhibit similar behavior. If not, then perhaps this is a known issue that will be fixed before the next v1.34 release.

The problem is in protodesc.NewFile (and thus also in protodesc.NewFiles), when processing a descriptor for a source file that uses editions and uses the message_encoding feature:

// test.proto
edition = "2023";

package foo.bar;

message Foo {
  Bar abc = 1 [features.message_encoding = DELIMITED];
}
message Bar {
  string name = 1;
}

I can successfully compile this file to a descriptor with protoc -o test.protoset test.proto. But if I load this data, unmarshal into a *descriptor.FileDescriptorSet, and then call protodesc.NewFiles, I get an error:

proto: message field "foo.bar.Foo.abc" is an invalid group: message and field must be declared in the same scope

This is a constraint of proto2 groups and is being incorrectly applied to this field because it uses the same tag-delimited wire format as groups. If I try to address the above error, I get a different error.

 message Foo {
   Bar abc = 1 [features.message_encoding = DELIMITED];
+  message Bar {
+    string name = 1;
+  }
 }
-message Bar {
-  string name = 1;
-}
proto: message field "foo.bar.Foo.abc" is an invalid group: field name must be lowercased form of the message name

So this validation is requiring me to make this look like a proto2 group:

 message Foo {
-  Bar abc = 1 [features.message_encoding = DELIMITED];
+  Abc abc = 1 [features.message_encoding = DELIMITED];
-  message Bar {
+  message Abc {
     string name = 1;
   }
 }

From my readings, features.message_encoding = DELIMITED is a feature flag to enable proto2 group semantics in protobuf editions, and not a replacement encoding for messages.

That is, it makes sense that it enacts proto2 semantics, because the feature is to turn on the proto2-compatible semantics.

@puellanivis, I don't believe that's the correct interpretation. Otherwise, protoc and C++ runtime (the canonical implementation of Editions) would enforce these same constraints.

This section of the docs for the initial edition describes it as only impacting the wire format: https://github.com/protocolbuffers/protobuf/blob/main/docs/design/editions/edition-zero-features.md#featuresmessage_encoding.

And this comment in descriptor.proto suggests the same (emphasis mine):

In Editions, the group wire format can be enabled via the message_encoding feature.

If it were about preserving other aspects of the legacy group feature, I don't think it would have been named "message encoding".

If none of these links are compelling, maybe @mkruskal-google can chime in to correct whichever of us has misinterpreted it.

Well, the example given on the https://protobuf.dev/editions/features/#message_encoding page, as well as the one in your link, follows the constraints you point to. So all the examples meet the proto2 group constraints.

I’m not sure that you’ve demonstrated that the C++ runtime does not also enforce the proto2 constraints. That protoc might not enforce the constraints is, I suppose, in part because we’ve switched from syntactic sugar to an explicit option flag. So adding checks for the option flag into the constraints would be a bit wonky.

The text in the docs I have seen simply does not state anything about other proto2 group constraints. It only describes the feature as a toggle for the wire format. It seems dangerous to assume additional hard constraints just from the shape of examples. The examples look that way because they clearly come from a transformation of proto2 groups -> editions.

As far as the C++ runtime: it and protoc share the same implementation, since protoc is implemented in the C++ runtime. They both use the C++ DescriptorPool as a registry, which also handles validation and transformation of plain descriptor protos into richer descriptor types (akin to protodesc, transforming descriptorpb messages into protoreflect.Descriptor instances).

This file, for example, works fine with protoc. I can generate C++ and Java code:

// test.proto
edition = "2023";
message Foo {
  string name = 1;
}
message Bar {
  Foo field1 = 1 [features.message_encoding = DELIMITED];
}
protoc test.proto --cpp_out=. --java_out=. --experimental_editions

I can even write a simple harness to make sure the generated Java code works and doesn't fail at startup when processing the embedded descriptors:

// Main.java
import java.util.Base64;
public final class Main {
  public static void main(String[] args) {
    Test.Bar b = Test.Bar.newBuilder().build();
  }
}
$ protoc test.proto --java_out=. --experimental_editions
$ javac Main.java Test.java -cp protobuf-java-4.26.0.jar
$ java -cp .:protobuf-java-4.26.0.jar Main

I can also use protoc to interact with a message definition, which uses the C++ protobuf runtime's dynamic message implementation. And it is happy to work with delimited fields that do not otherwise resemble groups:

// proto2.proto
syntax="proto2";
message Proto2 {
  optional group Delimited = 1 {
    optional string str = 2;
  }
}
// editions.proto
edition="2023";
message Editions {
  Editions delimited = 1 [features.message_encoding = DELIMITED];
  string str = 2;
}
$ protoc proto2.proto --encode=Proto2 <<< 'Delimited: { str: "hello" }' > t.bin
$ protoc --experimental_editions editions.proto --decode=Editions <t.bin
Editions {
  str: "hello"
}

If protoc accepts a features.message_encoding = DELIMITED field where the field's message type is not a child of the message containing the field (and wow, is that a complicated clause), then protodesc should as well. I think this is clearly a bug in the editions support in protodesc.

Editions decouple a number of features, which results in a relaxing of what used to be invariants. This is one such case.

It looks like there is one proto2-group related aspect that does come along with the use of delimited message encoding: the text format needs to use the message name instead of the field name, in order to maintain backwards compatibility for proto2 groups that are migrated to editions.

Relevant thread in this other issue: protocolbuffers/protobuf#16239

Thanks for the clarification, and validation against my assumptions. 👍

I thought protoc raised a warning, which is why I implemented it the same way in Go protobuf. If protoc does not (or no longer does) raise a warning about this, then Go protobuf shouldn't either.

Before making any changes related to this, it would likely be best to await the outcome of protocolbuffers/protobuf#16239. It turns out there are other issues related to fields with delimited encoding and proto2-group compatibility. The biggest issue seems to be related to gen-code and how the group type name (not the field name) was used to generate field names (and accessor and mutator methods) in some languages. It's possible that v27.0 of protoc might pivot to what's already implemented here, banning the use of "delimited" except in cases that match proto2 groups, and then relax these constraints in a future edition.

FYI, this appears to be fixed as of protocolbuffers/protobuf-go@a18684d.

Thanks!