google / flatbuffers

FlatBuffers: Memory Efficient Serialization Library

Home Page:https://flatbuffers.dev/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Make property renaming optional

TJKoury opened this issue · comments

I am integrating Flatbuffers with existing data standards, it is mandatory to use the exact property names / namespaces contained in these standards going forward, including all capitalization.

I've hacked in a quick fix to remove MakeCamel, MakeScreamingCamel, and MakeCamelCase, as well as disabling this warning. I can work on a pull that puts disabling these behind a flag, for users in my situation, if it is something that would be accepted.

I would be open to this, but not as a flag, but an attribute in the schema file itself. Something like

SomeSchema.fbs

attribute "keep_naming"

table A {
soMeFunkyNam_ing: int (keep_naming)
}

This way the schema is explicitly opting out of the style guide and clearly states that.

If it was a flag option, I think it would be confusing.

That is an interesting option, and makes the implementation more portable.

I like the idea of an attribute, but I guess I'd need to decorate every struct/table/enum with this (hopefully not every field!). Are file or namespace-scope attributes possible?

I think he’s talking about a schema-level flag, at the top of the file, stating that everything within it should not be altered. This is good, since it improves portability: if I give you a schema, you should be able to reproduce the code without knowing what flags I used.

Wait, why are you closing this issue? Now there's no open issue for this feature request AFAIK since I closed #7128.

Attributes can be applied to fields or types (i.e. struct, table, enum). I don't think they can be attached to a schema presently, but I agree this would be the most useful scope for using this approach.

I agree, let’s keep it open and flesh out an implementation.

Our parser does handle file-level attributes, but that is mostly for use in reflection as the code generators won't know what to do with it. I would probably want to leave that as is.

We also have a special native_include identifier that is top-level, so we could do somethin similar if we had too.

I am thinking we should introduce a pragma concept, as a way to tell the flatc compiler how to generate the code. Thus this feature would be something like:

-- some.fbs

pragma keep_identifier_naming

table inTerestingNaMing {
  s-o-m-e-ODDBALLfield:int;
}

And everything under the pragma (lexical scope) would use the specified naming and not modified it at all.

Individual generators would be free to ignore the pragma command, that way we don't have to implement this feature for all languages at once.

We would also probably need to add a new attribute to override this, something like: default-naming

-- some.fbs

pragma keep_identifier_naming

table inTerestingNaMing {
  s-o-m-e-ODDBALLfield:int (default-naming);
}

And the field tagged with default-naming will always use the language specified default.

Thoughts?

Thoughts?

Sounds good to me.

I like it, though the “default naming” might be overkill / confusing. I envision this pragma to be used for compatibility purposes like my use case, tagging individual fields might lead to some confusion if the override pragma is missing.

IMO, it would be better to just not support global "keep_naming" + opt out logic, since its such a niche use case... its not too hard for a minority of users to type (keep_naming) everywhere they need.

If we do go ahead with pragmas and stuff, the parser should contain this logic so that from the code generators' perspective, its as if the user typed (keep_naming) on all fields.

I also wonder if this feature will expand, maybe people would want to (keep_naming) for some languages and not others? Maybe they'll want to override our ideas of when to use CamelCase vs snake_case... I doubt fine grained configurability is worth the complexity, but if it turns out to be, we should do a little upfront design.

In my opinion, having different conventions per-language sort of defeats the interoperability; language agnostic inputs / outputs like the IDL and the typed JSON Schema output should reflect what you expect no matter what language you use. Otherwise you are stuck trying to translate field names according to a rule set that’s only available inside the binary.

In my opinion, having different conventions per-language sort of defeats the interoperability; language agnostic inputs / outputs like the IDL and the typed JSON Schema output should reflect what you expect no matter what language you use. Otherwise you are stuck trying to translate field names according to a rule set that’s only available inside the binary.

I disagree. The schema is the interoperable part, the individual languages should conform to their own conventions. You don't really have to translate field names with some obscure rules; it just simple formatting changes that should be self-evident.

IMO, it would be better to just not support global "keep_naming" + opt out logic, since its such a niche use case... its not too hard for a minority of users to type (keep_naming) everywhere they need.

That is also fine with me, since it's the easiest thing to do and rationalize.

The global “keep_naming” is good, having to do it for each field would be a nightmare, I have over 4000 fields I need to support.

@dbaileychess:

I disagree. The schema is the interoperable part, the individual languages should conform to their own conventions.
You don't really have to translate field names with some obscure rules; it just simple formatting changes that should be self-evident.

They are simple rules, but this is extra cognitive load that developers shouldn't need to shoulder - they've got enough other stuff to worry about. My argument would be why not honor the schema as-written to the degree possible, at least optionally? I'd like to be able to refer to the nice, small schema definitions when writing code that interfaces with a particular message, and not the generated code to be sure what a field is called. And I'd like this to be consistent across as many languages as possible.

I agree with @TJKoury that having to decorate individual fields for this would be a nightmare. What if it were per-object (enum, struct, table)? This would be less bad, but I would prefer a file-scope setting.

This is largely water under the bridge, but having the code generator mangle names by default strikes me as wholly unnecessary. Are there actually languages which would forbid me from naming a structures with snake_case or fields in a class with CamelCase? If there are, some mangling seems unavoidable, but enforcing it by default, and inconsistently across all languages with no simple opt-out seems like is a choice that is worth revisiting.

I think I am not following what this is needed for yet.

If this is largely about the Python generated code not following the language standards, we should just fix that.

Can someone explain in what context "data standards" would require generated APIs for all languages to ignore language standards and all look uniform? That sounds very niche and generally undesirable. To me, it seems pretty important to use a language's code standards, being able to write the same names across languages seems of much smaller importance to me.

There are lots of code generators out there.. is it common for any of them to allow overriding the language standard? Does Protobuf have an option for it?

If we must have this option, given that is so niche, to me a flatc flag makes more sense than trying to introduce schema-wide attributes (which is also problematic in terms of scoping). Also note, the current attribute keyword declares allowed attributes, it does not set them to any value globally. If for whatever reason we decide we must make it a schema attribute instead, I'd recommend simply making it a table attribute that applies to all fields, to save on typing, rather than "global".

@aardappel With utmost respect, there is no such thing as universal “language standards”. There are language conventions, and those are at best contested, and not enforced except in some extreme edge cases by any compiler or interpreter.

On the other hand, we gigabytes of currently written code running in prod on mission critical systems that refer to the exact properties referenced in that IDL I have written.

Having the names unchanged is the difference between being able to use the generated code or not.

Congrats on the move BTW, excited to see what you are doing in your new venture!

I'd disagree, with the exception of C/C++, all languages supported by FlatBuffers have a very specific standards for their identifiers, defined by the language definition or community. They are not always enforced, but using anything other than these standards would be very unwelcome by almost all users of these languages.

I understand your perspective, which is why I think adding an opt-in pragma is the way to go, as it does not force anyone to change anything, no surprises for current users and no limitations for "special cases". It also allows flexibility for things that are "not always enforced" by "language definitions" or "community".

Thanks for the discussion. I think I am leaning to one of two options:

  1. If you really want the naming to be applied globally, it might be best to do it as a flatc flag option. It would be similar to cpp-field-case-style, but obviously language agnostic.

  2. If you want more control, we could add a new attribute to the schema file that could be applied at the field- or table- level. I think we should hold off on the pragma idea for now, as I think that is adding too much machinery at the moment, and as @CasperN points out is for a niche feature. I think this is a fine compromise at the moment.

Let me know your thoughts.

I have a slight preference for an attribute over a flag, but its not very strong.

I'm mostly concerned with maintainability.

A naive implementation would touch almost everywhere and be a nightmare to maintain. Imo, the "right" way of doing this would involve first refactoring every code generator. Everywhere where we interact with the name of a type, we should instead interact with an accessor class that can apply the local language policy. This class can also look up either a flag value or a magic attribute to override the policy. It would still touch every code generator but at least be more maintainable.

class Namer {
  // Config varies per language.
  struct Config {
    types: CapitalCamelCase;
    methods: SnakeCase;
    constants: ScreamingSnakeCase;
    // ... etc
  };
  Namer(Config);

  std::string Type(Definition&) {
    // Check for magic attribute or flag
    // apply Config
  }
  std::string Method(Definition&);
  std::string Constant(Definition&);
};

I just recently upgraded to 2.0.0 and noticed the case style warnings. I'm not a major contributor to FlatBuffers yet, but I still wanted to comment as a user.

I think it's understandable that flatc defaults to Google code style. However, so many other projects with different C++ code styles are using FlatBuffers. Especially for projects that use the object API, table/field names strongly affect the overall code style of the project.

For example, our internal C++ style guide requires camelCase C++ member names. We also heavily utilize the object API. A normal access to table looks like this in our project: FlatbufferTableT->memberOne which suits our overall code style well.

I think it would be really great if flatc just didn't care about field or table name styles, or let the user specify it.

Alternatively, a global field-case-style flatc option would be really helpful.

The flatc flag would probably be the easiest to implement technically, and transparent to current users. The issue as mentioned above is portability; two users would create different codebases depending on their flags. Adding a schema-wide, top level attribute makes it portable.

A flatc flag has the additional advantage of allowing users to choose which languages to apply this to. There may be users that have use cases for overriding one particular language but not the others. They would not be served by having this feature as attributes.

OK, sounds like a new flag would be ideal. I will deprecate the cpp-field-case-style and transition that over to a new field-case-style and object-case-style flags that can specify the intended output case.

@CasperN I did start on one part of the refactoring, making out our disparate MakeSnake MakeCamel functions use the same common ConvertCase. So your intended refactoring could build off of that.

I will deprecate the cpp-field-case-style and transition that over to a new field-case-style and object-case-style flags that can specify the intended output case.

I think we still need to dive deeper into design

By choosing global overrides, we're telling users with weird casing preferences to invoke flatc one language at a time. This is fine by me but we should explicitly acknowledge the choice. We don't need to support per-field overrides right?

In addition to fields and objects, there are methods, constants, functions, enums, namespaces/modules/packages, and filenames. Maybe even macros and exceptions too, though I don't know any present usage. Presumably we'd eventually grow a flag for all of these?

(bike shedding: I'd prefer the term "type" over "object" since we have an "object API" and not all languages are object oriented)

What if these categories aren't fine grained enough for some style guides, e,.g. are there styles with different casing for global constants and class constants?

What are the arguments to these flags? is it

  • "" (Unspecified: the local code generator chooses)
  • keep (uses whatever is in the schema)
  • kCamelCase
  • smallCamelCase
  • CapitalCamelCase
  • snake_case
  • SCREAMING_SNAKE_CASE

My recommendation is to simply have a preserve-property-name flag. No need for that level of granularity. Any user can simply follow their own conventions, whether or not those conventions are already present in flatc.

My recommendation is to simply have a preserve-property-name flag. No need for that level of granularity. Any user can simply follow their own conventions, whether or not those conventions are already present in flatc.

IMO, preserve-property-name seems too niche to justify by itself. On the other hand, non-google style guides seem a lot more prevalent. "keep casing" can be a simple specialization of the general feature.

I agree in principle, at the same time it is a lot more work since there are infinite style guides and “best practices” to pull from. Seems like a very large task to tackle to prevent a behavior not present in other schema parser/code generators.

One of the more popular code generators (protobufs) has strict styling like we do (https://developers.google.com/protocol-buffers/docs/style#message_and_field_names). @TJKoury do you have examples of other code generators that output custom identifier names so we can see how they approached that problem?

One negative of having user-custom naming in the schema file is, it is practically impossible for us to convert it back to a standard casing (i.e., going from sOm_eWeirdNamE to some_werid_name). Thus all languages that generate code from that schema are forced to use the specified naming. This is the benefit of having a standard naming in the schema, we can always convert some_werid_name into a list of tokens (some weird name) and generate any output style we want (within reason).

@TJKoury what is the casing you want to use?

@TJKoury what is the casing you want to use?

I'm not @TJKoury but what I want is the identifiers in my schema left as-is.

@Bklyn I understand your request, but it might make the most sense to do the conversion from CamelCase/snake_case to the format you need. If it is a non-standard format, I would be curious in knowing the format.

As mentioned above, some languages might not support/ or complain about odd formats. So codifying the odd format in the schema limits portability to an extent.

One of the more popular code generators (protobufs) has strict styling like we do (https://developers.google.com/protocol-buffers/docs/style#message_and_field_names).

FlatBuffers also has a de-facto standard styling (inherited from Google-style C++ since we don't modify the identifiers in C++ codegen) of e.g. snake_case for fields and UpperCamel for types. We should probably specify that more precisely in the docs. If you want to go along with the formatting of your language's standard like most users, you should still be using the FlatBuffers standard in your schema to make sure it works with all languages.

@dbaileychess I simply need the names to be left alone, since I’m integrating into wildly non-standard production code. Generators like this or this or any of the multitude of other generators usually don’t mess with your prop names.

It might just be my inexperience with some of the generators / languages like Lobster, but are there really scenarios where a lexer / parser / interpreter will throw errors if your property names are not snake / camel / etc? In my experience (especially with ECMAScript) almost anything goes.

@aardappel The Google Style Guide for ECMAScript seems to be at odds with the guide for C++. For example the CONSTANT_NAME style breaks the current parser.

Thanks for the links. The issue with those generators is, AFAICT they are for a single language output, so it is much easier to handle (or lack of handling) naming. On our hand, we have to support 13+ different languages, so any solution has to work well for them all.

I still haven't gotten an answer for what type of casing you need to support. From https://spacedatastandards.org/#/code, it looks like you are using SCREAMING_CAPS, which is a casing we already support.

Also, in your original post:

it is mandatory to use the exact property names / namespaces contained in these standards going forward, including all capitalization.

That seems very fragile IMO. A major feature of flatbuffers (and protobufs) is that naming is not that important (unlike in JSON). Two sides of the wire could use different languages with their own naming convention, and the data is still serializable between them.

@aardappel The Google Style Guide for ECMAScript seems to be at odds with the guide for C++. For example the CONSTANT_NAME style breaks the current parser.

That doesn't matter? FlatBuffers follows the Google C++ style, not the JS one. For JS we transform identifiers to fit that style, like we do for all languages.

@dbaileychess Here's one I used recently that does multiple languages. I understand that feature of Flatbuffers and it is very powerful, would not want to take away from the ability to have different properties resolve to the same field. However, for my use case, it is absolutely required.

I will give you a brief example as to why:

If you go to this link, you will see Earth Orientation Parameter (EOP) Data and Space WX data, both in 'Legacy' and 'CSV' code. There are operational systems at NASA, NOAA, JAXA, ESA, etc., not to mention private sector, that consume one or both of these.

You'll notice that the names are completely different for the same data column, "Adj Lst81" vs "F10.7_ADJ_CENTER81".

I work with this data and I still went back and triple-checked it while writing this comment to make sure those are the equivalent headers.

There have been efforts to standardize, through CCSDS, and these fall short.

The generating / parsing tools written by all the agencies across language barriers, governments, corporations, etc., are non-compliant with any standard, and not compliant with each other. Generally there is only a few people in each organization that understand the concepts underpinning a particular property, and they usually are not the people writing the code.

All that to say, if there is a property name that we can get the entire community to agree on, it needs to stay exactly that way, regardless of language. The lead time for CCSDS Blue Book data standards is measured in literal decades, and every message has to be drafted in language-agnostic Key Value Notation (KVN) to be approved, with a working serialization test.

Property names are fixed, and non-negotiable, and do not follow any specific notation, though as you mentioned most of the recent text-based standards do use SCREAMING_CAPS. I did have a problem with Flatbuffers not supporting this in all cases, it generates class files t_h_a_t_l_o_o_k_l_i_k_e_t_h_i_s, because it is finding caps on every letter when converting from camel to snake.

Adding a flag is a nice way to make it convenient for people like me that have a hard requirement and don't want to fork the Flatbuffers library / create a new CI process just for a simple fix. If it is too much of a burden on the community or against the core ethos then I understand.

That doesn't matter? FlatBuffers follows the Google C++ style, not the JS one. For JS we transform identifiers to fit that style, like we do for all languages.

@aardappel That was my point, the JavaScript / TypeScript generated does not follow the Google style guide for that language. So you are intentionally going against best practice / convention / style for that language because it suits your purpose. Which is fine, at the same time it might be a good idea to allow for someone to disagree with you while still using your library.

Not to belabor the point, I have absolutely been in situations where “does the column header have an underscore or not” is the question that decides if critical data is used for a mission.

Often the mission includes expensive hardware and sometimes the lives of astronauts.

Trying to figure out which language was used to create the .csv that was then converted to Excel and translated three times is an exercise in futility, doubly so when the parameters are alphabet soup from 10 different scientific disciplines.

The properties must be identical, full stop.

That quicktype.io seems to be opinionated in casing. At least the example on the front page takes the name "high score" in json and outputs person.highScore in whatever language that is.

I did have a problem with Flatbuffers not supporting this in all cases, it generates class files t_h_a_t_l_o_o_k_l_i_k_e_t_h_i_s, because it is finding caps on every letter when converting from camel to snake.

Yes, that is expected because our code generators assume the schema naming are following the snake_case/CamelCase conventions and uses that to spit out the language-specific naming.

That was my point, the JavaScript / TypeScript generated does not follow the Google style guide for that language. So you are intentionally going against best practice / convention / style for that language because it suits your purpose. Which is fine, at the same time it might be a good idea to allow for someone to disagree with you while still using your library.

Those are bugs in the implementation of those generators. They should be exporting it to the preferred style.

Adding a flag is a nice way to make it convenient for people like me that have a hard requirement and don't want to fork the Flatbuffers library / create a new CI process just for a simple fix. If it is too much of a burden on the community or against the core ethos then I understand.

I think it is reasonable to support this, and my questioning was just trying to understand if we could provide you what you want now, instead of waiting until we refactor stuff to make this possible.

Not to belabor the point, I have absolutely been in situations where “does the column header have an underscore or not” is the question that decides if critical data is used for a mission. Often the mission includes expensive hardware and sometimes the lives of astronauts.

That's scary that lives depend on naming to match fields...

That's scary that lives depend on naming to match fields

It’s terrifying. I have written regular expressions that look like short stories trying to match every possible convolution of a property name.

So you are intentionally going against best practice / convention / style for that language because it suits your purpose.

Don't assume intention. The JS port was made by a variety of people thru the years, and mistakes may have been made. We want to best support whatever the language standard is, and if it isn't, we should fix it.

Also, for C++ we have to pick a specific style guide because there's multiple standards, but for most other languages there is just one, and it doesn't need to be Google flavored.

Don't assume intention.

True, a form of Hanlon’s Razor is at play here.

but for most other languages there is just one

It would be helpful to have references called out for each so we know with what we are striving to comply.

This issue is stale because it has been open 6 months with no activity. Please comment or label not-stale, or this will be closed in 14 days.

Commenting to keep this alive

@Bklyn This is still very relevant and I'm running into it nearly every day trying to get my project off the ground. My hacks stand for now.

Unfortunately I do not have time at the at the moment to go through the source and implement changes.