Make property renaming optional

TJKoury opened this issue 2 years ago · comments

I am integrating FlatBuffers with existing data standards, and it is mandatory to use the exact property names / namespaces contained in these standards going forward, including all capitalization.

I've hacked in a quick fix to remove MakeCamel, MakeScreamingCamel, and MakeCamelCase, as well as disabling this warning. I can work on a pull that puts disabling these behind a flag, for users in my situation, if it is something that would be accepted.

Derek Bailey · Answer 1 · Tue Feb 22 2022 12:00:41 GMT+0800 (China Standard Time)

I would be open to this, not as a flag but as an attribute in the schema file itself. Something like:

SomeSchema.fbs

attribute "keep_naming";

table A {
  soMeFunkyNam_ing: int (keep_naming);
}

This way the schema is explicitly opting out of the style guide and clearly states that.

If it was a flag option, I think it would be confusing.

TJKoury · Answer 2 · Tue Feb 22 2022 17:04:55 GMT+0800 (China Standard Time)

That is an interesting option, and makes the implementation more portable.

Caleb Epstein · Answer 3 · Thu Feb 24 2022 10:38:24 GMT+0800 (China Standard Time)

I like the idea of an attribute, but I guess I'd need to decorate every struct/table/enum with this (hopefully not every field!). Are file or namespace-scope attributes possible?

TJKoury · Answer 4 · Thu Feb 24 2022 13:16:52 GMT+0800 (China Standard Time)

I think he’s talking about a schema-level flag, at the top of the file, stating that everything within it should not be altered. This is good, since it improves portability: if I give you a schema, you should be able to reproduce the code without knowing what flags I used.

Caleb Epstein · Answer 5 · Thu Feb 24 2022 21:50:19 GMT+0800 (China Standard Time)

Wait, why are you closing this issue? Now there's no open issue for this feature request AFAIK since I closed #7128.

Attributes can be applied to fields or types (i.e. struct, table, enum). I don't think they can be attached to a schema presently, but I agree this would be the most useful scope for using this approach.

TJKoury · Answer 6 · Thu Feb 24 2022 23:17:48 GMT+0800 (China Standard Time)

I agree, let’s keep it open and flesh out an implementation.

Derek Bailey · Answer 7 · Fri Feb 25 2022 13:19:39 GMT+0800 (China Standard Time)

Our parser does handle file-level attributes, but that is mostly for use in reflection as the code generators won't know what to do with it. I would probably want to leave that as is.

We also have a special native_include identifier that is top-level, so we could do something similar if we had to.

I am thinking we should introduce a pragma concept, as a way to tell the flatc compiler how to generate the code. Thus this feature would be something like:

-- some.fbs

pragma keep_identifier_naming

table inTerestingNaMing {
  s-o-m-e-ODDBALLfield:int;
}

And everything under the pragma (lexical scope) would use the specified naming and not modify it at all.

Individual generators would be free to ignore the pragma command, that way we don't have to implement this feature for all languages at once.

We would also probably need to add a new attribute to override this, something like: default-naming

-- some.fbs

pragma keep_identifier_naming

table inTerestingNaMing {
  s-o-m-e-ODDBALLfield:int (default-naming);
}

And the field tagged with default-naming will always use the language specified default.

Thoughts?

Caleb Epstein · Answer 8 · Fri Feb 25 2022 23:50:41 GMT+0800 (China Standard Time)

Thoughts?

Sounds good to me.

TJKoury · Answer 9 · Sat Feb 26 2022 00:16:23 GMT+0800 (China Standard Time)

I like it, though the “default naming” might be overkill / confusing. I envision this pragma to be used for compatibility purposes like my use case, tagging individual fields might lead to some confusion if the override pragma is missing.

Casper · Answer 10 · Sat Feb 26 2022 03:09:33 GMT+0800 (China Standard Time)

IMO, it would be better to just not support global "keep_naming" + opt out logic, since it's such a niche use case... it's not too hard for a minority of users to type (keep_naming) everywhere they need.

If we do go ahead with pragmas and stuff, the parser should contain this logic so that from the code generators' perspective, it's as if the user typed (keep_naming) on all fields.
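
A minimal sketch of that idea, under stated assumptions: `Field`, `Table`, `ApplyKeepNamingPragma`, and `KeepsNaming` are hypothetical stand-ins and not flatc's real parser types, and the `keep_naming` / `default-naming` attributes are only the proposals from this thread. The parser expands the file-level pragma once, so generators only ever look at a per-field attribute.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for parsed schema objects.
struct Field {
  std::string name;
  std::set<std::string> attributes;
};

struct Table {
  std::string name;
  std::vector<Field> fields;
};

// If the proposed file-level pragma was seen, tag every field with
// "keep_naming" unless the field opted out with "default-naming".
void ApplyKeepNamingPragma(bool pragma_seen, std::vector<Table> &tables) {
  if (!pragma_seen) return;
  for (Table &table : tables) {
    for (Field &field : table.fields) {
      if (field.attributes.count("default-naming") == 0) {
        field.attributes.insert("keep_naming");
      }
    }
  }
}

// The only question a code generator would need to ask.
bool KeepsNaming(const Field &field) {
  return field.attributes.count("keep_naming") != 0;
}
```

With this shape, a generator that ignores the attribute keeps working unchanged, which matches the earlier point that individual generators could adopt the feature one at a time.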

I also wonder if this feature will expand, maybe people would want to (keep_naming) for some languages and not others? Maybe they'll want to override our ideas of when to use CamelCase vs snake_case... I doubt fine grained configurability is worth the complexity, but if it turns out to be, we should do a little upfront design.

TJKoury · Answer 11 · Sat Feb 26 2022 03:27:40 GMT+0800 (China Standard Time)

In my opinion, having different conventions per-language sort of defeats the interoperability; language agnostic inputs / outputs like the IDL and the typed JSON Schema output should reflect what you expect no matter what language you use. Otherwise you are stuck trying to translate field names according to a rule set that’s only available inside the binary.

Derek Bailey · Answer 12 · Sat Feb 26 2022 04:24:00 GMT+0800 (China Standard Time)

In my opinion, having different conventions per-language sort of defeats the interoperability; language agnostic inputs / outputs like the IDL and the typed JSON Schema output should reflect what you expect no matter what language you use. Otherwise you are stuck trying to translate field names according to a rule set that’s only available inside the binary.

I disagree. The schema is the interoperable part, the individual languages should conform to their own conventions. You don't really have to translate field names with some obscure rules; it's just simple formatting changes that should be self-evident.

IMO, it would be better to just not support global "keep_naming" + opt out logic, since it's such a niche use case... it's not too hard for a minority of users to type (keep_naming) everywhere they need.

That is also fine with me, since it's the easiest thing to do and rationalize.

TJKoury · Answer 13 · Sat Feb 26 2022 04:50:47 GMT+0800 (China Standard Time)

The global “keep_naming” is good, having to do it for each field would be a nightmare, I have over 4000 fields I need to support.

Caleb Epstein · Answer 14 · Sat Feb 26 2022 06:19:55 GMT+0800 (China Standard Time)

@dbaileychess:

I disagree. The schema is the interoperable part, the individual languages should conform to their own conventions.
You don't really have to translate field names with some obscure rules; it's just simple formatting changes that should be self-evident.

They are simple rules, but this is extra cognitive load that developers shouldn't need to shoulder - they've got enough other stuff to worry about. My argument would be why not honor the schema as-written to the degree possible, at least optionally? I'd like to be able to refer to the nice, small schema definitions when writing code that interfaces with a particular message, and not the generated code to be sure what a field is called. And I'd like this to be consistent across as many languages as possible.

I agree with @TJKoury that having to decorate individual fields for this would be a nightmare. What if it were per-object (enum, struct, table)? This would be less bad, but I would prefer a file-scope setting.

This is largely water under the bridge, but having the code generator mangle names by default strikes me as wholly unnecessary. Are there actually languages which would forbid me from naming a structure with snake_case or fields in a class with CamelCase? If there are, some mangling seems unavoidable, but enforcing it by default, and inconsistently across all languages with no simple opt-out, seems like a choice that is worth revisiting.

Wouter van Oortmerssen · Answer 15 · Sat Feb 26 2022 06:42:11 GMT+0800 (China Standard Time)

I think I am not following what this is needed for yet.

If this is largely about the Python generated code not following the language standards, we should just fix that.

Can someone explain in what context "data standards" would require generated APIs for all languages to ignore language standards and all look uniform? That sounds very niche and generally undesirable. To me, it seems pretty important to use a language's code standards, being able to write the same names across languages seems of much smaller importance to me.

There are lots of code generators out there.. is it common for any of them to allow overriding the language standard? Does Protobuf have an option for it?

If we must have this option, given that it is so niche, to me a flatc flag makes more sense than trying to introduce schema-wide attributes (which is also problematic in terms of scoping). Also note, the current attribute keyword declares allowed attributes, it does not set them to any value globally. If for whatever reason we decide we must make it a schema attribute instead, I'd recommend simply making it a table attribute that applies to all fields, to save on typing, rather than "global".

TJKoury · Answer 16 · Sat Feb 26 2022 06:57:30 GMT+0800 (China Standard Time)

@aardappel With utmost respect, there is no such thing as universal “language standards”. There are language conventions, and those are at best contested, and not enforced except in some extreme edge cases by any compiler or interpreter.

On the other hand, we have gigabytes of currently written code running in prod on mission critical systems that refer to the exact properties referenced in that IDL I have written.

Having the names unchanged is the difference between being able to use the generated code or not.

Congrats on the move BTW, excited to see what you are doing in your new venture!

Wouter van Oortmerssen · Answer 17 · Sat Feb 26 2022 07:25:29 GMT+0800 (China Standard Time)

I'd disagree. With the exception of C/C++, all languages supported by FlatBuffers have very specific standards for their identifiers, defined by the language definition or community. They are not always enforced, but using anything other than these standards would be very unwelcome by almost all users of these languages.

TJKoury · Answer 18 · Sat Feb 26 2022 08:12:45 GMT+0800 (China Standard Time)

I understand your perspective, which is why I think adding an opt-in pragma is the way to go, as it does not force anyone to change anything, no surprises for current users and no limitations for "special cases". It also allows flexibility for things that are "not always enforced" by "language definitions" or "community".

Derek Bailey · Answer 19 · Tue Mar 01 2022 00:02:07 GMT+0800 (China Standard Time)

Thanks for the discussion. I think I am leaning to one of two options:

1. If you really want the naming to be applied globally, it might be best to do it as a flatc flag option. It would be similar to cpp-field-case-style, but obviously language agnostic.
2. If you want more control, we could add a new attribute to the schema file that could be applied at the field or table level. I think we should hold off on the pragma idea for now, as I think that is adding too much machinery at the moment, and as @CasperN points out is for a niche feature. I think this is a fine compromise at the moment.

Let me know your thoughts.

Casper · Answer 20 · Tue Mar 01 2022 05:30:08 GMT+0800 (China Standard Time)

I have a slight preference for an attribute over a flag, but its not very strong.

I'm mostly concerned with maintainability.

A naive implementation would touch almost everywhere and be a nightmare to maintain. Imo, the "right" way of doing this would involve first refactoring every code generator. Everywhere where we interact with the name of a type, we should instead interact with an accessor class that can apply the local language policy. This class can also look up either a flag value or a magic attribute to override the policy. It would still touch every code generator but at least be more maintainable.

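enum class Case { kCapitalCamelCase, kSnakeCase, kScreamingSnakeCase /* ... etc */ };

class Namer {
 public:
  // Config varies per language.
  struct Config {
    Case types = Case::kCapitalCamelCase;
    Case methods = Case::kSnakeCase;
    Case constants = Case::kScreamingSnakeCase;
    // ... etc
  };
  explicit Namer(Config config);

  // Each accessor checks for the magic attribute or flag, then applies Config.
  std::string Type(const Definition &def);
  std::string Method(const Definition &def);
  std::string Constant(const Definition &def);
};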
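
One way the accessor idea above might be fleshed out is sketched below. Everything in it is an assumption made for illustration: the `Definition` stand-in, the `keep_naming` and `keep_all` fields, and the `SnakeToUpperCamel` helper are hypothetical and are not flatc's real types or functions. It also assumes schema names arrive in snake_case.

```cpp
#include <cctype>
#include <string>

enum class Case { kKeep, kSnake, kUpperCamel };

// Hypothetical stand-in for a parsed schema definition.
struct Definition {
  std::string name;
  bool keep_naming = false;  // would be set by a magic attribute, if one existed
};

// Converts a snake_case identifier to UpperCamel.
std::string SnakeToUpperCamel(const std::string &in) {
  std::string out;
  bool upper_next = true;
  for (char c : in) {
    if (c == '_') {
      upper_next = true;
      continue;
    }
    out += upper_next
               ? static_cast<char>(std::toupper(static_cast<unsigned char>(c)))
               : c;
    upper_next = false;
  }
  return out;
}

class Namer {
 public:
  // Config varies per language.
  struct Config {
    Case types = Case::kUpperCamel;
    Case methods = Case::kSnake;
    bool keep_all = false;  // would be set by a global flag
  };
  explicit Namer(const Config &config) : config_(config) {}

  std::string Type(const Definition &def) const {
    return Apply(def, config_.types);
  }
  std::string Method(const Definition &def) const {
    return Apply(def, config_.methods);
  }

 private:
  // Overrides win first; otherwise the per-language policy is applied.
  std::string Apply(const Definition &def, Case c) const {
    if (config_.keep_all || def.keep_naming || c == Case::kKeep) return def.name;
    if (c == Case::kUpperCamel) return SnakeToUpperCamel(def.name);
    return def.name;  // kSnake: input is assumed to be snake_case already
  }
  Config config_;
};
```

Because every generator would go through `Type` or `Method`, the override check lives in one place instead of being repeated at each call site, which is the maintainability argument made above.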

Anthony V. Ozdemir · Answer 21 · Tue Mar 01 2022 06:04:57 GMT+0800 (China Standard Time)

I just recently upgraded to 2.0.0 and noticed the case style warnings. I'm not a major contributor to FlatBuffers yet, but I still wanted to comment as a user.

I think it's understandable that flatc defaults to Google code style. However, so many other projects with different C++ code styles are using FlatBuffers. Especially for projects that use the object API, table/field names strongly affect the overall code style of the project.

For example, our internal C++ style guide requires camelCase C++ member names. We also heavily utilize the object API. A normal access to a table looks like this in our project: FlatbufferTableT->memberOne, which suits our overall code style well.

I think it would be really great if flatc just didn't care about field or table name styles, or let the user specify it.

Alternatively, a global field-case-style flatc option would be really helpful.

TJKoury · Answer 22 · Tue Mar 01 2022 07:32:41 GMT+0800 (China Standard Time)

The flatc flag would probably be the easiest to implement technically, and transparent to current users. The issue as mentioned above is portability; two users would create different codebases depending on their flags. Adding a schema-wide, top level attribute makes it portable.

Wouter van Oortmerssen · Answer 23 · Tue Mar 01 2022 07:40:44 GMT+0800 (China Standard Time)

A flatc flag has the additional advantage of allowing users to choose which languages to apply this to. There may be users that have use cases for overriding one particular language but not the others. They would not be served by having this feature as attributes.

Derek Bailey · Answer 24 · Tue Mar 01 2022 09:16:58 GMT+0800 (China Standard Time)

OK, sounds like a new flag would be ideal. I will deprecate the cpp-field-case-style and transition that over to new field-case-style and object-case-style flags that can specify the intended output case.

@CasperN I did start on one part of the refactoring, making our disparate MakeSnake and MakeCamel functions use the same common ConvertCase. So your intended refactoring could build off of that.

Casper · Answer 25 · Wed Mar 02 2022 01:00:58 GMT+0800 (China Standard Time)

I will deprecate the cpp-field-case-style and transition that over to new field-case-style and object-case-style flags that can specify the intended output case.

I think we still need to dive deeper into design

By choosing global overrides, we're telling users with weird casing preferences to invoke flatc one language at a time. This is fine by me but we should explicitly acknowledge the choice. We don't need to support per-field overrides right?

In addition to fields and objects, there are methods, constants, functions, enums, namespaces/modules/packages, and filenames. Maybe even macros and exceptions too, though I don't know any present usage. Presumably we'd eventually grow a flag for all of these?

(bike shedding: I'd prefer the term "type" over "object" since we have an "object API" and not all languages are object oriented)

What if these categories aren't fine grained enough for some style guides, e.g. are there styles with different casing for global constants and class constants?

What are the arguments to these flags? is it

"" (Unspecified: the local code generator chooses)
keep (uses whatever is in the schema)
kCamelCase
smallCamelCase
CapitalCamelCase
snake_case
SCREAMING_SNAKE_CASE
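
As a rough sketch of how such flag values might be parsed: the spellings come from the list above, but the `CaseStyle` enum and `ParseCaseStyle` function are hypothetical names, and none of this is an existing flatc option.

```cpp
#include <string>

// Output cases floated in the thread; the enumerator names are illustrative.
enum class CaseStyle {
  kUnspecified,  // "": the local code generator chooses
  kKeep,         // "keep": use whatever is in the schema
  kKCamel,
  kLowerCamel,
  kUpperCamel,
  kSnake,
  kScreamingSnake
};

// Parses a hypothetical flag value. Returns false for unknown spellings so
// the caller can report an error instead of silently picking a default.
bool ParseCaseStyle(const std::string &value, CaseStyle *out) {
  struct Entry {
    const char *spelling;
    CaseStyle style;
  };
  static const Entry kEntries[] = {
      {"", CaseStyle::kUnspecified},
      {"keep", CaseStyle::kKeep},
      {"kCamelCase", CaseStyle::kKCamel},
      {"smallCamelCase", CaseStyle::kLowerCamel},
      {"CapitalCamelCase", CaseStyle::kUpperCamel},
      {"snake_case", CaseStyle::kSnake},
      {"SCREAMING_SNAKE_CASE", CaseStyle::kScreamingSnake},
  };
  for (const Entry &entry : kEntries) {
    if (value == entry.spelling) {
      *out = entry.style;
      return true;
    }
  }
  return false;
}
```

Treating "keep" as just another value of the same enum is what would make "keep casing" a specialization of the general feature rather than a separate flag.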

TJKoury · Answer 26 · Wed Mar 02 2022 01:06:55 GMT+0800 (China Standard Time)

My recommendation is to simply have a preserve-property-name flag. No need for that level of granularity. Any user can simply follow their own conventions, whether or not those conventions are already present in flatc.

Casper · Answer 27 · Wed Mar 02 2022 01:15:26 GMT+0800 (China Standard Time)

My recommendation is to simply have a preserve-property-name flag. No need for that level of granularity. Any user can simply follow their own conventions, whether or not those conventions are already present in flatc.

IMO, preserve-property-name seems too niche to justify by itself. On the other hand, non-google style guides seem a lot more prevalent. "keep casing" can be a simple specialization of the general feature.

TJKoury · Answer 28 · Wed Mar 02 2022 02:04:13 GMT+0800 (China Standard Time)

I agree in principle, at the same time it is a lot more work since there are infinite style guides and “best practices” to pull from. Seems like a very large task to tackle to prevent a behavior not present in other schema parser/code generators.

Derek Bailey · Answer 29 · Thu Mar 03 2022 01:16:09 GMT+0800 (China Standard Time)

One of the more popular code generators (protobufs) has strict styling like we do (https://developers.google.com/protocol-buffers/docs/style#message_and_field_names). @TJKoury do you have examples of other code generators that output custom identifier names so we can see how they approached that problem?

One negative of having user-custom naming in the schema file is that it is practically impossible for us to convert it back to a standard casing (i.e., going from sOm_eWeirdNamE to some_weird_name). Thus all languages that generate code from that schema are forced to use the specified naming. This is the benefit of having a standard naming in the schema: we can always convert some_weird_name into a list of tokens (some weird name) and generate any output style we want (within reason).
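
The token argument can be illustrated with a small sketch. The function names `Tokenize`, `ToLowerCamel`, and `ToScreamingSnake` are hypothetical and are not flatc's ConvertCase; the sketch assumes the schema name is snake_case, with underscores as the only word boundary.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Splits an identifier on underscores into lowercase word tokens.
std::vector<std::string> Tokenize(const std::string &name) {
  std::vector<std::string> tokens;
  std::string current;
  for (char c : name) {
    if (c == '_') {
      if (!current.empty()) tokens.push_back(current);
      current.clear();
    } else {
      current += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
  }
  if (!current.empty()) tokens.push_back(current);
  return tokens;
}

// Joins tokens as lowerCamelCase.
std::string ToLowerCamel(const std::vector<std::string> &tokens) {
  std::string out;
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    std::string token = tokens[i];
    if (i > 0 && !token.empty()) {
      token[0] = static_cast<char>(
          std::toupper(static_cast<unsigned char>(token[0])));
    }
    out += token;
  }
  return out;
}

// Joins tokens as SCREAMING_SNAKE_CASE.
std::string ToScreamingSnake(const std::vector<std::string> &tokens) {
  std::string out;
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    if (i > 0) out += '_';
    for (char c : tokens[i]) {
      out += static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
  }
  return out;
}
```

Given some_weird_name, `Tokenize` yields three tokens and either output style can be rebuilt from them. Given sOm_eWeirdNamE, it yields only two tokens (som, eweirdname), so the intended word boundaries are lost, which is the conversion problem described above.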

@TJKoury what is the casing you want to use?

Caleb Epstein · Answer 30 · Thu Mar 03 2022 01:40:29 GMT+0800 (China Standard Time)

@TJKoury what is the casing you want to use?

I'm not @TJKoury but what I want is the identifiers in my schema left as-is.

Derek Bailey · Answer 31 · Thu Mar 03 2022 01:47:21 GMT+0800 (China Standard Time)

@Bklyn I understand your request, but it might make the most sense to do the conversion from CamelCase/snake_case to the format you need. If it is a non-standard format, I would be curious to know the format.

As mentioned above, some languages might not support odd formats, or might complain about them. So codifying the odd format in the schema limits portability to an extent.

Wouter van Oortmerssen · Answer 32 · Thu Mar 03 2022 02:41:37 GMT+0800 (China Standard Time)

One of the more popular code generators (protobufs) has strict styling like we do (https://developers.google.com/protocol-buffers/docs/style#message_and_field_names).

FlatBuffers also has a de-facto standard styling (inherited from Google-style C++ since we don't modify the identifiers in C++ codegen) of e.g. snake_case for fields and UpperCamel for types. We should probably specify that more precisely in the docs. If you want to go along with the formatting of your language's standard like most users, you should still be using the FlatBuffers standard in your schema to make sure it works with all languages.

TJKoury · Answer 33 · Thu Mar 03 2022 03:05:47 GMT+0800 (China Standard Time)

@dbaileychess I simply need the names to be left alone, since I’m integrating into wildly non-standard production code. Generators like this or this or any of the multitude of other generators usually don’t mess with your prop names.

It might just be my inexperience with some of the generators / languages like Lobster, but are there really scenarios where a lexer / parser / interpreter will throw errors if your property names are not snake / camel / etc? In my experience (especially with ECMAScript) almost anything goes.

TJKoury · Answer 34 · Thu Mar 03 2022 03:11:55 GMT+0800 (China Standard Time)

@aardappel The Google Style Guide for ECMAScript seems to be at odds with the guide for C++. For example the CONSTANT_NAME style breaks the current parser.

Derek Bailey · Answer 35 · Thu Mar 03 2022 03:18:32 GMT+0800 (China Standard Time)

Thanks for the links. The issue with those generators is that, AFAICT, they are for a single language output, so it is much easier to handle (or not handle) naming. On our end, we have to support 13+ different languages, so any solution has to work well for them all.

I still haven't gotten an answer for what type of casing you need to support. From https://spacedatastandards.org/#/code, it looks like you are using SCREAMING_CAPS, which is a casing we already support.

Also, in your original post:

it is mandatory to use the exact property names / namespaces contained in these standards going forward, including all capitalization.

That seems very fragile IMO. A major feature of flatbuffers (and protobufs) is that naming is not that important (unlike in JSON). Two sides of the wire could use different languages with their own naming convention, and the data is still serializable between them.

Wouter van Oortmerssen · Answer 36 · Thu Mar 03 2022 03:51:29 GMT+0800 (China Standard Time)

@aardappel The Google Style Guide for ECMAScript seems to be at odds with the guide for C++. For example the CONSTANT_NAME style breaks the current parser.

That doesn't matter? FlatBuffers follows the Google C++ style, not the JS one. For JS we transform identifiers to fit that style, like we do for all languages.

TJKoury · Answer 37 · Thu Mar 03 2022 05:17:54 GMT+0800 (China Standard Time)

@dbaileychess Here's one I used recently that does multiple languages. I understand that feature of Flatbuffers and it is very powerful, would not want to take away from the ability to have different properties resolve to the same field. However, for my use case, it is absolutely required.

I will give you a brief example as to why:

If you go to this link, you will see Earth Orientation Parameter (EOP) Data and Space WX data, both in 'Legacy' and 'CSV' code. There are operational systems at NASA, NOAA, JAXA, ESA, etc., not to mention private sector, that consume one or both of these.

You'll notice that the names are completely different for the same data column, "Adj Lst81" vs "F10.7_ADJ_CENTER81".

I work with this data and I still went back and triple-checked it while writing this comment to make sure those are the equivalent headers.

There have been efforts to standardize, through CCSDS, and these fall short.

The generating / parsing tools written by all the agencies across language barriers, governments, corporations, etc., are non-compliant with any standard, and not compliant with each other. Generally there are only a few people in each organization who understand the concepts underpinning a particular property, and they usually are not the people writing the code.

All that to say, if there is a property name that we can get the entire community to agree on, it needs to stay exactly that way, regardless of language. The lead time for CCSDS Blue Book data standards is measured in literal decades, and every message has to be drafted in language-agnostic Key Value Notation (KVN) to be approved, with a working serialization test.

Property names are fixed, and non-negotiable, and do not follow any specific notation, though as you mentioned most of the recent text-based standards do use SCREAMING_CAPS. I did have a problem with Flatbuffers not supporting this in all cases, it generates class files t_h_a_t_l_o_o_k_l_i_k_e_t_h_i_s, because it is finding caps on every letter when converting from camel to snake.
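
The failure mode described above can be reproduced with a deliberately naive converter. This is an illustrative sketch, not flatc's actual conversion code: `NaiveCamelToSnake` and `GuardedCamelToSnake` are hypothetical names, and the guard shown is only one possible workaround.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Naive conversion: every uppercase letter is treated as the start of a word.
std::string NaiveCamelToSnake(const std::string &in) {
  std::string out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    unsigned char c = static_cast<unsigned char>(in[i]);
    if (std::isupper(c)) {
      if (i > 0 && out.back() != '_') out += '_';
      out += static_cast<char>(std::tolower(c));
    } else {
      out += static_cast<char>(c);
    }
  }
  return out;
}

// Returns true if the identifier contains at least one lowercase letter.
bool HasLowercase(const std::string &in) {
  for (char c : in) {
    if (std::islower(static_cast<unsigned char>(c))) return true;
  }
  return false;
}

// Possible guard: an identifier with no lowercase letters is assumed to be
// SCREAMING_CAPS already, so it is only lowercased, never split.
std::string GuardedCamelToSnake(const std::string &in) {
  if (HasLowercase(in)) return NaiveCamelToSnake(in);
  std::string out;
  for (char c : in) {
    out += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  }
  return out;
}
```

Here `NaiveCamelToSnake("THATLOOKLIKETHIS")` returns t_h_a_t_l_o_o_k_l_i_k_e_t_h_i_s, while the guarded version returns thatlooklikethis. The guard still changes the case of the name, so it does not satisfy a requirement that names stay exactly as written.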

Adding a flag is a nice way to make it convenient for people like me that have a hard requirement and don't want to fork the Flatbuffers library / create a new CI process just for a simple fix. If it is too much of a burden on the community or against the core ethos then I understand.

That doesn't matter? FlatBuffers follows the Google C++ style, not the JS one. For JS we transform identifiers to fit that style, like we do for all languages.

@aardappel That was my point, the JavaScript / TypeScript generated does not follow the Google style guide for that language. So you are intentionally going against best practice / convention / style for that language because it suits your purpose. Which is fine, at the same time it might be a good idea to allow for someone to disagree with you while still using your library.

TJKoury · Answer 38 · Thu Mar 03 2022 06:00:39 GMT+0800 (China Standard Time)

Not to belabor the point, I have absolutely been in situations where “does the column header have an underscore or not” is the question that decides if critical data is used for a mission.

Often the mission includes expensive hardware and sometimes the lives of astronauts.

Trying to figure out which language was used to create the .csv that was then converted to Excel and translated three times is an exercise in futility, doubly so when the parameters are alphabet soup from 10 different scientific disciplines.

The properties must be identical, full stop.

Derek Bailey · Answer 39 · Thu Mar 03 2022 06:03:18 GMT+0800 (China Standard Time)

That quicktype.io seems to be opinionated in casing. At least the example on the front page takes the name "high score" in json and outputs person.highScore in whatever language that is.

I did have a problem with Flatbuffers not supporting this in all cases, it generates class files t_h_a_t_l_o_o_k_l_i_k_e_t_h_i_s, because it is finding caps on every letter when converting from camel to snake.

Yes, that is expected because our code generators assume the schema names follow the snake_case/CamelCase conventions and use that to spit out the language-specific naming.

That was my point, the JavaScript / TypeScript generated does not follow the Google style guide for that language. So you are intentionally going against best practice / convention / style for that language because it suits your purpose. Which is fine, at the same time it might be a good idea to allow for someone to disagree with you while still using your library.

Those are bugs in the implementation of those generators. They should be exporting it to the preferred style.

Adding a flag is a nice way to make it convenient for people like me that have a hard requirement and don't want to fork the Flatbuffers library / create a new CI process just for a simple fix. If it is too much of a burden on the community or against the core ethos then I understand.

I think it is reasonable to support this, and my questioning was just trying to understand if we could provide you what you want now, instead of waiting until we refactor stuff to make this possible.

Not to belabor the point, I have absolutely been in situations where “does the column header have an underscore or not” is the question that decides if critical data is used for a mission. Often the mission includes expensive hardware and sometimes the lives of astronauts.

That's scary that lives depend on naming to match fields...

TJKoury · Answer 40 · Thu Mar 03 2022 06:09:26 GMT+0800 (China Standard Time)

That's scary that lives depend on naming to match fields

It’s terrifying. I have written regular expressions that look like short stories trying to match every possible convolution of a property name.

Wouter van Oortmerssen · Answer 41 · Thu Mar 03 2022 06:13:59 GMT+0800 (China Standard Time)

So you are intentionally going against best practice / convention / style for that language because it suits your purpose.

Don't assume intention. The JS port was made by a variety of people thru the years, and mistakes may have been made. We want to best support whatever the language standard is, and if it isn't, we should fix it.

Also, for C++ we have to pick a specific style guide because there's multiple standards, but for most other languages there is just one, and it doesn't need to be Google flavored.

TJKoury · Answer 42 · Thu Mar 03 2022 06:19:23 GMT+0800 (China Standard Time)

Don't assume intention.

True, a form of Hanlon’s Razor is at play here.

but for most other languages there is just one

It would be helpful to have references called out for each so we know with what we are striving to comply.

github-actions · Answer 43 · Sat Mar 04 2023 09:04:45 GMT+0800 (China Standard Time)

This issue is stale because it has been open 6 months with no activity. Please comment or label not-stale, or this will be closed in 14 days.

Caleb Epstein · Answer 44 · Mon Mar 06 2023 05:35:54 GMT+0800 (China Standard Time)

Commenting to keep this alive

TJKoury · Answer 45 · Sat Mar 18 2023 00:45:27 GMT+0800 (China Standard Time)

@Bklyn This is still very relevant and I'm running into it nearly every day trying to get my project off the ground. My hacks stand for now.

Unfortunately I do not have time at the moment to go through the source and implement changes.