google / flatbuffers

FlatBuffers: Memory Efficient Serialization Library

Home Page: https://flatbuffers.dev/

RFC: Centralize naming convention

CasperN opened this issue · comments

Currently, the various code generators reinvent their own ways to escape keywords and recase identifiers to fit with local style.
DRY style concerns aside, the manual approach might not be working so well given #7137, #7138, #7139, #7140, and #7141. #7111 discusses overriding our casing conventions which would require touching almost everywhere, but would be easy given centralized naming.

In #7111 (comment) I outline a class that can be used to centralize naming conventions. I think something like:

class CaseManager {  // Feel free to bikeshed this name
 public:
  struct Config {
    Case types;
    Case constants;
    Case methods;
    Case functions;
    Case fields;
    Case variants;
    Case namespaces;
    Case file_nodes;
    // most languages escape keywords by appending "_" while csharp prepends "@"
    std::string keyword_prefix;
    std::string keyword_suffix;
    std::string object_prefix;
    std::string object_suffix;
    std::string namespace_seperator;
  };
  CaseManager(Config config, std::set<std::string> keywords)
    : config_(config), keywords_(std::move(keywords)) {}

  // Formats `d.name` as a type, then escapes it if it's a keyword.
  std::string Type(const Definition& d) const;
  // Formats `d.name` as a constant, then escapes it if it's a keyword.
  std::string Constant(const Definition& d) const;
  // analogous const methods for Method, Function, Field, Variant, etc

  // Formats `d.name` as a type, then adds prefix and suffix, then escapes it if it's a keyword.
  std::string ObjectType(const Definition& d) const;
 private:
  Config config_;
  std::set<std::string> keywords_;
};
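A minimal sketch of the "recase, then escape" order that the comments on Type() describe. The reduced `Case` enum, `NamingSketch`, `SnakeToUpperCamel`, and the decision to take a plain string instead of a `Definition` are all assumptions made for illustration, not the proposed API.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <string>

// Hypothetical, reduced Case enum; the proposal leaves the full set open.
enum class Case { kKeep, kUpperCamel };

// Converts snake_case to UpperCamelCase. Assumes the input is snake_case.
std::string SnakeToUpperCamel(const std::string &in) {
  std::string out;
  bool upper_next = true;
  for (unsigned char c : in) {
    if (c == '_') {
      upper_next = true;
      continue;
    }
    out += static_cast<char>(upper_next ? std::toupper(c) : c);
    upper_next = false;
  }
  return out;
}

// Hypothetical stand-in for CaseManager, limited to type names.
struct NamingSketch {
  Case types;
  std::string keyword_prefix;
  std::string keyword_suffix;
  std::set<std::string> keywords;

  // Wraps `name` in the configured prefix/suffix only if it is a keyword.
  std::string EscapeKeyword(const std::string &name) const {
    return keywords.count(name) ? keyword_prefix + name + keyword_suffix : name;
  }

  // Recases first, then escapes the recased result.
  std::string Type(const std::string &name) const {
    assert(!name.empty());
    return EscapeKeyword(types == Case::kUpperCamel ? SnakeToUpperCamel(name)
                                                    : name);
  }
};
```

With a "_" suffix and `Self` in the keyword set, `Type("self")` would come out as `Self_`, because the keyword check runs on the recased name rather than on the schema name.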

I think of it similarly to data access objects (DAOs) except it's for applying policies to our identifiers. This would be a pretty substantial change as it raises the abstraction level from "string manipulation code in every code generator" to "configuration per code generator", but I think it will help simplify our systems and make them more maintainable. What do we think?

@aardappel @dbaileychess @vglavnyy @krojew (is there an @all-maintainers?)

TODO:

Convert code generators

  • idl_gen_cpp
  • idl_gen_csharp
  • idl_gen_dart
  • idl_gen_go
  • idl_gen_grpc
  • idl_gen_java
  • #7198
  • idl_gen_lobster
  • idl_gen_lua
  • idl_gen_php
  • idl_gen_python
  • idl_gen_rust
  • idl_gen_swift
  • idl_gen_ts
  • idl_gen_json_schema
  • idl_gen_fbs
  • idl_gen_text

other stuff

  • Incorporate CommentConfig
  • Deprecate BaseGenerator::Namespace

Also, imo, case_fmt_.Type(x) tells me more about the semantics of the generated code than ToUpperCamelCase(x)

Testing mention @flatbuffers/maintainers

Thanks @CasperN, I think this would be a good utility class to have.

Class name might be NamingManager or something, since it does more than just case manipulations.

I would make overloads for each of the methods that take in a const std::string& as well.

How do we handle:

  1. Namespacing? Since these are often part of the name but may have different styling and separators (::, ., etc..)
  2. Paths? I saw some cases in the code where we are changing the case of a path (for generating an include file), which led to some odd rules.
  3. Global/language overrides? Would we switch out the CaseManager that we pass to the code generator or would we set an override flag in the CaseManager?

Yes, something like this would refactor the existing code, and centralize "how to make something camelcase" etc in one place. Then any overriding flags are suddenly simple to implement.

Each language would instantiate a default.

Generally sounds good to me? Of course a fair bit of work to do this, and I am not that convinced that these overriding options are that desirable, but I don't feel strongly either way.

I would make overloads for each of the methods that take in a const std::string& as well.

In the long run, it's not clear to me that we should be using strings everywhere. There's a semantic difference between a fully qualified type, ns_a.ns_b.ns_c.TableA; the type, TableA, and its namespace, ns_a.ns_b.ns_c; and maybe these should be different types. We use a ton of strings and lose some type system protections for it...

However, that would be a different mass refactoring, so I agree that we should support const std::string& overloads in the meantime.

Namespacing? Since these are often part of the name but may have different styling and separators (::, ., etc..)

It seems reasonable to me. I can add Case namespaces and std::string namespace_seperator to Config.
Note that there's some namespace manipulation built into the CodeGenerator base class, which we'd also have to rip out.

Paths? I saw some cases in the code where we are changing the case of a path (for generating an include file), which led to some odd rules.

sgtm, I can add Case filenodes; (until filenames and directories aren't the same). We're using a subset of characters that's portable between unix-likes and windows and I think our lower layer file APIs handle / vs \ so we can act as if we're all on linux.

Global/language overrides? Would we switch out the CaseManager that we pass to the code generator or would we set an override flag in the CaseManager?

I was thinking we'd have a factory fn per language. flatc.cpp would

  1. invoke the factory e.g. RustDefaultNameManager
  2. do some manipulation based on flags, e.g. call SetObjectPrefix
  3. then pass it to the code generator
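The three steps above could be sketched roughly as follows. `RustDefaultNameManager` and `SetObjectPrefix` are the names used in this comment; `NameConfig`, `NameManager`, `GenerateObjectTypeName`, the "T" suffix default, and the `object_prefix_flag` parameter are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Reduced stand-in for the proposed Config; only the object affixes are kept.
struct NameConfig {
  std::string object_prefix;
  std::string object_suffix;
};

class NameManager {
 public:
  explicit NameManager(NameConfig config) : config_(std::move(config)) {}

  // Step 2 hook: lets flatc adjust a language default from a flag.
  void SetObjectPrefix(std::string prefix) {
    config_.object_prefix = std::move(prefix);
  }

  std::string ObjectType(const std::string &type_name) const {
    assert(!type_name.empty());
    return config_.object_prefix + type_name + config_.object_suffix;
  }

 private:
  NameConfig config_;
};

// Step 1: a per-language factory. The "T" suffix is an assumed default.
NameManager RustDefaultNameManager() {
  return NameManager(NameConfig{"", "T"});
}

// Steps 1-3 in order. `object_prefix_flag` is a hypothetical flag value;
// an empty string means the flag was not set.
std::string GenerateObjectTypeName(const std::string &object_prefix_flag,
                                   const std::string &type_name) {
  NameManager namer = RustDefaultNameManager();
  if (!object_prefix_flag.empty()) namer.SetObjectPrefix(object_prefix_flag);
  return namer.ObjectType(type_name);  // stands in for the code generator
}
```

Keeping the override as a mutation on the default means a code generator only ever sees one fully configured object and never has to consult flags itself.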

Of course a fair bit of work to do this, and I am not that convinced that these overriding options are that desirable, but I don't feel strongly either way.

Fair enough, I am 25% motivated by the case overriding FR, and 75% motivated by simplification/consistency

I'm not sure where the best place for this is, but there also tends to be a bit of statefulness that can be needed to account for some things. E.g., I encountered a situation where for typescript if I created a .fbs with:

enum foobar {
 EnumA,
}
table TableName {
  foobar:int;
  enum_member:foobar;
}

The generated code ends up with a constructor for TableNameT that looks like:

export class TableNameT {
constructor(
  public foobar: number = 0,
  public enum_member: foobar = foobar.EnumA,
){}

which results in a name conflict because within the context of the constructor, foobar refers to the variable, not the enum type. I imagine that there are similar types of pitfalls that must be tracked in other situations. The typescript codegen currently also has an AddImport call which imports types by their name, and if it notices that you are attempting to import multiple types with the same name, it will prepend the name of one of them with the namespace. I assume that these sorts of stateful, scope-based conflicts can also be encountered in other languages. Although whether this should be resolved by this proposal, by some other proposal, or just by better code design (e.g., the typescript codegen could always import with the namespace prepended, since it's just an internal thing; I bet there's some similar workaround for the object-API constructor issue), I do not know.

That's a good point, and there's a similar shadowing issue in #6845.

Although whether this should be resolved by this proposal, by some other proposal, or just by better code design (e.g., the typescript codegen could always import with the namespace prepended, since it's just an internal thing; I bet there's some similar workaround for the object-API constructor issue), I do not know.

I think always using fully qualified types might be the way to go... but does that actually solve your problem if foobar and TableName are both in the root namespace?

Fully qualified names don't resolve the issue I was noticing--but it could simplify the logic around the already-existing code to handle this with imports. To resolve what I was noticing, maybe there's some fancy typescript syntax to help (despite mostly talking about typescript here I don't actually develop in typescript much...); alternatively, you could enforce a naming convention for the generated members of the object API that guarantees that they will be different from any actual type names, but I don't know if that is feasible or desirable, and keeping track of what conventions everything is generated to follow to guarantee that they don't overlap sounds cumbersome.

#7143 is another casing related bug: Two fields are different in the schema but, due to case-style normalization, generate the same symbol, which is invalid.
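A rough sketch of how this class of bug could be detected before code is emitted: recase every schema name and report any two that map to the same generated symbol. `ToUpperCamel` and `FindCollisions` are hypothetical helpers, and the converter is a simplified one that only treats "_" as a word break.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Simplified recasing: uppercases the first letter and each letter after "_".
std::string ToUpperCamel(const std::string &in) {
  std::string out;
  bool upper_next = true;
  for (unsigned char c : in) {
    if (c == '_') {
      upper_next = true;
      continue;
    }
    out += static_cast<char>(upper_next ? std::toupper(c) : c);
    upper_next = false;
  }
  return out;
}

// Returns (first schema name, later schema name) pairs whose generated
// symbols are identical after recasing.
std::vector<std::pair<std::string, std::string>> FindCollisions(
    const std::vector<std::string> &schema_names) {
  std::map<std::string, std::string> first_seen;  // generated -> schema name
  std::vector<std::pair<std::string, std::string>> collisions;
  for (const std::string &name : schema_names) {
    auto result = first_seen.emplace(ToUpperCamel(name), name);
    if (!result.second) collisions.emplace_back(result.first->second, name);
  }
  return collisions;
}
```

For the fields `foo_bar` and `fooBar` this reports one collision, since both recase to `FooBar`.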

Looks like @tira-misu found

  • #7148 : Another example requiring context aware escaping, due to field-name/type-name collision, like @jkuszmaul's example above
  • #7149 : a reason to not escape keyword identifiers in TypeScript

@dbaileychess would you like to claim some subset of code generators to refactor? I don't mind doing it myself, but help is appreciated.

I'll take cpp/c sharp/lua/json schema/fbs/text

More relevant stuff #7156.

What's with the recent surge in this family of bugs? 🤔

More thoughts:

  • There's some facilities built into BaseGenerator e.g. WrapInNamespace, NamespaceDir, and CommentConfig which are similar to Namer.
    • These should be moved into Namer so it can also be used by Grpc generators and Bfbs generators which do not inherit from the base generator... though inheritance would've been nice so we won't have to use namer_. as a prefix everywhere.
  • Namer::Config is getting long. It could be split up but it doesn't feel too complex (to me) since the class is still declarative and stateless.
  • Having Namer "think" in terms of std::string& is really annoying since I have to add .name everywhere. I think I'll have to add those overloads.
    $ rg -o "namer_.\w+\(\w+.name\)" src/ | cut -d : -f 2  | sort | uniq -c | sort -nr | head
    36 namer_.Type(struct_def.name)
    33 namer_.Field(field.name)
    32 namer_.Method(field.name)
    21 namer_.Variable(field.name)
    17 namer_.Function(field.name)
    11 namer_.Type(enum_def.name)
    10 namer_.Variable(struct_def.name)
     6 namer_.ObjectType(struct_def.name)
  • It's unfortunate that ConvertCase requires an input case since we're not generally very careful about what that should be. I made the default input case in namer.h CamelCase since it worked in Rust, which I started with, but it turns out that's not consistent and has to be configured. (sigh).
  • In some languages, many kinds of symbols are all the same case style. I have to manually check if namer_.Variable vs namer_.Field should be used because if I made that mistake... they have the same style so the generated code is identical. There may be latent bugs that only appear when users try to customize the style. We don't yet have tests that can capture this.
  • As expected, this effort is revealing a lot of subtle inconsistent edge cases that accumulated across many contributors, languages, and years.
    • There's inconsistent silliness like escaping keywords before converting case which has to be configurable
    • The style of the code generation varies a lot and how that aligns to Namer's style-guide-inspired kinds of symbols is not always clear.
    • I'll try to label what I think should be deprecated but changing the generated code is a breaking change for users so concentrating the technical debt into Namer and its config might be okay long term.
    • We're not consistent about usage of CodeWriter or simple strings, how indentation is handled... there's a lot of simplification work that can happen - though it's unclear to me what the 'forcing function' will be to make us do that.

Aside from addressing the keyword escaping bugs and enabling non-google style guides, I think this effort will at least make a dent in making the codebase more consistent

There's some facilities built into BaseGenerator e.g. WrapInNamespace, NamespaceDir, and CommentConfig which are similar to Namer.
These should be moved into Namer so it can also be used by Grpc generators and Bfbs generators which do not inherit from the base generator... though inheritance would've been nice so we won't have to use namer_. as a prefix everywhere.

SG

Namer::Config is getting long. It could be split up but it doesn't feel too complex (to me) since the class is still declarative and stateless.

We can tackle that later.

Having Namer "think" in terms of std::string& is really annoying since I have to add .name everywhere. I think I'll have to add those overloads.
$ rg -o "namer_.\w+\(\w+.name\)" src/ | cut -d : -f 2 | sort | uniq -c | sort -nr | head
36 namer_.Type(struct_def.name)
33 namer_.Field(field.name)
32 namer_.Method(field.name)
21 namer_.Variable(field.name)
17 namer_.Function(field.name)
11 namer_.Type(enum_def.name)
10 namer_.Variable(struct_def.name)
6 namer_.ObjectType(struct_def.name)

Can you make it a template that takes anything that has a std::string name field?

It's unfortunate that ConvertCase requires an input case since we're not generally very careful about what that should be. I made the default input case in namer.h CamelCase since it worked in Rust, which I started with, but it turns out that's not consistent and has to be configured. (sigh).

Well, it needs some way to know where the word breaks occur in the input string in order to convert it to all the other cases. I noticed this complexity too when switching to ConvertCase. It's painful now, but I think it is for the better.

In some languages, many kinds of symbols are all the same case style. I have to manually check if namer_.Variable vs namer_.Field should be used because if I made that mistake... they have the same style so the generated code is identical. There may be latent bugs that only appear when users try to customize the style. We don't yet have tests that can capture this.

Yes, our coverage is spotty and that makes refactorings harder.

One way you can test it is to modify each function (Variable and Field) to append some magic string and see where it shows up in the generated files.
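A small sketch of that magic-string check, assuming a toy generator. `MarkedNamer`, its marker fields, and `GenerateGetter` are hypothetical; the point is only that two methods with identical casing become distinguishable once each appends its own marker.

```cpp
#include <cassert>
#include <string>

// Toy namer where Variable and Field share a case style, so their output is
// identical unless the markers are set. Markers stay empty in production.
struct MarkedNamer {
  std::string variable_marker;
  std::string field_marker;

  std::string Variable(const std::string &name) const {
    return name + variable_marker;
  }
  std::string Field(const std::string &name) const {
    return name + field_marker;
  }
};

// Toy generator: binds a local variable to a field of the same schema name.
std::string GenerateGetter(const MarkedNamer &namer, const std::string &name) {
  return "let " + namer.Variable(name) + " = self." + namer.Field(name) + ";";
}
```

With markers `__VAR` and `__FIELD`, the generated line for `hp` reads `let hp__VAR = self.hp__FIELD;`, so a swapped call would be visible in a diff of the generated files.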

As expected, this effort is revealing a lot of subtle inconsistent edge cases that accumulated across many contributors, languages, and years.

Yep, I hear you.

There's inconsistent silliness like escaping keywords before converting case which has to be configurable

Since we are escaping correctly in the namer, can we just remove all explicit escaping in the generators? If there is a diff in the output, then those are probably real things we fix.

The style of the code generation varies a lot and how that aligns to Namer's style-guide-inspired kinds of symbols is not always clear.

I'll try to label what I think should be deprecated but changing the generated code is a breaking change for users so concentrating the technical debt into Namer and its config might be okay long term.

Yes, I would try to move all the odd cases into Namer so they are in one location. I am fine with even having methods like name_.Variable_overrideForRust() that have custom logic to preserve the formatting it currently uses. That way it's documented by functions that we can refactor easily in the future.

We're not consistent about usage of CodeWriter or simple strings, how indentation is handled... there's a lot of simplification work that can happen - though it's unclear to me what the 'forcing function' will be to make us do that.

I would worry about those things later, let's just focus on the naming aspect.

Aside from addressing the keyword escaping bugs and enabling non-google style guides, I think this effort will at least make a dent in making the codebase more consistent

100%

Can you make it a template that takes anything that has a std::string name field?

That would work though I think I'd prefer to be more specific, e.g. EnumVariant(const EnumDef&, const EnumVal&) so we couldn't stick in a StructField and StructDef - I'm willing to be more verbose for type safety.
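The two options can be compared side by side in a sketch. The definition structs here are empty stand-ins for the real IDL types, `NamerSketch` and `Format` are hypothetical, and the "." separator in `EnumVariant` is an assumption.

```cpp
#include <cassert>
#include <string>

// Stand-ins for the IDL definition types; the real ones carry far more state.
struct StructDef { std::string name; };
struct FieldDef { std::string name; };
struct EnumDef { std::string name; };
struct EnumVal { std::string name; };

class NamerSketch {
 public:
  // Generic option: accepts any type with a std::string `name` member.
  template <typename T>
  std::string Type(const T &def) const {
    return Format(def.name);
  }

  // Specific option: only an enum and one of its values are accepted, so
  // passing a StructDef and a FieldDef fails to compile.
  std::string EnumVariant(const EnumDef &e, const EnumVal &v) const {
    return Format(e.name) + "." + Format(v.name);  // "." is assumed
  }

 private:
  // Placeholder for recasing and keyword escaping.
  static std::string Format(const std::string &name) {
    assert(!name.empty());
    return name;
  }
};
```

The template is shorter, but it will happily format a `FieldDef` as a type; the specific overload costs one declaration per kind of symbol and rejects that mistake at compile time.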

Since we are escaping correctly in the namer, can we just remove all explicit escaping in the generators? If there is a diff in the output, then those are probably real things we fix.
...
Yes, I would try to move all the odd cases into Namer so they are in one location.

std::transform(name.begin(), name.end(), name.begin(), CharToLower);

Here's a fun one. Swift escapes keywords after converting case, which is great (I just need to add a config flag for this), but the casing is weird; it turns the variant name to lowercase before converting it to camel case. I guess I have to add a LegacySwiftVariant method to encapsulate this.
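A sketch of what such a `LegacySwiftVariant` might look like, based only on the behaviour described here (lowercase everything, then convert to camel case). The local `CharToLower` and `SnakeToLowerCamel` are simplified stand-ins, and keyword escaping is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Local stand-in for the CharToLower helper used in the snippet above.
char CharToLower(char c) {
  return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Simplified snake_case to lowerCamelCase: uppercases letters that follow "_".
std::string SnakeToLowerCamel(const std::string &in) {
  std::string out;
  bool upper_next = false;
  for (unsigned char c : in) {
    if (c == '_') {
      upper_next = true;
      continue;
    }
    out += static_cast<char>(upper_next ? std::toupper(c) : c);
    upper_next = false;
  }
  return out;
}

// Lowercases the whole name first, then recases. Any word breaks that were
// only marked by capital letters are lost in the first step.
std::string LegacySwiftVariant(std::string name) {
  std::transform(name.begin(), name.end(), name.begin(), CharToLower);
  return SnakeToLowerCamel(name);
}
```

Under these assumptions `RED_APPLE` becomes `redApple`, while `RedApple` collapses to `redapple`, which is the kind of quirk a dedicated legacy method would have to preserve.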

As expected, this effort is revealing a lot of subtle inconsistent edge cases that accumulated across many contributors, languages, and years.

I can imagine! This has grown very organically, in many cases people finding spots in a language codegen that missed the right name generation long after the initial port.. nice to finally have that all centralized and cleaned up!

TS done with #7488

After upgrading to the latest FlatBuffers version, I noticed that some generated names in Go were changed and I bisected that to #7150. In particular, field names like FooID remained FooID before but become FooId now. Admittedly, we were not following the style guide at https://google.github.io/flatbuffers/flatbuffers_guide_writing_schema.html and we should have used foo_id instead of FooID in the schema.

Our company has been using FlatBuffers for Go since 2016 and this is the first time AFAIK that we get a backwards incompatibility in the generated code, so this feels like a major issue for us. So I'm wondering if you consider this a regression to be fixed or is it just "tough luck, you should have followed the style guide"?

Since we don't want to change all our own code which uses FlatBuffers, we'll probably work around this change by changing the field names in the schema to get the output that we want (foo_i_d does become FooID in Go). I read some issues and discussions here, #7111 would probably help us depending on how that turns out.

@jdemeyer this would be considered a bug, the namer work I started is not intended to be backwards incompatible, just standardize existing behavior which varies across languages by making it defined by the NamerConfig that you see in #7150. Unfortunately, by Hyrum's law, this is a really challenging task.

I think this might be resolvable by making ConvertCase(foo_string, input_case, output_case) first check if foo_string is valid output_case and do nothing in that situation. This might affect other generated code in unexpected ways (again, Hyrum's law).
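A minimal sketch of that short-circuit, limited to UpperCamelCase output. `LooksLikeUpperCamel` and `ToUpperCamelIfNeeded` are hypothetical names, the validity check is deliberately loose (capital first letter, no underscores), and the fallback converter is a simplified one rather than the real ConvertCase.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Simplified fallback: uppercases the first letter and each letter after "_".
std::string SnakeToUpperCamel(const std::string &in) {
  std::string out;
  bool upper_next = true;
  for (unsigned char c : in) {
    if (c == '_') {
      upper_next = true;
      continue;
    }
    out += static_cast<char>(upper_next ? std::toupper(c) : c);
    upper_next = false;
  }
  return out;
}

// Loose check: starts with a capital letter and contains no underscores.
bool LooksLikeUpperCamel(const std::string &s) {
  return !s.empty() && std::isupper(static_cast<unsigned char>(s[0])) &&
         s.find('_') == std::string::npos;
}

// Leaves names alone when they already look like the output case.
std::string ToUpperCamelIfNeeded(const std::string &s) {
  return LooksLikeUpperCamel(s) ? s : SnakeToUpperCamel(s);
}
```

A schema name such as `FooID` passes the check and is returned untouched, while `foo_id` still goes through the converter. How strict the check should be is an open question, since a looser check changes less existing output but also lets more unconventional names through.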

@dbaileychess unfortunately, I don't really have the 20% time to finish this effort at the moment so you might want to reassign this one.

Part of the problem is that the Go naming convention is UpperCamelCase with acronyms in all caps (for example ServeHTTP) and there is no real way to indicate that in snake_case: you would need to write serve_HTTP. While that looks odd at first sight, we could add a new Case type kUpperCamelACRONYM which would keep the all-caps parts in all-caps. I'm mostly just brainstorming here, not saying that this is a good idea or worth the effort.
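To make the brainstormed `kUpperCamelACRONYM` idea concrete, here is a sketch contrasting it with a normalizing conversion. `JoinWords`, `UpperCamelNormalized`, and `UpperCamelAcronym` are hypothetical helpers; the assumption is that the schema author writes the acronym in capitals, as in `serve_HTTP`.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Joins "_"-separated words, capitalizing each word's first letter. When
// `lower_rest` is true the remaining letters are lowercased; otherwise they
// are kept exactly as written in the schema.
std::string JoinWords(const std::string &in, bool lower_rest) {
  std::string out;
  bool word_start = true;
  for (unsigned char c : in) {
    if (c == '_') {
      word_start = true;
      continue;
    }
    if (word_start) {
      out += static_cast<char>(std::toupper(c));
    } else {
      out += static_cast<char>(lower_rest ? std::tolower(c) : c);
    }
    word_start = false;
  }
  return out;
}

// Normalizing UpperCamelCase: acronyms lose their capitals.
std::string UpperCamelNormalized(const std::string &in) {
  return JoinWords(in, /*lower_rest=*/true);
}

// Acronym-preserving variant: all-caps parts stay all-caps.
std::string UpperCamelAcronym(const std::string &in) {
  return JoinWords(in, /*lower_rest=*/false);
}
```

Under these assumptions `serve_HTTP` becomes `ServeHttp` with the normalizing variant and `ServeHTTP` with the acronym-preserving one.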