CCN: A path to True Nullability Schema

captbaritone opened this issue a year ago

TL;DR I see a path where CCN’s ? could be leveraged by smart clients to safely expose the true resolver-nullability of fields directly to product code.

Prelude: CCN Behavior Definition

Since Client Controlled Nullability (CCN) may have a different meaning to different people, I’ll start by specifying my hope for how CCN will work. The rest of this post assumes this behavior:

Under CCN, the ! and ? annotations would allow the query to override the schema nullability of a field within a selection for the purposes of the execution of that selection.

! means: Treat this field, in this selection, as if it were non-nullable
? means: Treat this field, in this selection, as nullable

In other words, for the purposes of executing a query selection, every place the spec refers to a field’s schema nullability, it would instead refer to the field’s nullability within the selection, which may or may not have been modified by CCN annotations in the query. Beyond that, all error handling and null bubbling behaviors of the current spec would be unchanged. Note that this includes the fact that errors thrown by a ? field would still be included in the response errors metadata.
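To make these assumed semantics concrete, here is a minimal TypeScript sketch of a toy executor in which each field is pre-wired to a resolver. FieldNode, effectiveNonNull, and execute are hypothetical names invented for illustration, not graphql-js APIs. The only change from today's null bubbling is that a field's nullability is looked up through its CCN annotation first:

```typescript
// Hypothetical toy model, not graphql-js: every field carries its schema
// nullability, an optional CCN annotation, and a resolver that returns or throws.
interface FieldNode {
  name: string;
  schemaNonNull: boolean;
  annotation?: "!" | "?";
  resolve: () => unknown;
  selections?: FieldNode[]; // present for object-typed fields
}

interface ResponseError {
  path: string[];
  message: string;
}

// Nullability of the field *within this selection*: a CCN annotation wins,
// otherwise the schema's nullability applies.
function effectiveNonNull(field: FieldNode): boolean {
  if (field.annotation === "!") return true;
  if (field.annotation === "?") return false;
  return field.schemaNonNull;
}

// Executes one selection set with today's null bubbling. Returns null when a
// field that is non-null in this selection ended up null, which tells the
// caller to null out the enclosing object.
function executeSelection(
  fields: FieldNode[],
  path: string[],
  errors: ResponseError[],
): Record<string, unknown> | null {
  const out: Record<string, unknown> = {};
  let bubble = false;
  for (const field of fields) {
    const fieldPath = path.concat(field.name);
    let value: unknown = null;
    try {
      const raw = field.resolve();
      if (raw === null || raw === undefined) {
        if (effectiveNonNull(field)) {
          errors.push({ path: fieldPath, message: "Cannot return null for non-nullable field" });
        }
      } else if (field.selections) {
        value = executeSelection(field.selections, fieldPath, errors); // null if a child bubbled
      } else {
        value = raw;
      }
    } catch (e) {
      // Resolver errors are always reported, including those from `?` fields.
      errors.push({ path: fieldPath, message: e instanceof Error ? e.message : String(e) });
    }
    if (value === null && effectiveNonNull(field)) bubble = true;
    out[field.name] = value;
  }
  return bubble ? null : out;
}

function execute(fields: FieldNode[]): { data: Record<string, unknown> | null; errors: ResponseError[] } {
  const errors: ResponseError[] = [];
  return { data: executeSelection(fields, [], errors), errors };
}

// Schema assumed for the example: viewer: User!, User.name: String! (its resolver fails).
function viewerQuery(nameAnnotation?: "!" | "?"): FieldNode[] {
  return [
    {
      name: "viewer",
      schemaNonNull: true,
      resolve: () => ({}),
      selections: [
        {
          name: "name",
          schemaNonNull: true,
          annotation: nameAnnotation,
          resolve: () => {
            throw new Error("name resolver failed");
          },
        },
      ],
    },
  ];
}

console.log(JSON.stringify(execute(viewerQuery()).data)); // null
console.log(JSON.stringify(execute(viewerQuery("?")).data)); // {"viewer":{"name":null}}
```

With name? the error is still reported in errors, but only name is nulled; without it, the null bubbles through the non-null viewer and wipes out the whole data object.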

Client-defined resiliency

GraphQL’s current recommended approach to providing response resiliency in the face of resolver errors is to make fields in the schema nullable by default. Unfortunately, this has the effect of obscuring the true nullability of fields. Clients, and even users, can’t tell from the schema alone if null is expected as a possible value, or if the field will only return null in exceptional (error) cases.

In this world of nullable-by-default schemas, the Client Controlled Nullability (CCN) proposal is primarily a tool to add assertions via !. While this is a marked ergonomic improvement, these assertions must be added blindly, without knowing if null is an expected value or not. This is at best awkward and at worst dangerous.

However, CCN’s ? opens up the possibility of a different mechanism for achieving request resiliency, one which avoids obscuring the true nullability of fields. Specifically, an approach where we shift the expectation from “it is the server/schema’s responsibility to make requests resilient to errors by typing fields as nullable” to “it is the client/query’s responsibility to make requests resilient to errors by annotating non-nullable fields with ?”.

With this approach to resiliency, the schema could specify the “true” nullability of the fields.

For simplistic clients, e.g. curl, the client/user can now see the true nullability of each field in the schema and add the appropriate amount of resilience for their use case using CCN’s ?. In a sense this is the same as CCN’s ! applied to a fully nullable schema, in that the client is empowered to declare which fields it can manage without, and which fields it requires. We’ve “simply” inverted the default. Of course, defaults are tremendously powerful and this tradeoff should be considered carefully. See “The power of defaults” below.
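The inverted default can be sketched as two annotation strategies, assuming a client that knows which fields it cannot render without. FieldIntent and both annotate functions are hypothetical. For fields that are truly non-null, both strategies end up with the same effective nullability; they only disagree when a client asserts ! on a field where null is a legitimate value, which is the blind assertion described earlier:

```typescript
type Annotation = "!" | "?" | undefined;

// Hypothetical description of one field from a client's point of view.
interface FieldIntent {
  trulyNonNull: boolean; // what the resolver actually guarantees
  clientRequires: boolean; // true if this client cannot render without the field
}

// Nullable-by-default schema: every field is declared nullable, so the client
// opts in to strictness by asserting `!` on the fields it requires.
function annotateForNullableSchema(f: FieldIntent): Annotation {
  return f.clientRequires ? "!" : undefined;
}

// True-nullability schema: the client opts out of strictness by writing `?`
// on the non-null fields it can manage without.
function annotateForTrueSchema(f: FieldIntent): Annotation {
  return f.trulyNonNull && !f.clientRequires ? "?" : undefined;
}

function effectiveNonNull(declaredNonNull: boolean, annotation: Annotation): boolean {
  if (annotation === "!") return true;
  if (annotation === "?") return false;
  return declaredNonNull;
}

// Effective nullability the client ends up with under each schema style.
function compare(f: FieldIntent): { nullableSchema: boolean; trueSchema: boolean } {
  return {
    nullableSchema: effectiveNonNull(false, annotateForNullableSchema(f)),
    trueSchema: effectiveNonNull(f.trulyNonNull, annotateForTrueSchema(f)),
  };
}

console.log(JSON.stringify(compare({ trulyNonNull: true, clientRequires: false })));
// {"nullableSchema":false,"trueSchema":false}
console.log(JSON.stringify(compare({ trulyNonNull: false, clientRequires: true })));
// {"nullableSchema":true,"trueSchema":false}
```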

The opportunity for smart clients

For smart clients, this approach can not only let users “see” the true nullability, it can actually let product code interact with generated types that model this true nullability. I see this as a fundamental solution to the actual problem that CCN initially set out to solve.

If smart clients can transform errored fields into contained thrown exceptions, that would mean product code should never encounter a null value due to a resolver error. In that case, the types that the smart client generates for its fragments/queries could safely express the true nullability of those fields on the server.

This is something that we are actively exploring for Relay. I’d encourage you to read the linked issue, but in short, rather than containing errors with null bubbling, we contain errors with error boundaries. For cases where the user wants to imperatively handle the error case, they may add a @catch directive to the field, which behaves very similarly to CCN’s ? and would hopefully someday be subsumed by it.
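A rough sketch of what intercepting errored fields could look like in a smart client's read path, assuming the response's errors carry paths. readField and readFieldCatching are hypothetical helpers, not Relay APIs; the catching variant only loosely mirrors the @catch behavior described above by returning the error as a value instead of throwing it:

```typescript
type Path = (string | number)[];

interface ResponseError {
  message: string;
  path?: Path;
}

type Result<T> = { ok: true; value: T } | { ok: false; error: Error };

function isPrefix(prefix: Path, path: Path): boolean {
  return prefix.length <= path.length && prefix.every((seg, i) => seg === path[i]);
}

// Hypothetical read helper: if the server reported an error at this field or at
// any ancestor, throw instead of handing product code a null, so the error is
// contained by whatever catches it (an error boundary, or readFieldCatching).
function readField(data: unknown, errors: ResponseError[], path: Path): unknown {
  const hits = errors.filter((e) => e.path !== undefined && isPrefix(e.path, path));
  if (hits.length > 0) {
    throw new Error(`${path.join(".")}: ${hits[0].message}`);
  }
  let current: unknown = data;
  for (const seg of path) {
    if (current === null || current === undefined) return current; // a legitimately null ancestor
    current = (current as Record<string, unknown>)[seg];
  }
  return current;
}

// Loosely mirrors the @catch behavior described above: the error comes back
// as a value for the component to handle locally instead of being thrown.
function readFieldCatching(data: unknown, errors: ResponseError[], path: Path): Result<unknown> {
  try {
    return { ok: true, value: readField(data, errors, path) };
  } catch (e) {
    return { ok: false, error: e instanceof Error ? e : new Error(String(e)) };
  }
}

// Response from a server that did not bubble the error from viewer.name.
const response = {
  data: { viewer: { id: "4", name: null } },
  errors: [{ message: "name resolver failed", path: ["viewer", "name"] }],
};

console.log(readField(response.data, response.errors, ["viewer", "id"])); // 4
console.log(readFieldCatching(response.data, response.errors, ["viewer", "name"]).ok); // false
```

Because errored positions never reach product code as plain nulls, the types generated for viewer.name could stay non-null.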

Note that compiler-based smart clients like Relay transform the queries/fragments defined by the user before sending them to the server. This means Relay can auto-insert ?s on all non-nullable fields, ensuring resiliency is the default behavior and we will always render as much of the UI as possible, given the data that the server was able to send.

So, Relay would use ? in two different ways:

As a hidden implementation detail used to ask the server to not apply null bubbling
As a user-facing feature to allow components to locally handle errors instead of relying on error boundaries
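The first, hidden use of ? might look like the following compiler pass over a simplified selection tree. SelectionNode, FieldDef, Schema, and insertResiliency are hypothetical shapes invented for illustration and are not Relay's compiler IR:

```typescript
// Hypothetical, heavily simplified shapes; not Relay's compiler IR.
interface SelectionNode {
  field: string;
  annotation?: "!" | "?";
  children?: SelectionNode[];
}

interface FieldDef {
  nonNull: boolean;
  type: string; // named type of the field, used to look up child fields
}

type Schema = { [typeName: string]: { [fieldName: string]: FieldDef } };

// Compiler pass: add `?` to every field the schema declares non-null, so the
// server never bubbles nulls for this operation. Annotations the author wrote
// (for example a user-facing `?`) are left untouched.
function insertResiliency(selections: SelectionNode[], parentType: string, schema: Schema): SelectionNode[] {
  return selections.map((sel): SelectionNode => {
    const def = schema[parentType][sel.field];
    const annotation: SelectionNode["annotation"] =
      sel.annotation !== undefined ? sel.annotation : def.nonNull ? "?" : undefined;
    return {
      field: sel.field,
      annotation,
      children: sel.children ? insertResiliency(sel.children, def.type, schema) : undefined,
    };
  });
}

// Prints selections in GraphQL-like syntax.
function printSelections(selections: SelectionNode[]): string {
  return selections
    .map((s) => s.field + (s.annotation ?? "") + (s.children ? ` { ${printSelections(s.children)} }` : ""))
    .join(" ");
}

const schema: Schema = {
  Query: { viewer: { nonNull: true, type: "User" } },
  User: { name: { nonNull: true, type: "String" }, bio: { nonNull: false, type: "String" } },
};
const authored: SelectionNode[] = [{ field: "viewer", children: [{ field: "name" }, { field: "bio" }] }];

console.log(printSelections(insertResiliency(authored, "Query", schema))); // viewer? { name? bio }
```

Since the pass consults the schema, it only covers fields that are non-null at compile time.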

A pattern not a feature

One appealing aspect of this vision is that it’s simply composed from existing, or at least proposed, GraphQL spec primitives. It does not require any additional spec changes, and can be optionally adopted by those who find it a good tradeoff.

Appendix/Caveats

This solution is not a silver bullet. It may not be viable for other clients, and even for Relay there are significant challenges that would need to be solved first. I propose it here more as a long-term vision than as an immediate next step. Here is a list of concerns, caveats, and complicating factors:

Missing Data

In Relay, there are actually two reasons that we type all fields as optional:

The field might return null due to error
The field might be missing due to normalization

To make Relay fields non-nullable by default, we’ll need to first provide a mechanism for Relay to handle refetching (or erroring) in the face of missing data. I believe missing data is a fundamental gap in Relay today and is deserving of a project to resolve that gap.
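One way to separate the two reasons is to read fields through a state type that keeps "null", "error", and "missing" apart, assuming a normalized record that also stores per-field errors. StoreRecord, readRecordField, and readNonNullField are hypothetical and not Relay's store API:

```typescript
// Hypothetical normalized-store record; not Relay's store API.
interface StoreRecord {
  fields: { [fieldName: string]: unknown }; // values written during normalization
  errors: { [fieldName: string]: string }; // per-field error messages from the response
}

type FieldState =
  | { kind: "value"; value: unknown }
  | { kind: "null" }
  | { kind: "error"; message: string }
  | { kind: "missing" };

function has(obj: object, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// Keeps apart the cases that today all collapse into a nullable type.
function readRecordField(record: StoreRecord | undefined, field: string): FieldState {
  if (record === undefined) return { kind: "missing" }; // record never fetched
  if (has(record.errors, field)) return { kind: "error", message: record.errors[field] };
  if (!has(record.fields, field)) return { kind: "missing" }; // e.g. only fetched by another query
  const value = record.fields[field];
  return value === null ? { kind: "null" } : { kind: "value", value };
}

// For a field that is non-null in the schema, only "value" may reach product
// code; missing data needs a refetch (or an error) rather than a null.
function readNonNullField(record: StoreRecord | undefined, field: string): unknown {
  const state = readRecordField(record, field);
  switch (state.kind) {
    case "value":
      return state.value;
    case "error":
      throw new Error(state.message);
    case "missing":
      throw new Error(`${field} is missing from the store; a refetch would be needed`);
    case "null":
      throw new Error(`${field} is null despite being non-null in the schema`);
  }
}

const userRecord: StoreRecord = { fields: { id: "4", name: "Ada" }, errors: { bio: "timeout" } };
console.log(readRecordField(userRecord, "name").kind); // value
console.log(readRecordField(userRecord, "bio").kind); // error
console.log(readRecordField(userRecord, "email").kind); // missing
```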

The power of defaults

Shifting responsibility from the server to the client makes it harder to enforce the best practice of resiliency. Opinionated smart client frameworks may be able to take over the role of enforcing resiliency by auto-inserting ?s, but the story for simplistic clients is less clear.

Users will instinctively take the path of least resistance. If adding resiliency is extra work that is not forced upon them by the server or a client framework, it is likely that client code will tend not to go the extra mile to handle potential errors.

Error boundaries

This approach is dependent upon having a client architecture that allows product code to contain errors thrown during render. React Error Boundaries provide this primitive, but client architectures without such a feature may not have a clear path to adding explicit error handling, which is a necessary ingredient for this approach to work.
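Outside React, the primitive being relied on is containment of errors thrown during render. A framework-free sketch, where withBoundary and readViewerName are hypothetical stand-ins for an error boundary and for a data read whose field errored on the server:

```typescript
type Render = () => string;

// A framework-free stand-in for an error boundary: errors thrown while rendering
// the wrapped region are contained and replaced by a fallback, so the rest of
// the page still renders.
function withBoundary(render: Render, fallback: (error: Error) => string): Render {
  return () => {
    try {
      return render();
    } catch (e) {
      return fallback(e instanceof Error ? e : new Error(String(e)));
    }
  };
}

// Hypothetical data read that throws because the server reported an error for this field.
function readViewerName(): string {
  throw new Error("viewer.name errored");
}

const profile = withBoundary(
  () => `Hello, ${readViewerName()}`,
  (error) => `[profile unavailable: ${error.message}]`,
);
const renderPage: Render = () => ["<header>", profile(), "<footer>"].join(" ");

console.log(renderPage()); // <header> [profile unavailable: viewer.name errored] <footer>
```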

Even in Relay, explicit error handling has not yet been validated, though we hope to ship it to production soon.

Breaking changes

Another reason that GraphQL recommends that all fields be nullable, even if their current implementation is non-nullable, is that it allows us to turn a non-nullable field into a nullable field as a non-breaking change. This is especially important on mobile where clients live essentially forever. Being able to make a field nullable can be key to being able to delete code.

I don’t have a solution to this problem, but I am curious to learn how well it works in practice. Have users of this approach actually been able to routinely make fields nullable without breaking old clients? Are product engineers really designing apps that gracefully degrade in the face of any field being null? The convergent evolution of @required and CCN’s ! makes me wonder.

Worst case, the approach I outline here would only be viable for clients with a finite support window.

Alternatives to CCN

Our use of CCN to enable this new model is more opportunistic than designed. CCN offers primitives that smart clients can leverage behind the scenes as a compiler implementation detail. The core behavior we really want is:

A schema that exposes the true nullability of fields, at the same time as…
An execution model that performs no null bubbling

This works because we can expect the smart client to intercept error fields before they reach product code, shielding it from nulls in non-nullable locations.
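A deliberately tiny sketch of an execution model with no null bubbling, assuming a resolver tree that carries no nullability information at all (ResolverTree and executeNoBubbling are hypothetical). Each error stays at the exact position where it occurred and is reported in errors, leaving it to the smart client to intercept it:

```typescript
interface ErrorEntry {
  path: string[];
  message: string;
}

// Hypothetical resolver tree: leaves are resolver thunks, nested objects are
// sub-selections. Nullability is deliberately absent: in this execution model
// it no longer affects execution and only describes the data.
type ResolverTree = { [field: string]: (() => unknown) | ResolverTree };

function executeNoBubbling(
  tree: ResolverTree,
  path: string[],
  errors: ErrorEntry[],
): { [field: string]: unknown } {
  const data: { [field: string]: unknown } = {};
  for (const key of Object.keys(tree)) {
    const entry = tree[key];
    const fieldPath = path.concat(key);
    if (typeof entry === "function") {
      try {
        data[key] = entry();
      } catch (e) {
        // The error stays at this exact position; siblings and parents survive.
        data[key] = null;
        errors.push({ path: fieldPath, message: e instanceof Error ? e.message : String(e) });
      }
    } else {
      data[key] = executeNoBubbling(entry, fieldPath, errors);
    }
  }
  return data;
}

const errors: ErrorEntry[] = [];
const data = executeNoBubbling(
  {
    viewer: {
      id: () => "4",
      name: () => {
        throw new Error("name resolver failed");
      },
    },
  },
  [],
  errors,
);
console.log(JSON.stringify({ data, errors }));
// {"data":{"viewer":{"id":"4","name":null}},"errors":[{"path":["viewer","name"],"message":"name resolver failed"}]}
```

Even if the schema declared name as String!, the response keeps viewer.id intact and the null at viewer.name is always paired with an error entry.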

If we think this model is broadly valuable, it’s possible we would want to explore a more explicit mechanism to enable this execution model rather than simply allowing smart clients to fake this execution model via compiler-inserted CCN annotations.

Martin Bonnin · Answer 1 · Sat Sep 02 2023 06:21:57 GMT+0800 (China Standard Time)

Thanks for writing this 👍. About your question:

[...] It allows us to turn a non-nullable field into a nullable field as a non-breaking change. [...] I am curious to learn how well it works in practice. Have users of this approach actually be able to routinely make fields nullable without breaking old clients?

Working on Android, I have anecdotal evidence this is not working in practice.

Quite the opposite, actually: the moment the backend starts sending null for something that clients always assumed to be non-null, not only will clients break, but any preprod/CI check that could have detected this ignores it completely, because the field was nullable in the first place. The check considers the change compatible while, in fact, it's a big change to the initial contract.

So it's "technically not breaking" in theory, but since clients don't know how to handle the null case and often don't prepare for it (maybe they just don't have a way to test it, maybe they're new to GraphQL, maybe they're on a tight deadline, etc.), it becomes quite breaking in practice...

Benjie · Answer 2 · Sun Sep 17 2023 19:22:13 GMT+0800 (China Standard Time)

Excellent write-up and interesting idea.

This means Relay can auto-insert ?s on all non-nullable fields, ensuring resiliency [...]

Relay would need to auto-insert ?s on all fields, since any nullable field could later become non-nullable, which is currently a non-breaking change; with CCN, however, any type change could become breaking.

As an alternative, I've wondered if simply tagging the operation as "everything is nullable" would be sufficient:

query MyQuery($id: ID!)? {
  viewer { ... viewerFragment }
}
# ...

This would, in my opinion, be much cleaner than adding ? to every single field. I find myself wondering if we actually want field-level ? at all, or if just a whole-query "don't worry, I'm smart enough to handle the errors myself" is sufficient.
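Read as sugar, the whole-operation ? could desugar to a ? on every field, which needs no schema lookup and therefore also covers fields that are nullable today but might be tightened later. A sketch over a hypothetical simplified selection shape; letting the operation-level marker override field-level annotations is an assumption of this sketch, not something the discussion settles:

```typescript
// Hypothetical, simplified selection shape (not a real GraphQL AST).
interface SelectionNode {
  field: string;
  annotation?: "!" | "?";
  children?: SelectionNode[];
}

// Desugars a whole-operation `?` into a `?` on every field, regardless of the
// field's current schema nullability. Any field-level annotation is overridden.
function markAllNullable(selections: SelectionNode[]): SelectionNode[] {
  return selections.map((s): SelectionNode => ({
    field: s.field,
    annotation: "?",
    children: s.children ? markAllNullable(s.children) : undefined,
  }));
}

// Prints selections in GraphQL-like syntax.
function printSelections(selections: SelectionNode[]): string {
  return selections
    .map((s) => s.field + (s.annotation ?? "") + (s.children ? ` { ${printSelections(s.children)} }` : ""))
    .join(" ");
}

const operation: SelectionNode[] = [{ field: "viewer", children: [{ field: "name" }, { field: "bio" }] }];
console.log(printSelections(markAllNullable(operation))); // viewer? { name? bio? }
```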

Nullability in GraphQL mixes a few concerns. In inputs it's both "optional" and "nullable" (distinct concepts that GraphQL merges together). In outputs, it's both "can be null" and "is an error boundary". Your idea of exposing the "true" nullability is an interesting one, and would be an improvement for many developers, but figuring out where to add the ? (if not everywhere, as would be the case for Relay) would be challenging.

Referential integrity in a database might guarantee that if a post exists, then the author of the post exists. But as you scale, you might move the post and the author onto different microservices, and suddenly the fact that you could fetch a post no longer means you can fetch the author. The client has no way to "know" this is a possibility, other than the nullability of the field. It almost feels like there should be a different indicator, like "seems like a good place for an error boundary". (Not "field can raise error", because everything could error! More like: does this make sense as a boundary point for errors to stop at?)

Jordan Eldredge · Answer 3 · Mon Sep 18 2023 13:02:42 GMT+0800 (China Standard Time)

@benjie Thanks for taking the time to read, and for your thoughtful feedback. I've put together some thoughts in reply:

I find myself wondering if we actually want field-level ? at all, or if just a whole-query "don't worry, I'm smart enough to handle the errors myself" is sufficient.

Yes, agreed. CCN's ? was what first got me thinking "what if the client could just opt out of null bubbling?", but it would just be a means to an end. I've since made a discussion post in the main Working Group repo, True Nullability Schema, which approaches the idea without assuming any connection to CCN. At the end of the day, this approach just requires some way for smart clients that can handle errors client side to ask to opt out of null bubbling.

In outputs, it's both "can be null" and "is an error boundary".

Right. And if we can opt out of null bubbling, then we are left with just the "can be null", which is much easier to reason about.

Your point about the potential for a once non-nullable field to become nullable is valid; I call out a version of that in the "breaking changes" section. However, data semantics are always going to change over time in all sorts of ways, and I'm not sure a field becoming nullable is a special case of this. It is our job as schema engineers to anticipate these changes when they are likely (making a field nullable in anticipation of a likely re-architecture) or to manage the work of a breaking change (deprecating a field and replacing it with a new nullable one).

It almost feels like there should be a different indicator, like "seems like a good place for an error boundary".

My hope is that we can start to rely more heavily on the errors metadata to represent errors, which would obviate the need for error boundaries in the schema altogether.

In other words, in this new non-bubbling mode, the server makes a different, slightly weaker assertion. Rather than "The data portion of my response is always type-safe with regards to the schema", it says "The data portion of my response, ignoring any fields referenced in the errors metadata, is always type-safe with regards to the schema".
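The weaker assertion can be checked mechanically. A sketch, assuming a simplified schema description and a response produced without null bubbling (FieldDef, Schema, and uncoveredNulls are hypothetical); lists, aliases, and abstract types are left out for brevity:

```typescript
type Path = (string | number)[];

// Hypothetical schema shape: `type` is set only for object-typed fields.
interface FieldDef {
  nonNull: boolean;
  type?: string;
}
type Schema = { [typeName: string]: { [fieldName: string]: FieldDef } };

function samePath(a: Path, b: Path): boolean {
  return a.length === b.length && a.every((seg, i) => seg === b[i]);
}

// Returns every path where a field that is non-null in the schema holds null
// without a matching entry in the errors metadata. An empty result means the
// response meets the weaker guarantee.
function uncoveredNulls(
  value: { [field: string]: unknown },
  typeName: string,
  schema: Schema,
  errorPaths: Path[],
  path: Path = [],
): Path[] {
  const violations: Path[] = [];
  const fields = schema[typeName];
  for (const field of Object.keys(value)) {
    const def = fields[field];
    const fieldPath = path.concat(field);
    const fieldValue = value[field];
    if (fieldValue === null) {
      if (def.nonNull && !errorPaths.some((p) => samePath(p, fieldPath))) violations.push(fieldPath);
    } else if (def.type !== undefined && typeof fieldValue === "object") {
      const nested = fieldValue as { [field: string]: unknown };
      violations.push(...uncoveredNulls(nested, def.type, schema, errorPaths, fieldPath));
    }
  }
  return violations;
}

const schema: Schema = {
  Query: { viewer: { nonNull: true, type: "User" } },
  User: { name: { nonNull: true }, bio: { nonNull: false } },
};
const responseData = { viewer: { name: null, bio: null } };

console.log(JSON.stringify(uncoveredNulls(responseData, "Query", schema, [["viewer", "name"]]))); // []
console.log(JSON.stringify(uncoveredNulls(responseData, "Query", schema, []))); // [["viewer","name"]]
```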