dotnet / csharplang

The official repo for the design of the C# programming language

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Simple encoding of unions in C#

MadsTorgersen opened this issue · comments

Simple encoding of unions in C#

Unions are a feature where the runtime representation matters a lot to the trade-off between expressiveness and performance. Here's a simple way to encode type unions that are very expressive, and tagged unions that are type unions instead of an alternative to them. This approach trades off performance compared to others, notably around allocation, in exchange for simplicity and expressiveness.

It might just be the right approach for us! But regardless, it may serve as a useful endpoint of the spectrum, and we can use it to measure any comparative loss of expressiveness, elegance and unification incurred by lower-allocation alternatives.

I won't dwell on syntax. I'll adopt some suggestive syntax, but not discuss it. The focus is on the underlying structures. I lean heavily on the great work of the Discriminated Unions Working Group.

Type unions

A "type union" corresponds to a union of a set of types, as in union (string | int | bool). A value belongs to the union if and only if it belongs to (at least) one of the constituent types - in this case string, int or bool.
Type unions can take the form of anonymous unions in e.g. method signatures, but can also grow up to become declared unions when circumstances make it preferable, e.g. because of verbosity, repeated use or the need to add additional members - similar to how tuples can grow up to be declared as struct records.

A declared union type could look something like:

public union Result : (string | int)
{
    public bool IsEmpty => this is null or "" or 0;
}

Declared unions can't have additional state, but function members can operate on the value of this itself.

Runtime representation

A union type lowers to a struct type which wraps the single object field __value containing the value of the union. The struct declaration will contain some mix of members and metadata to tell the compiler what the constituent types are, and enable it to implement union-specific semantics. In the case of declared unions, any declared function members are lowered to make use of the underlying __value instead of this. E.g.:

[Union(typeof(string), typeof(int))]
public struct Result 
{
    // Generated members
    public object __value;

    // Declared members
    public bool IsEmpty => __value is null or "" or 0;
}

The exact format of the metadata is TBD - the [Union] attribute is just for illustrative purposes.

Anonymous unions lower to a declaration with a compiler-generated name (say __Union1), and the compiler is free to reuse declarations for equivalent union types across the same assembly.

If any of the constituent types are value types, those values will be boxed into the __value field. There is no need to separately record the type of the contents of __value.

If all the constituent types of a union have a shared base class, we might be able to use that instead of object as the type of __value.

This simple layout means that all unions have the same representation in memory, so that they are easily interchangeable, and so that generated types for anonymous unions are optimally shareable.

Also by being a single reference field they are small enough to behave atomically and not be subject to "tearing"; i.e., inconsistent state caused by multiple threads mutating simultaneously.

Finally, by using a reference field for the value, declared union types can be recursive, containing other instances of the same type.

Type testing

The usual mode of consumption for a type union is through a type test, e.g. a type pattern, to see what type its value actually has.

if (result is string s) { ... }

The generated code for this is trivial; just redirect the type test to the underlying __value:

if (result.__value is string s) { ... }

The runtime already knows how to test the type of the underlying member. No need to invent a new encoding.

Implicit conversions

This runtime representation allows us to offer extensive structural conversions between unions, whether declared or anonymous.

Where two union types U1 and U2 are deemed equivalent by the compiler, there is an implicit conversion between them in both directions. (This isn't quite an identity conversion, because unfortunately it won't apply to arrays or generic types over union types).

Where U1 is deemed a subtype of U2 by the compiler, an implicit conversion exists from U1 to U2.

In addition, an implicit conversion exists from each constituent type to the union type, and from the union type to any shared base type of its constituent types.

class B { }
class C : B { }
class D : B { }
union (C | D) u1 = new C();     // C --> union (C | D)
union (D | C) u2 = u1;          // union (C | D) <--> union (D | C)
union (C | D | string) u3 = u1; // union (C | D) --> union (C | D | string)
B b = u1;                       // union (C | D) --> B

The runtime implementation of these implicit conversions is trivial, since it can just wrap the __value of the source in the target type:

__Union1 u1 = new () { __value = new C() };
__Union2 u2 = new () { __value = u1.__value };
__Union3 u3 = new () { __value = u1.__value };
B b = (B)u1.__value;

The declaration of u2 might in fact lower like this instead:

__Union1 u2 = u1;

Since the compiler would have realized that union (C | D) and union (D | C) are equivalent anonymous unions and might have reused the same lowered declaration for both.

The declaration of b might look like this instead:

B b = u1.__value;

Since the compiler would have realized that B is a shared base class of C and D, and might therefore have declared the __value field of __Union1 to be of type B instead of object, thus obviating the explicit cast.

Tagged unions

The general idea of tagged unions (or "discriminated unions") is that each "variant" or "case" of a union has a tag, as well as some contents (zero, one or more values of given types). A given tag can then be matched with a pattern that is syntactically identical to a type pattern:

return result switch
{
    Success(var outcome) => ...,
    Failure(var exception) => ...
}

What if this isn't like a type pattern, but is a type pattern? Consider the following declarations:

public struct record Success(Outcome outcome);
public struct record Failure(Exception e);
public union Result : (Success | Failure);

Assuming result is of type Result, this makes the above switch expression work exactly as expected!

The proposal then is that we unify tagged unions into type unions by saying that a tag is simply a named wrapper type for zero, one or more values. You can deconstruct them immediately upon matching, but you don't have to. As types they are already an existing concept and first class citizens in the language. No new machinery or semantics needed.

Of course, a union type would be free to mix tagged and untagged constituent types, and it might well make sense to do so.

This is not to say that there shouldn't be convenient syntax directly supporting tagged unions, though perhaps it, too, should be more integrated into the syntax of type unions. The main point is that, whatever the syntax, tags, semantically speaking, are just types.

Tag type declarations could certainly be nested within type union declarations, so that they're proprietary to the given union type and don't "pollute" the top-level type space. If there's a dedicated syntax for tagged unions, maybe it generates the tag types nested inside of the union type by default. This might be paired with a feature like #2926 to keep the pattern matching against nested types elegant.

Performance

The main performance downside of this approach to unions is the frequent need for allocation. Value types need to be boxed, and "tag types" would also need to be allocated to go into the reference field of a union type. This seems particularly glaring when talking about a small union of a few small value types.

There are probably some mitigations possible without deviating from the format. For instance, common boxed values (true, false, -1, 0, 1, 2,...) could be cached and reused. But fundamentally, most value types would be boxed most of the time.

Non-allocating alternatives

On the other hand, non-allocating alternatives have performance downsides too. Trying to represent a union value as a struct that has a field for each possible type gets very space inefficient. While some packing is probably possible, it is complicated, as it must respect the runtime's need to keep references and non-reference data separate. So unions of many and/or large value types could become quite wasteful (lots of zeros in memory!), expensive to copy around and complicated to pack and unpack to underlying values.

On top of that, every union type would have a unique layout. Even if two of them are structurally related, any interop between them, such as implicit conversion or equality comparison, would be quite complicated and expensive, or maybe outright prohibited.

Recursive types

We should note that F# represents unions as classes by default. While different in detail from what's proposed here, it leads to a similar amount of allocation. In F#, discriminated unions are used - indeed usually preferred - as an alternative to class hierarchies to describe your data model. While you can explicitly ask for struct unions in F#, those are quite limited in comparison, partly because they cannot be recursive.

If you're doing serious data modeling, chances are you're going to need recursive types - where a value can directly or indirectly contain another value of its same type. In order to represent that, you need something to be a reference type: Value types cannot contain themselves! If union types don't support that directly, you're going to have to go out of your way to explicitly introduce reference types elsewhere. Your code gets more complicated, and the end tally of allocations will end up much the same.

Middle ground?

It's possible that we could pick another struct representation that is still uniform across all union types while avoiding a large portion of allocations. For instance, we could standardize on a scheme similar to https://github.com/JeremyKuhne/ValuePrototype, which in addition to a reference field keeps a certain number of bits on the side for storing small enough values without boxing. Perhaps we could expand the scheme to also efficiently represent "tag types" without extra allocation in many cases.

Compared to the single reference approach, such a scheme would make all union values bigger (perhaps twice as big), taking up more inline space and costing more to copy, but at least not in a way that grows proportionally to the number of constituent types involved. In addition, they would no longer be safe against tearing. But the approach would maintain the all-important property that the same value is represented by the same bits, regardless of which union type it is stored in. As such, it could still straightforwardly support many of the semantic benefits.

Conclusion

I think the described approach compares favorably to others in terms of expressiveness and inherent simplicity. As such it will be useful to measure against as we explore other approaches with different performance characteristics.

Declared unions can't have additional state, but function members can operate on the value of this itself.

It feels like there is overlap here between this and extensions. What's teh difference between a declared union:

public union Result : (string | int)
{
    public bool IsEmpty => this is null or "" or 0;
}

And a similar extension:

public explicit extension Result for (string | int)
{
    public bool IsEmpty => this is null or "" or 0;
}

?

Another area this doesn't seem to handle well woudl be something like (T | U). In this case, due to unification, it would not be possible in all cases to tell which 'case' you were in. That could be considered a positive, or a negative, but def seems good to call out. Specifically, while the union allows a choice of types, you can lose information on which of the choices were there when final testing is done.

The allocations suggested here I think would immediately disqualify this approach for any common monads like Option<T> or Result<T>.

On the other hand, #7016.

This is true. But this is still a particularly bitter pill when you have a union of only struct-types. Having (int | long) always be on the heap doesn't feel great. it feels slightly better if at least one of the union types is itself a reference type (e.g. (string | int)), but still not great.

I wonder if, like generics, there is a middle ground where there is a single object can be used for any reference type, where the value types get special treatment. Seeing your own 'middle ground' section gives me confidence that we could be very good for a large set of unions, especially the most critical types we'd want the runtime/ecosystem to unify on :)

I still feel that a simple implementation of tagged unions covers the majority of use cases. Trying to implement type unions as something native, without deep runtime support for it, feels like it would be a very leaky abstraction where it's easy to run into the warts. Type unions can just as easily be emulated with tagged unions, making it obvious to the developer exactly what type they're working with rather than trying to hide it behind a lot of chicanery (and hidden allocations).

Declared unions can't have additional state, but function members can operate on the value of this itself.

It feels like there is overlap here between this and extensions. What's teh difference between a declared union:

public union Result : (string | int)
{
    public bool IsEmpty => this is null or "" or 0;
}

And a similar extension:

public explicit extension Result for (string | int)
{
    public bool IsEmpty => this is null or "" or 0;
}

?

There's a lot of overlap. The extension version would probably lead to two nested wrapper types, but that might not be a real issue. Mostly I wanted to highlight that declared unions were easily part of the model, without implicating another large feature. 😊

Another area this doesn't seem to handle well woudl be something like (T | U). In this case, due to unification, it would not be possible in all cases to tell which 'case' you were in. That could be considered a positive, or a negative, but def seems good to call out. Specifically, while the union allows a choice of types, you can lose information on which of the choices were there when final testing is done.

Do you mean when T and U are type parameters?

I think there's a set of questions around how type unions interact with generics. Do they nest? Do they merge? I considered this to be somewhat orthogonal to the representation, so I left it out.

There's a lot of overlap. The extension version would probably lead to two nested wrapper types, but that might not be a real issue

Gotcha. Primarily i was approaching this from the perspective of "could we avoid having one of these if we have hte other?" So, if extensions cover this case, then it would be nice (for me) if we only did the extension route, keeping things simpler in the user facing model :)

Do you mean when T and U are type parameters?

Yes.

I considered this to be somewhat orthogonal to the representation, so I left it out.

WFM.

The allocations suggested here I think would immediately disqualify this approach for any common monads like Option<T> or Result<T>.

One goal of this proposal is to get away from "immediately disqualify" and get all options on the table. If we don't like the allocation profile of a given proposal, then that's part of the discussion. Maybe we can find other solutions short of discarding the whole thing out of hand. E.g., if memory serves me right, F# special cases Option<T> by using null to represent None.

In general, I really think F# is an important data point here. Discriminated unions are the main abstraction mechanism in the language, they use classes - and hence allocation - nearly everywhere, and yet the world doesn't come crashing down on them.

I'm not saying allocation doesn't really matter. It does! But other things matter too. With a proposal that optimizes for something else (expressiveness and simplicity), we can get a sense of the cost of a better allocation profile.

In general, I really think F# is an important data point here. Discriminated unions are the main abstraction mechanism in the language, they use classes - and hence allocation - nearly everywhere, and yet the world doesn't come crashing down on them.

I would just note that F# 4.1+ added struct tuples, struct records, and more to the point, struct unions. (Struct tuples were partially motivated by C# interop, but performance was also cited as a primary motivation.) So even in F#, allocations were enough of a concern to warrant fairly major language changes.

@MadsTorgersen

One goal of this proposal is to get away from "immediately disqualify" and get all options on the table. If we don't like the allocation profile of a given proposal, then that's part of the discussion.

Understood, and as an exploration then it's definitely appropriate to consider all options (pun intended).

For tagged unions I personally think it's worthwhile to have both reference- and struct-based versions, just like F# does. I'd agree that an allocating version is probably fine in most cases. For a DU that wraps a value in a context, my opinion is that an allocation would make it too expensive.

Thanks, Mads, this is an excellent middle ground proposal.

If the value that you are switching over is a boxed union (typed as object) can you ever match any of the constituent types? How would the generated code get at the value? Or is boxing of these unions disallowed or maybe special cased with runtime magic like with Nullable<T>? If so, can you unbox into a different union type?

If you are going to use records as implementations of tags w/ values for tagged union on top of type unions, and the value of the union is stored as a reference always, why not just use record classes instead of record structs and avoid boxing?

There is still an allocation, but it avoids other potential tearing issues when you use the constituent types by themselves.

If the value that you are switching over is a boxed union (typed as object) can you ever match any of the constituent types? How would the generated code get at the value? Or is boxing of these unions disallowed or maybe special cased with runtime magic like with Nullable<T>? If so, can you unbox into a different union type?

That is an excellent question! I think it applies to any approach that encodes anonymous unions as structs. I don't think the answer can be that boxing is disallowed, because in generic contexts we may not know that the thing we are boxing is a union.

One answer is that you can match it against a union type. If you use a union type in a type pattern, the compiler could generate code to bridge the gap between this union type and the one it actually is, if they are structurally compatible. That last bit might be tricky to establish.

Another answer is that all type tests will check to see, as a fallback, if the value is a union, and if so, dig out its contained value and to the test against that one. This is a place where having a shared representation across all unions can really help, actually!

In general, there seems to be a slew of open design questions around what to do when we don't know that we're dealing with a union, e.g., because of boxing or generics. I just steered clear of all of those! 😄

If you are going to use records as implementations of tags w/ values for tagged union on top of type unions, and the value of the union is stored as a reference always, why not just use record classes instead of record structs and avoid boxing?

There is still an allocation, but it avoids other potential tearing issues when you use the constituent types by themselves.

Good point! I think you're right in general.

I used struct records partly because I was thinking ahead to the "Middle ground?" alternative, where if they come in as structs there's a chance that we can stash them without allocation.

Could we take a page from Nullable and have the runtime have special handling for the boxing side of this? That way you don't ever see the wrapper when boxed, only when projecting back into the struct-union form.

Another answer is that all type tests will check to see, as a fallback, if the value is a union, and if so, dig out its contained value and to the test against that one.

Wouldn't that make all type checks more expensive?

IMO, trying to dig a type union out of a box only sounds like another reason to not do type unions, at least not without deep runtime support that makes this entirely transparent (and a zero-cost abstraction) to the language.

Could we take a page from Nullable and have the runtime have special handling for the boxing side of this? That way you don't ever see the wrapper when boxed, only when projecting back into the struct-union form.

That's a neat idea. When boxing, the runtime simply removes the wrapper and produces the already boxed contents! This seems most realistic with the simple wrapped-object-reference scheme though, otherwise the runtime would need to know some gnarly details about how to unpack the contents of the union struct.

Wouldn't that make all type checks more expensive?

Yes, probably.

otherwise the runtime would need to know some gnarly details about how to unpack the contents of the union struct.

It seems like it would potentially be viable with the hybrid model as well. Runtime would def have to know about the special encoding there, but that doesn't seem so bad. This would at least make small-unmanaged-structs and reference types non-boxing, which would def be a win for me.

Apparently the current metadata is blocking us to have an efficient implementation of all those kinds of modern features. Another approach would be introducing metadata v3.0 to provide all the necessary runtime support for union types, extensions, const generics, variadic generics, HKT and etc. Then we will be able to have a high-performance no-workaround-needed straight-forward efficient implementation of those features.

An out-of-topic thing is that roslyn and BCL are opting-out some language features to save allocations, namely interface-based LINQ.

Then we will be able to have a high-performance no-workaround-needed straight-forward efficient implementation of those features.

To be explicit, it allows efficient implementation. The layout of union of structs still need to be defined somewhere, but it can be decided by the runtime layout algorithm.

The performance implications are really off-putting. I agree with the others that this sort of thing just desperately needs proper runtime support, otherwise it simply doesn't seem like a good idea.
@hez2010's comment describes what is almost certainly the right direction here.

Is it possible to solve the performance problem by lowering to something like IntPtr _value; byte _type;, then unsafe-storing stuff into _value - be it a reference, a small value type (the 99% of the use cases), and if it's a large value type, then box it and do the reference approach? Is that even feasible from the runtime perspective? Forgive my ignorance if it's a silly question...

PS. Off the top of my head, this is probably difficult from the perspective of GC tracking...

FWIW, I think Swift has a concept of embedding the payloads of enum cases when they are small, and the storage can be shared, and using a polymorphic reference type or something when they are large.

I've been thinking about this off and on for while and have broadly come to the conclusion that we need to decide on semantics and declaration forms before we deep dive into encoding.

The main issue at hand is this: what we're calling "type unions," in set theory, are just called "unions." What we're calling "tagged unions," in set theory, are called "disjoint unions". These are not the same set operations. You can, in some cases, construct one from the other. But I don't think that this is an argument for why we should prefer one over the other, how they should play in the language, or what semantics will mesh best with C#.

I see unions as being useful in small, simple cases, like we use tuples today. I see disjoint unions as being useful as top-level data types, like we use classes and structs today.

I'm not sure either is or should be mutually exclusive. And if I had to choose one, I would probably prefer disjoint unions, for much the same reason that I would probably prefer classes over tuples if I could only pick one.

I needed to build something like this to handle variant parameters, and came up with a fairly compact codewise solution for my common case which were smaller structs (e.g. int, long, DateTime etc).
Essentially, I used a nint field, and a Marker generic class for storage for small structs, and for larger structs, and class types I would store them in the object field. This meant my struct was always allocating 16 bytes, but it was a decent comprise.

The downside was to get good performance for nullable required some fairly ugly hacks, as I wanted to essentially treat it like int and int? would be represented the sameway in the variant object.

Obviously, it would be great that instead of having to call value.Value, the runtime would just be able to box it to the correct type. And similiarly instead of having to check value.Has.Value we could just do a null check.

Here is an older verison of my solution::
https://gist.github.com/mburbea/7164f02722c5ac3a311a72a76c73c0e5

To elaborate on #7544 (comment)

I think there are a couple problems that crop up in the encoding that lead to important semantic discussions.

In particular, @mattwar raised the very good question, "how does this work if the union is boxed"? One answer is, the runtime could do the indirection.

But I think that's putting the cart before the horse. My inference of why we would need the runtime support is to keep the Result wrapper. And it sounds like we would want to keep the Result wrapper because it would have some unified syntax between unions and sum types.

But I think there are other questions in this area. The question of whether or not we should support value sum types, even if they are second class, like F#, is still relevant. But I also think generic definitions aren't obvious in this system. Consider

union Tree<T> : (Leaf(T Value) | Node(T Value, Tree<T> Left, Tree<T> Right);

This seems perfectly reasonable as a sum type. On the other hand, consider:

union IntOrList<T> : (int | List<T>);

This is still OK, but starts to raise some questions. For instance, the declaration form is now mandatory. There's no place on the anonymous form ((int | List<T>)) for generic type parameters. So you must define this at the top level. That removes a bit of the nice symmetry.

But then, it's worth noting that the sum type form doesn't have such a symmetry either -- it always requires the declaration form and has no anonymous form.

There's also some weirdness about the Result lowering -- if the wrapper type is generic in this formulation, then there are type identity questions around different instantiations. Consider the overlapping case:

union TOrU<T, U> : (T | U);

Is TOrU<string, int> assignable to TOrU<int, string>? Default C# rules would presumably say "no", and I think for good reason. But under general set semantics, the answer should be "yes." The disagreement here is from the declaration form enforcing nominal typing semantics onto a form that otherwise would not have such rules.

In both of these instances, I think you run into cases where fusing the syntax starts to push towards semantics that either have difficult-to-resolve conflicts, or just outright don't make sense. So once again, this starts to feel a little like tuples. In the anonymous form you have simple expressions, but if you want a more complex data type, you use a record.
Notably in C# we don't reuse the anonymous form and say something like,

record Point : (int X, int Y);

We probably could do something like that if we went back in time and reworked these from the start, but we just didn't choose that path.

If we decide that we like the abstract semantics where sum types can have generic parameters but unions can't, we can have that by separating the declaration forms. And if we want to require sum types to have top-level declarations, but don't want that for unions we can have that as well. And if we want to support value sum types but not value unions, we can do that later.

Combining the forms into one super-rich declaration form, and then either ending up with weird semantics, or choosing semantics we wouldn't otherwise like, feels like pressure to unify into a neat implementation package. If we just split the concepts, we can better align with desired semantics and pick the appropriate representation for each of them.

Moving this here from the other thread.

[Tearing] There isn't a simple fix here, just a set of tradeoffs.

Thinking out loud.

  1. The runtime would automatically box these values according to a two simple rules:
    a. If the type appears in a ref type, then it must be boxed.
    b. [edited] Unless the field is lifted to a readonly field (i.e. it exists as an immediate readonly field on the ref type, or is arbitrarily nested in struct fields that are all readonly)
  2. For methods that take struct enums by ref (ref, in, out), either:
    a. Likely least overhead: the runtime would have to introduce implicit const generics on all methods where struct enums are taken by ref (ref, in, out), the generic const bool indicating whether its corresponding parameter is boxed.
    b. A simpler solution: use a fat pointer.

(1.a.) is the sum of how data gets on to the heap - while I could be wrong, the actual ruleset couldn't be that much more complicated.

(1.b.) is an optional opportunistic optimization.

(2.a.) This could cause the amount of machine code to explode but, to contradict myself, it would only occur for methods where enum structs are used in different way.

(2.b.) A pointer to the start of a variable-length struct. The first word would contain a tag, which would be Tags ∪ { PointerTag } (where PointerTag is Tags.Max + 1). The second word onwards would either be a pointer to the actual tagged union in the heap, or the start of the values of the tagged union on the stack. This works because tearing isn't a concern for values on the stack.

What really sucks about my idea is that arrays would necessarily contain pointers, I considered this workaround but it doesn't actually work:

struct ReadonlyContainer<T> {
   public readonly T Value;
}

void Foo() {
  var arr = new ReadonlyContainer<MyEnum>[...];
}

As for how to represent them in C#, I think that this "accidental" approach using FieldOffset is fraught with surprising behavior. How about something like this:

[TaggedStructLayout] // Considered LayoutKind.Tagged, but that would likely break some existing code somewhere
// Only valid on readonly structs
readonly struct MyEnum {

[TaggedStructMember(TaggedStructMemberKind.Tag)]
readonly byte _tag;

[TaggedStructMember(TaggedStructMemberKind.Union)]
readonly int _a;
[TaggedStructMember(TaggedStructMemberKind.Union)]
readonly String _b;
[TaggedStructMember(TaggedStructMemberKind.Union)]
readonly decimal _c;

// Unused by enums, but no reason why this couldn't exist
[TaggedStructMember(TaggedStructMemberKind.Value)]
readonly decimal _e;

// TaggedStructMemberKind.Accessor set by this attr ctor
[TaggedStructMember(0, nameof(_a))]
public int A => _a;

[TaggedStructMember(1, nameof(_b), nameof(_c))]
public (string, decimal) B => (_b, _c);

// Bonus feature for users building these by hand/source-generator:
// Error: `_a` does not appear in the list of tagged fields for this accessor.
[TaggedStructMember(2, nameof(_b))]
public int C => _a;
}

That could ship in the compiler today with no runtime support, and the runtime could add support for it later.