A builtin converting the outer `Data.Constr` to a SOP for fast `case`?

Question

A builtin converting the outer `Data.Constr` to a SOP for fast `case`?

effectfully opened this issue 3 months ago · comments

Intro

There exist multiple ways to create data types in Plutus Core:

Scott-encoding, long retired by now
sums-of-products (SOPs)
via the Data built-in type

SOPs have by far more efficient pattern matching than Data (even with the latter properly sped-up). However at the script interface level we have Data and hence to get that fast pattern matching on SOPs we have to convert Data to a SOP first and this is such an expensive operation that in a lot of cases it completely defeats the purpose as the cost of conversion may significantly outweigh the benefit of having faster pattern matching.

A prime example of Data being generally preferable over SOPs is ScriptContext, since values of this type are huge and a script doesn't need to access all of them. This problem is discussed in detail in this CIP, which proposes to make ScriptContext into a SOP instead of the current Data. This is however problematic in a number of ways:

it's a breaking change to the API
alternative toolchains may not (and some likely do not) deal with SOPs in any way at all as they may find only using Data sufficient, particularly since using Data is simply faster in many cases

What if instead of trying to force SOPs upon everybody whereever possible we simply make pattern matching on Data-encoded data types more efficient, making it competitive with pattern matching on SOPs?

High-level idea

There's one deceptively simple way of making pattern matching on Data-encoded data types faster: just convert a Data.Constr node into the respective Term.Constr node and case on the latter instead using the fast case primitive.

This is how Constr is defined for Data:

data Data
    = Constr Integer [Data]
    | <...>

This is how Constr is defined for UPLC Term:

data Term name uni fun ann
    = Constr !ann !Word64 ![Term name uni fun ann]
    | <...>

Any Data.Constr is embeddable into Term via

converting the Integer value into Word64 when possible (this doesn't work with certain interpretations of the Integer value such as this one, but we ignore those within the scope of this discussion and assume that each Data object represents a well-defined SOP term with Integer value being indices of constructors of the encoded data type)
converting the [Data] value into [Term name uni fun ann] by wrapping each Data with the Constant constructor of Term (plus some additional plumbing that we won't get into)

For example

Data.Constr 0
    [ Data.Constr 1 []
    , Data.I 3
    ]

will become

Term.Constr () 0
    [ Term.Constant () (Some (ValueOf <...> (Data.Constr 1 [])))
    , Term.Constant () (Some (ValueOf <...> (Data.I 3)))
    ]

(where Some and ValueOf are the additional plumbing that we don't get into, since it's largely irrelevant and will likely change anyway)

I.e. the outer Data.Constr becomes a Term.Constr and the contents of the former get embedded as-is, allowing us to case on the new value.

We're going to call the built-in function implementing this behavior constrTermFromConstrData.

Differences with faster pattern matching for built-in types

#5711 elaborates on the approaches for speeding up pattern matching for built-in types, which may sound directly related to the matters described in this issue, however the two discussions are entirely tangential. Making pattern matching for built-in types faster impacts not only Data, but also lists and pairs and potentially other built-in types (maybe even Bool). However without constrTermFromConstrData there's no way to access all the arguments of a constructor at once in a single case expression. I.e. without that builtin we'll have to keep doing what we do now, namely recurse on ds from Constr i ds and extract arguments one-by-one using pattern matching on built-in lists. That recursion happens on the Plutus side, which is very inefficient, plus it requires a linear amount of builtin calls (regardless of how pattern matching for built-in types is implemented) while with constrTermFromConstrData only one builtin call is required (we'll still need a linear amount of work to feed the arguments of the constructor to the function from the respective branch, but that is supposed to be much more efficient than doing builtin calls).

Implementation details

Feel free to skip this section, it's here for completeness.

In practice, we don't embed Data.Constr into Term.Constr, because at runtime built-in functions receives values, not terms, so we need to embed Data.Constr into the type of values that the specific evaluator uses. It's the same logic as with constants and we use a very similar machinery.

So a more accurate name would be constrValueFromConstrData or something, but the users don't really care about these trivialities, so we're going to stick to constrTermFromConstrData, since it's a decent enough and more specific name, plus it's easier to explain the high-level details this way and constrTermFromConstrData kinda looks better in a program ("value" is way too overloaded).

The Haskell denotation of the Plutus builtin looks like this:

constrTermFromConstrDataDenotation
    :: KnownTypeAst TyName DefaultUni sop => Data -> BuiltinResult (Opaque val sop)
constrTermFromConstrDataDenotation (Constr iInteger ds) = do
    iWord64 <- integerToWord64BuiltinResult $ fromValue @val iInteger
    pure . toConstr iWord64 $ map fromValue ds
constrTermFromConstrDataDenotation _ = evaluationFailure

which follows the logic outlined above: the Integer value becomes a Word64 and the Data arguments get turned into values using fromValue (where Value means a Haskell value, not what we've been discussing so far, told you it was way too overloaded).

A partial implementation can be found in this PR.

Typing issue

The runtime semantics of this builtin are trivial, but what type is that builtin supposed to have? Well, a good start is

constrTermFromConstrData : all (sop :: *). data -> sop

which says that the builtin converts data values to an arbitrary type that is named sop, but can be really anything. But this is an overly permissive type signature: we don't want this builtin to claim that it can produce anything, we only want it to claim to produce SOPs. We could try thinking of somehow insisting that sop is indeed a SOP (I've no idea how we could achieve that), but even then, we don't want it to be any SOP, we want it to be some very specific SOP, for example:

constrTermFromConstrDataExample : data -> SOP [[data, data], [], [data]]

If our builtin was limited to [[data, data], [], [data]], then we could check at runtime that

ds in Constr 0 ds has two elements
ds in Constr 1 ds has zero elements
ds in Constr 2 ds has one element

and that builtin would be reasonably typed and safe to have.

But how could we possibly generalize this example version of the builtin without the ability to quantify over the shape of SOPs and constrain those shapes to have a runtime representation, so that it's possible to check the Data.Constrs coming in against that representation? In Haskell we could write something like

constrTermFromConstrDataExample :: forall shape. KnownShape shape => Data -> SOP shape

but Plutus isn't Haskell and we can't turn such a Haskell function into a built-in one as we can't quantify over shape, let alone constrain it with a type class.

Maybe with dependent typing we could somehow connect the given data value with its expected shape, but this doesn't sound particularly nice either and we don't have dependent types in the general Plutus any way, let alone in its builtins.

Builtin families

It is clear that enumerating all practical (let alone possible) instantiations of constrTermFromConstrData as individual builtins isn't gonna work: we're not adding hundreds/thousands of new built-in functions. What we can try however is making constrTermFromConstrData a family of built-in functions rather than a single builtin, so that constrTermFromConstrDataExample is just one member of the family. I.e. we'd have a new constructor of DefaultFun that unlike all other constructors takes an argument with the argument being the shape of the final SOP:

-- | Each element represents the number of 'Data's in a product from a sum of products. E.g.
--
-- > [2, 0, 1]
--
-- represents
--
-- > SOP [[data, data], [], [data]]
type SopShape = [Word64]

data DefaultFun
    -- Integers
    = AddInteger
    | SubtractInteger
    <...>
    | ConstrTermFromConstrDataExample SopShape

This may rule out some use cases (perhaps if one decides to compile an extensional type theory with W-types to Plutus or something), but this seems so insignificant that it's probably not even worth any additional discussion. How often does the programmer not know the arity of a constructor or the number of constructors of a data type?

The real problem is that adding builtin families invalidates all of our assumptions about builtins, which are, in the language of Haskell constraints, Enum, Bounded and Ix. If you have an infinite number of builtins, you can no longer put something for each of them in an array. You can't put their type signatures in golden files or print them to the user. And we do these kinds of things all over the place.

So while retrofitting builtin families into Plutus seems possible in principle, the costs of such an endeavor would be huge and so would be the risk of screwing something up along the way.

Singletons

Introducing builtin families sorta amounts to allowing for type applications to be unerasable for builtins. Which is hard in our existing system, but there's a common way to emulate unerasable types: singletons. Normally singletons require GADTs (which are hard and unergonomic in Plutus Core (see examples) to the point that we prefer to think of them as entirely unsupported), however in our case we might get away with only using a GADT on the Haskell side.

The idea is that we implement the constrTermFromConstrData builtin such that it has the following Plutus type signature:

constrTermFromConstrData : sopOfData sop -> data -> sop

where sopOfData is a built-in type that restricts sop to be a sum-of-products-of-data (such as the SOP [[data, data], [], [data]] example from the above). Each value of a built-in type comes with a runtime type tag, which in this case we can use to establish the connection between the type information and the runtime content of sopOfData thereby essentially propagating type info to the runtime where it can be used to ensure the safety of the conversion from Data to sop.

Assuming it's going to work out, we'll still run into a problem: doing checks at runtime is expensive. Not only will we need to construct and pass around that sopOfData singleton, we'll also need to match on it to ensure that it encodes the exact shape of the given data object. All for the sake of making our type system better. Would, say, Aiken users care about those checks slowing their smart contracts down? I doubt it.

What exactly breaks if we add the unsafe `constrTermFromConstrData`?

Adding the unsafe constrTermFromConstrData will make it possible for a well-typed case expression to fail with an error message like "expected n arguments to constructor C but got m" (I'm shooting for the unreachable ideal error message here). Now of course UPLC doesn't have any types and so such an error is already possible in UPLC anyway, however this isn't the case in TPLC or PIR and by extension in UPLC that we get by erasing TPLC.

Is this a big deal? I don't know. What I know is this is kinda a not-too-unsafe form of unsafeCoerce. Right now one can implement a rather boring safeCoerce function that returns its argument back but with a different type:

safeCoerce :: (ToData a, FromData b) => a -> b
safeCoerce = fromData . toData

One might use if a a type synonym to b for example. If it so happens at runtime that a isn't definitionally equal to b, then it'll be an immediate evaluation failure (unless we have some really weird ToData/FromData instances, which I hope we don't).

Adding the unsafe builtin will allow us to define

unsafeCoerce :: ToData a => a -> b
unsafeCoerce = constrTermFromConstrData . toData

that does not immediately fail with an error at runtime in case a and b aren't definitionally equal. Instead, it'll return a term that fails being cased on (if the other parts of the case-expression are well-typed themselves).

So how bad is this? Quite bad actually! Consider the following UPLC expression (pseudo-syntax):

case constrTermFromConstrData d of
  \x y -> y

It converts a Data value d to a Term, i.e. a Constr with a bunch of fields, and then applies the \x y -> y function to the two fields (recall that in UPLC branches of case expressions are merely functions). But what if constrTermFromConstrData d returns a Constr with three fields rather than two? Well, then the function gets applied to too many arguments and depending on what y is we'll likely get an error sooner or later. Likely, but not necessarily: what if y is a function? Then it gets fed the extra argument that appears last in the overapplied Constr and the semantics of the term completely derails from what was originally intended by the programmer, meaning at that point they're at the mercy of undefined behavior.

The opposite situation when Constr has only one argument is equally bad. In that case the \x y -> y gets applied to only one argument and the whole expression reduces to \y -> y and if evaluation stops here, then that's evaluation success! Certainly something that the programmer did not intend to end up with.

Note that even if we knew how to give a type to constrTermFromConstrData, we'd still have all these issues because the on-chain language is untyped. Even if we can guarantee that our toolchain never produces such mess, we cannot guarantee that other toolchains don't, particularly those that target UPLC directly without using any type system.

Conclusions

The unsafe constrTermFromConstrData is a horrible footgun that we should not add to the language, I believe.
Maybe we can make singletons perform reasonably fast, this should be investigated.

Michael Peyton Jones · Answer 1 · Mon Feb 19 2024 19:44:36 GMT+0800 (China Standard Time)

constrTermFromConstrData : all (sop :: *). data -> sop

So apart from breaking the type system, the other problem with this is that there is no cheap way for the user to "make it safe". If you are given a Constr value of the wrong shape, you can't tell without pointlessly pattern-matching on the resulting SOP term. If you don't do that you just have to pass it around and wait for it to explode when you eventually do pattern-match on it (maybe this is fine?).

This is one reason why it would be helpful to be able to pass the shape in (although indeed that causes a lot of complexity).

Thinking about this, it's also notable that this doesn't let us do a super-fast fromData. Firstly, it gives us a SOP type which has fields of type Data, which is already awkward since your original type might not look like that, so you might need another type to handle it. Secondly you need to recursively process the fields. So you might end up with something like this:

data MyFoo = A Integer | B [ByteString]
data MyFoo' = A' Data | B' Data

instance UnsafeFromData MyFoo where
  unsafeFromBuiltinData d = case constrTermFromConstrData @MyFoo' d of
    A' d' -> A (unsafeFromBuiltinData d')
    B' d' -> B (unsafeFromBultinData d')

Which incidentally does check the shape by pattern matching immediately! Is this faster than what we do today? I don't know.

effectfully · Answer 2 · Tue Feb 20 2024 09:43:05 GMT+0800 (China Standard Time)

@michaelpj sorry, this wasn't ready for review, I was going to add more sections, which I now did, please check them out.

So apart from breaking the type system, the other problem with this is that there is no cheap way for the user to "make it safe". If you are given a Constr value of the wrong shape, you can't tell without pointlessly pattern-matching on the resulting SOP term.

How do you even know how to pattern match on it to ensure safety? Type info is erased, the builtin doesn't know anything about the expectations at runtime unless you make the type info available at runtime like with the singletons approach sketched in the OP. And that latter part is hella inefficient.

Thinking about this, it's also notable that this doesn't let us do a super-fast fromData. Firstly, it gives us a SOP type which has fields of type Data, which is already awkward since your original type might not look like that, so you might need another type to handle it. Secondly you need to recursively process the fields. So you might end up with something like this:

I thought the idea was to speed up the AsData approach rather than fromData? So that instead of the O(n) patterns for each data type with vastly inefficient nonsense like this:

unsafeDataAsConstr -> ((==) 0 -> True, [unsafeFromBuiltinData -> i])

we'd have a single neat match like

case unsafeConstrTermFromConstrData ex of
    -- 0th constructor with 1 argument
    Constructor0_1 (unsafeFromBuiltinData -> i) -> <...>
    -- 1th constructor with 2 arguments
    Constructor1_2 (unsafeFromBuiltinData -> a1) (unsafeFromBuiltinData -> a2) -> <...>

where Constructor*_*s compile to anonymous SOPs.

Performance of general fromData is surely important, but we'll discuss it separately when I finally create an issue about folding builtins.

Or do you think that only optimizing our AsData and Aiken's handling of data types doesn't cover the entire scope of this issue?

Which incidentally does check the shape by pattern matching immediately! Is this faster than what we do today? I don't know.

Yes, but we can't enforce that it's used only this way. Maybe we could wrap it in a CanBlowUp built-in type and say that the exposure to unsafety is limited by that guard?

Michael Peyton Jones · Answer 3 · Tue Feb 20 2024 18:14:20 GMT+0800 (China Standard Time)

How do you even know how to pattern match on it to ensure safety?

I didn't mean we can do it automatically, I meant that the user can do it manually. I.e. if am writing some code that uses constrTermFromConstrData and I want to be sure that I'm getting the type that I expect out of it, how can I do that? The only option is to pattern match, which will force the error (but also do extra work).

I thought the idea was to speed up the AsData approach rather than fromData?

Yes, it definitely speeds up AsData but I was thinking it would help both of them! Since fromData is pretty much just doing the same thing, but recursively.

we'd have a single neat match like

I think you don't need the anonymous SOPs (although that would work) because we can generate an appropriate type for each case:

data MyFoo = MyFoo Data
data MyFoo' = A' Data | B' Data Data

pattern A i -> MyFoo (unsafeConstrTermFromConstrData @MyFoo' -> A' (unsafeFromBuiltinData -> i))
pattern B i1 i2 -> MyFoo (unsafeConstrTermFromConstrData @MyFoo' -> B' (unsafeFromBuiltinData -> i1) (unsafeFromBuiltinData -> i2)

Michael Peyton Jones · Answer 4 · Tue Feb 20 2024 18:29:25 GMT+0800 (China Standard Time)

Adding the unsafe constrTermFromConstrData will make it possible for a well-typed case expression to fail with an error message like "expected n arguments to constructor C but got m"

No, you can get far more errors than that. We can produce any a, so you can do addInteger (unsafeConstrTermFromConstrData d) 1 or (unsafeConstrTermFromConstrData d) 1, etc. etc. So you can generate essentially all the machine errors that come from non-type correct programs. Of course, we handle those in the UPLC machine, so it's fine in that sense.

Now of course UPLC doesn't have any types and so such an error is already possible in UPLC anyway, however this isn't the case in TPLC or PIR and by extension in UPLC that we get by erasing TPLC.

I think the issues with breaking the type system lie in breaking our assumptions. For example:

let !lies = unsafeFromConstrData d
let ~dead = let !x = case lies of ... in 1
in ...
---->
let !lies = unsafeFromConstrData d
let !x = case lies of ... 
let ~dead = 1
in ...

(this isn't quite right but I'm too lazy to get it just right: basically moving a case expression from dead code to live code)

Is this safe? Today we would say that it is: the case can't fail, so we can move it around freely. In the presence of unsafeFromConstrData it is no longer safe, because the case might fail, indeed, literally any expression might fail.

So essentially all of our optimizations would become sound "up-to-use-of-unsafeFromConstrData", which is fine but a bit sad.

effectfully · Answer 5 · Wed Feb 21 2024 05:16:30 GMT+0800 (China Standard Time)

I didn't mean we can do it automatically, I meant that the user can do it manually. I.e. if am writing some code that uses constrTermFromConstrData and I want to be sure that I'm getting the type that I expect out of it, how can I do that? The only option is to pattern match, which will force the error (but also do extra work).

Ah, right.

I think you don't need the anonymous SOPs (although that would work) because we can generate an appropriate type for each case:

That has multiple calls to unsafeConstrTermFromConstrData, will GHC pull them out and turn into a single call?

No, you can get far more errors than that. We can produce any a, so you can do addInteger (unsafeConstrTermFromConstrData d) 1 or (unsafeConstrTermFromConstrData d) 1, etc. etc.

I don't think this can happen? There's a definition of constrTermFromConstrDataDenotation in the Implementation details section above:

constrTermFromConstrDataDenotation
    :: KnownTypeAst TyName DefaultUni sop => Data -> BuiltinResult (Opaque val sop)
constrTermFromConstrDataDenotation (Constr iInteger ds) = do
    iWord64 <- integerToWord64BuiltinResult $ fromValue @val iInteger
    pure . toConstr iWord64 $ map fromValue ds
constrTermFromConstrDataDenotation _ = evaluationFailure

We do ensure that the given Data is a Data.Constr and toConstr can only return Term.Constr/CkValue.Constr/CekValue.Constr, so I don't think that

indeed, literally any expression might fail.

is correct, but maybe I'm missing something.

I think our biggest problem isn't failures though. We don't do any checking of whether the constructor has the expected number of arguments, we just feed them into the clause represented by a function. If there are too many arguments, we'll get a failure, but what if there are too few of them? The thing will evaluate and we'll get a function back, no failures, and that sounds really dangerous, because that can actually make the script succeed when it really shouldn't. Yeah, this part does sound like a serious issue with embracing the unsafety of the builtin.

Michael Peyton Jones · Answer 6 · Wed Feb 21 2024 18:25:26 GMT+0800 (China Standard Time)

That has multiple calls to unsafeConstrTermFromConstrData, will GHC pull them out and turn into a single call?

Dunno, but also we have our own CSE now.

We do ensure that the given Data is a Data.Constr and toConstr can only return Term.Constr/CkValue.Constr/CekValue.Constr

Right, so it will always produce a CekValue.Constr or fail, but a CekValue.Constr is not valid in many places! i.e. if the input Data is a Constr (so the builtin succeeds), but we invoke the builtin at (a -> b), then we will succeed in producing a CekValue.Constr and then fail when we try to apply that as a function.

If there are too many arguments, we'll get a failure, but what if there are too few of them?

Aaaaaaah. This is horrible. Yet another problem caused by not having case include explicit binders, I am increasingly regretting that.

So that means you can't even tell if it worked properly by pattern-matching on it...

effectfully · Answer 7 · Thu Feb 29 2024 07:41:01 GMT+0800 (China Standard Time)

Right, so it will always produce a CekValue.Constr or fail, but a CekValue.Constr is not valid in many places!

Right, thanks. We could handle that by hardcoding a typing rule for constrTermFromConstrData stating that it must be applied to a SOP type -- we have more leeway when it comes to checking types as opposed to connecting types with runtime behavior.

So that means you can't even tell if it worked properly by pattern-matching on it...

Yeah, I've now added a description of this issue to the OP. I'm gonna ask research folks to take a look at the whole thing unless you have a bright idea or something.

Michael Peyton Jones · Answer 8 · Thu Feb 29 2024 17:42:54 GMT+0800 (China Standard Time)

We could handle that by hardcoding a typing rule for constrTermFromConstrData stating that it must be applied to a SOP type

This would be way too restrictive: for the entirety of the time we are in a typed version of PLC the SOP types are hidden behind type abstractions, so not being able to use this on type variables etc. would be a big problem.

A builtin converting the outer `Data.Constr` to a SOP for fast `case`?

Intro

High-level idea

Differences with faster pattern matching for built-in types

Implementation details

Typing issue

Builtin families

Singletons

What exactly breaks if we add the unsafe constrTermFromConstrData?

Conclusions

What exactly breaks if we add the unsafe `constrTermFromConstrData`?