Use a simpler PAE

Question

Use a simpler PAE

MarkLodato opened this issue 3 years ago · comments

In 0.1, we choose PASETO's PAE because it was already written. But it is does require doing some integer-to-bytes conversion and a need to be careful about sizes and endianness. Here is an example where a potential customer was scared off by the perceived complexity of PAE: sigstore/cosign#214 (comment)

For 1.0, perhaps we should choose a simpler function that is easier to describe and implement. Here is my proposal: concatenate the following, separated by a single space:

"DSSEv1" (or whatever we choose in #16)
byte length of UTF8-encoded type ID, encoded in ASCII decimal with no leading zeros
UTF8-encoded type ID
byte length of payload, encoded in ASCII decimal with no leading zeros
payload

Example: DSSEv1 31 https://in-toto.io/Statement/v1 434 {"subject": {...}, ...}

This is basically the same as the current PAE except it is simpler to implement. Note that this is a byte stream that happens to be start with unicode characters, and we never need to decode it.

Edit 1: reorder the fields to match the existing PAE.
Edit 2: clarify that it is a byte stream, and mention version 0.1.

Mark Lodato commented 3 years ago

Done!

Mark Lodato · Answer 1 · Mon Apr 19 2021 22:16:49 GMT+0800 (China Standard Time)

In fact, here is a bash implementation of the suggested simpler PAE:

local payloadTypeLength=$(LC_ALL=C; echo "${#payloadType}")
local payloadSize=$(stat -c %s "$payloadFile")
printf "%s" "DSSEv1 $payloadTypeLength $payloadType $payloadSize "
cat "$payloadFile"

Doing this with the current PAE would be more difficult.

Fredrik Skogman · Answer 2 · Mon Apr 26 2021 18:58:42 GMT+0800 (China Standard Time)

Wouldn't it be beneficial to base64 encode the payload first?

Mark Lodato · Answer 3 · Mon Apr 26 2021 21:31:40 GMT+0800 (China Standard Time)

@kommendorkapten No, I don't think so. Base64 is a transport detail that should get stripped off before it gets to the PAE stage. For example, if you parse the JSON with the protobuf library, it does the base64-decoding for you. Plus the protocol is encoding-agnostic, and base64 isn't used at all for CBOR, Avro, Protobuf, or any other binary encoding. Heck, you might have an "ASCII Armor" encoding like PGP where you wrap at 72 columns, in which case you can't feed it directly into PAE anyway.

Fredrik Skogman · Answer 4 · Tue Apr 27 2021 14:54:54 GMT+0800 (China Standard Time)

Makes as PAE is used to canonicalize the data prior to signature generation.

Heck, you might have an "ASCII Armor" encoding like PGP where you wrap at 72 columns, in which case you can't feed it directly into PAE anyway.

This I didn't get. An ASCII Armor encoding is a regular string (with newlines), so why couldn't that be fed into PAE? The result of PAE is an octet stream that won't be decoded, only used to generate a signature if I got it right.

Mark Lodato · Answer 5 · Tue Apr 27 2021 21:16:56 GMT+0800 (China Standard Time)

There are two steps to signing:

Protocol: Given a payload (bytes) and a payloadType (string), create a signature. PAE is used here to convert the 2-tuple into a byte stream. Base64 doesn't make sense here: payload is already bytes, so it serves no purpose to re-encode it.
Encoding: Given payload (bytes), payloadType (string), and signature (bytes), serialize it into an envelope that the verifier can decode. This is currently JSON. Base64 is needed for payload and signature because those are bytes but JSON only supports strings, and base64 converts bytes to strings.

The protocol is agnostic to the encoding. If you decode a JSON envelope and re-encode as CBOR, the signature should stay valid.

Now consider the proposal where PAE uses a base64-encoded payload:

If the envelope is CBOR, you'd need to temporarily base64-encode just for PAE. That is unnecessary processing and room for error. Do you use standard or URL-safe encoding? Do you include newlines (and if so, wrap at what width)?
If the envelope is ASCII Armor standard base64 wrapped at 72 columns, you'd need to decode it and re-encode as the canonical URL-safe, no-newline base64 encoding. So here, having base64 in PAE does you no good.
If the envelope is JSON but uses standard base64 encoding, again you're going to have to decode and re-encode before PAE.

You could say, "You sign whatever encoding the sender used." OK, but that would break the notion of being able to re-encode the envelope, and what would it buy you? AFAICT, the only difference is that you do the PAE before base64-decoding instead of after. But why is that worth the added complexity?

Fredrik Skogman · Answer 6 · Tue Apr 27 2021 22:03:21 GMT+0800 (China Standard Time)

I think we are talking about different things :)
I'm with you on not base64 encode prior to PAE, as that's how the current spec is written. As PAE in the current spec exactly what you wrote: convert the 2-tuple into a byte stream that is signed.

I looked at your example

Example: DSSEv1 31 https://in-toto.io/Statement/v1 434 {"subject": {...}, ...}

And I jumped the gun and thought that you meant for this PAE function to convert a 2-tuple to a string (not a byte stream).
So this proposal is still for PAE to create a byte stream, where the first few bytes just happens to be ASCII characters?

Mark Lodato · Answer 7 · Tue Apr 27 2021 22:11:10 GMT+0800 (China Standard Time)

Ah, so sorry for the confusion! 😄

Yes, the proposal is for it to still be a byte stream (to be fed to a cryptographic hash function), but the first few bytes just happen to be ASCII because it's easier to deal with ASCII. In particular, printf("%d", length) is easier than little_endian_64bit(length) in most languages.

Fredrik Skogman · Answer 8 · Tue Apr 27 2021 22:16:51 GMT+0800 (China Standard Time)

Got it, thanks for clarifying 👍

Wietse Z Venema · Answer 9 · Wed Jun 16 2021 23:09:40 GMT+0800 (China Standard Time)

Just a minor nit: when you write "concatenate the following elements, separated by spaces" you probably mean "separated by a single space character".

Otherwise, there can be ambiguity, for example when a length is followed by multiple spaces.

Mark Lodato · Answer 10 · Thu Jun 17 2021 02:26:06 GMT+0800 (China Standard Time)

Fixed, though the real text is at #37. Please take a look at that to see if it looks good to you.