Capturing data encoding in the multiformat

Question

Gozala opened this issue 2 years ago · comments

Irakli Gozalishvili commented 2 years ago

Originally this came up here ucan-wg/ucan-cacao#2 (comment), but it's probably best to continue discussion here. Here is a short summary:

The current version of varsig is specified as follows:

<varint sig_alg_code><varint sig_size><bytes sig_output>

However, since CACAO signs a CBOR payload rather than a JWT payload, we need some way to communicate the payload's encoding in the signature.

One option was to simply expand the list of sig_alg_codes to accommodate more payload formats. However, that would imply allocating a signature code for each signature algorithm per encoding. It also implies that if I create a new codec, I not only have to get a new multiformat code for the IPLD encoding, I also need to get a set of codes for the signature algorithms, which is not great.

For this reason I think we should change the format to the following instead:

<varint sig_alg_code><varint payload_encoding><varint sig_size><bytes sig_output>
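
A minimal TypeScript sketch of how the proposed four-field layout could be serialized. The function names `encodeVarint` and `encodeVarsig` are hypothetical (not taken from any varsig library), and the codes in the usage note are illustrative only.

```typescript
// Unsigned varint as used by multiformats: 7 bits per byte,
// least-significant group first, high bit set on every byte but the last.
function encodeVarint(value: number): number[] {
  if (!Number.isSafeInteger(value) || value < 0) {
    throw new RangeError("varint must be a non-negative safe integer")
  }
  const out: number[] = []
  let v = value
  while (v >= 0x80) {
    out.push((v % 0x80) | 0x80)
    v = Math.floor(v / 0x80)
  }
  out.push(v)
  return out
}

// Hypothetical helper: concatenates
// <varint sig_alg_code><varint payload_encoding><varint sig_size><bytes sig_output>
function encodeVarsig(
  sigAlgCode: number,
  payloadEncoding: number,
  signature: Uint8Array
): Uint8Array {
  const header = [
    ...encodeVarint(sigAlgCode),
    ...encodeVarint(payloadEncoding),
    ...encodeVarint(signature.length),
  ]
  const out = new Uint8Array(header.length + signature.length)
  out.set(header, 0)
  out.set(signature, header.length)
  return out
}
```

For example, `encodeVarsig(0xd0ed, 0x71, new Uint8Array(64))` yields a 69-byte value whose header is `ed a1 03 71 40`: three bytes for the signature code, one for the payload encoding (0x71 is the dag-cbor multicodec code), and one for the size.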

Irakli Gozalishvili · Answer 1 · Mon Nov 28 2022 02:00:40 GMT+0800 (China Standard Time)

Pulling @expede and @oed into this

Brooklyn Zelenka · Answer 2 · Mon Nov 28 2022 02:14:42 GMT+0800 (China Standard Time)

<varint sig_alg_code><varint payload_encoding><varint sig_size><bytes sig_output>

Agreed! Let's do it 💪

Brooklyn Zelenka · Answer 3 · Mon Nov 28 2022 03:16:26 GMT+0800 (China Standard Time)

Okay, so I recognize that I've been a champion for the above previously, but I'm going to be annoying and give the devil's advocate view:

Should it include the hash function?
Is tracking the encoding the responsibility of the payload?

Hash Function

RS256 is RSA + SHA-256. ECDSA is usually paired with SHA-256, but doesn't have to be. We could split these out into separate fields...

<varint sig_alg_code><varint sig_hash><varint payload_encoding><varint sig_size><bytes sig_output>
                     ^^^^^^^^^^^^^^^^^

...which is yet one more field / a byte or two extra.
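
As a rough sketch of what reading the five-field variant could look like, assuming the fields appear exactly in the order shown above. The names `VarsigWithHash`, `decodeVarint`, and `decodeVarsigWithHash` are hypothetical and not part of any published varsig implementation.

```typescript
interface VarsigWithHash {
  sigAlg: number
  sigHash: number
  payloadEncoding: number
  signature: Uint8Array
}

// Reads one unsigned varint (7 bits per byte, low group first) starting at `offset`.
function decodeVarint(
  bytes: Uint8Array,
  offset: number
): { value: number; next: number } {
  let value = 0
  let scale = 1
  for (let i = offset; i < bytes.length; i++) {
    const byte = bytes[i] as number
    value += (byte & 0x7f) * scale
    if (!Number.isSafeInteger(value)) throw new RangeError("varint too large")
    if ((byte & 0x80) === 0) return { value, next: i + 1 }
    scale *= 0x80
  }
  throw new RangeError("truncated varint")
}

// Hypothetical decoder for
// <varint sig_alg_code><varint sig_hash><varint payload_encoding><varint sig_size><bytes sig_output>
function decodeVarsigWithHash(bytes: Uint8Array): VarsigWithHash {
  let offset = 0
  const readVarint = (): number => {
    const { value, next } = decodeVarint(bytes, offset)
    offset = next
    return value
  }
  const sigAlg = readVarint()
  const sigHash = readVarint()
  const payloadEncoding = readVarint()
  const sigSize = readVarint()
  if (offset + sigSize > bytes.length) {
    throw new RangeError("sig_size exceeds remaining bytes")
  }
  const signature = bytes.slice(offset, offset + sigSize)
  return { sigAlg, sigHash, payloadEncoding, signature }
}
```

The sha2-256 multihash code (0x12) fits in a single varint byte, so in that case the extra field costs one byte.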

Encoding

I started writing text here, but convinced myself that including the multicodec of the payload does make sense here. If you're signing e.g. a non-canonicalized JWT, you just signal it as 0x00 raw bytes in the signature.

Irakli Gozalishvili · Answer 4 · Mon Nov 28 2022 15:23:42 GMT+0800 (China Standard Time)

Should it include the hash function?

Do all signature algorithms have hashing functions? If some don’t, the question is what to do with those. Perhaps the “identity” code would do the trick there.

I’m warming up to this idea, in fact we could simply reuse multihash and have format like

<varint sig_alg><varint payload_encoding><multihash>

If you're signing e.g. a non-canonicalized JWT, you just signal it as 0x00 raw bytes in the signature.

I would argue that we need a JWT multicodec code for that, because raw usually implies something else.

P.S.: 0x00 is identity multihash code, 0x55 is raw binary code

Joel Thorstensson · Answer 5 · Mon Nov 28 2022 16:51:06 GMT+0800 (China Standard Time)

I’m warming up to this idea, in fact we could simply reuse multihash and have format like

This is a bit backwards I think. The hashing function is what is used over the canonicalized payload. The signature itself is not a hash, so I don't think we can use multihash here.

Joel Thorstensson · Answer 6 · Mon Nov 28 2022 16:54:08 GMT+0800 (China Standard Time)

I started writing text here, but convinced myself that including the multicodec of the payload does make sense here. If you're signing e.g. a non-canonicalized JWT, you just signal it as 0x00 raw bytes in the signature.

I assume we need a canonicalization alg that describes how you take the payload and encode it as a JWT? If you just have the bytes of a raw JWT string that also needs to be signaled somehow? I guess it depends on how the data structure looks like where you get the JWT string and the signature?

btw, I'd prefer if we call it payload_canonicalization rather than payload_encoding.

Irakli Gozalishvili · Answer 7 · Mon Nov 28 2022 23:48:13 GMT+0800 (China Standard Time)

I assume we need a canonicalization alg that describes how you take the payload and encode it as a JWT? If you just have the bytes of a raw JWT string that also needs to be signaled somehow? I guess it depends on how the data structure looks like where you get the JWT string and the signature?

I mean the step from the data model (of a certain schema) to bytes, which is why I call it encoding; it is the same code as in the CID of the data.

btw, I'd prefer if we call it payload_canonicalization rather than payload_encoding.

But I want to use e.g. dag-cbor or dag-json, depending on how you’ve encoded the model to bytes before signing. Perhaps you’re saying canonicalization is yet another param?

Irakli Gozalishvili · Answer 8 · Tue Nov 29 2022 09:26:43 GMT+0800 (China Standard Time)

@mikeal suggested that we use <multiformat payload_encoding> instead of <varint payload_encoding>. In the common case it could be just a single varint, but it also provides a way to include other canonicalization details in specific instances.

Joel Thorstensson · Answer 9 · Tue Nov 29 2022 17:07:07 GMT+0800 (China Standard Time)

I mean the step from the data model (of a certain schema) to bytes, which is why I call it encoding; it is the same code as in the CID of the data.

No this is not at all what I mean. Why would you need to include which IPLD encoding you are using? I assume you get this from the CID when you load and interpret the IPLD block?

We need a varint that represents how to go from ipld object -> serialized data to sign

For example:

ipld object -> JWT protected header + payload
ipld object -> SIWE message

Basically we need to know how to go from ipld data to the bytestring used to verify the signature.

But I want to use e.g. dag-cbor or dag-json, depending on how you’ve encoded the model to bytes before signing. Perhaps you’re saying canonicalization is yet another param?

I don't see why this wouldn't just be part of the canonicalization alg?

Irakli Gozalishvili · Answer 10 · Tue Nov 29 2022 23:26:01 GMT+0800 (China Standard Time)

@oed we mean the same thing, we just use different terms. An IPLD codec literally takes data and turns it into bytes.

Joel Thorstensson · Answer 11 · Tue Nov 29 2022 23:30:06 GMT+0800 (China Standard Time)

@Gozala but we don't sign over IPLD encoded data. We sign over JWT data or SIWE messages.

Irakli Gozalishvili · Answer 12 · Tue Nov 29 2022 23:36:10 GMT+0800 (China Standard Time)

You can think of both as IPLD encoders, and this came up in another context, where it literally is either dag-cbor or dag-json.

p.s.: I don’t care what we call it

Joel Thorstensson · Answer 13 · Tue Nov 29 2022 23:47:16 GMT+0800 (China Standard Time)

@Gozala I don't really follow. In the case of SIWE we have a bunch of data in various fields of the IPLD object. These are the steps I'm thinking about:

CID -> bytes from blockstore or network
bytes -> IPLD object using the IPLD codec
IPLD object -> SIWE message and signature (this step is what I call canonicalization)
Verify that signature is correct over SIWE message bytes

The other way around:

Generate SIWE message and sign it (signature)
SIWE message and signature -> IPLD object (canonicalization)
IPLD object -> bytes (IPLD codec)
hash(bytes) -> CID

Irakli Gozalishvili · Answer 14 · Wed Nov 30 2022 01:09:56 GMT+0800 (China Standard Time)

CID -> bytes from blockstore or network

bytes -> IPLD object using the IPLD codec

IPLD object -> SIWE message and signature (this step is what I call canonicalization)

Verify that signature is correct over SIWE message bytes

These are the definitions of the IPLD encoder / decoder:

/**
 * IPLD encoder part of the codec.
 */
export interface BlockEncoder<Code extends number, T> {
  name: string
  code: Code
  encode: (data: T) => ByteView<T>
}

/**
 * IPLD decoder part of the codec.
 */
export interface BlockDecoder<Code extends number, T> {
  code: Code
  decode: (bytes: ByteView<T>) => T
}

So your steps are

Fetch bytes for CID
Codec.decode(bytes)
SIWECodec.encode(bytes)
PubKey.verify(SIWECodec.encode(bytes))

You are serializing some data into bytes in some format, which is what IPLD encoder is.
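
To make the analogy concrete, here is a sketch of a SIWE-style payload encoder with the same shape as `BlockEncoder`. Everything in it is an assumption for illustration: `PayloadEncoder` is a simplified stand-in for `BlockEncoder`, the `SiweFields` names are hypothetical, the code `0x300000` is a placeholder rather than a registered multicodec code, and the message template only approximates the EIP-4361 layout with optional fields omitted.

```typescript
// Simplified stand-in for BlockEncoder (no ByteView, plain number code).
interface PayloadEncoder<T> {
  name: string
  code: number
  encode: (data: T) => Uint8Array
}

// Hypothetical field names; a real IPLD representation may differ.
interface SiweFields {
  domain: string
  address: string
  statement: string
  uri: string
  version: string
  chainId: number
  nonce: string
  issuedAt: string
}

const siweCodec: PayloadEncoder<SiweFields> = {
  name: "siwe-sketch",
  code: 0x300000, // placeholder, not a registered code
  encode: (data) => {
    // Approximation of the EIP-4361 message layout, required fields only.
    const message = [
      `${data.domain} wants you to sign in with your Ethereum account:`,
      data.address,
      "",
      data.statement,
      "",
      `URI: ${data.uri}`,
      `Version: ${data.version}`,
      `Chain ID: ${data.chainId}`,
      `Nonce: ${data.nonce}`,
      `Issued At: ${data.issuedAt}`,
    ].join("\n")
    return new TextEncoder().encode(message)
  },
}
```

The point of the sketch is only that "data in, bytes to sign out" has the same signature as an IPLD encoder, whatever the target format is.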

Joel Thorstensson · Answer 15 · Wed Nov 30 2022 01:40:08 GMT+0800 (China Standard Time)

Ok I see what you are saying now @Gozala. Thanks for clarifying!

Interestingly your example above is super clear for a JWT where the signature is part of the encoded message. For SIWE this is not the case. We will have the signed string separately from the signature bytes. There is no official way to encode these two together.

we could do something like this though:

Fetch bytes for CID
decoded = Codec.decode(bytes)
siweStr = SIWECodec.encode(decoded)
PubKey.verify(siweStr, decoded.signature)
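
The steps above can be sketched as one generic function. The interfaces `Decoder`, `PayloadEncoder`, and `Verifier`, and the function `verifyBlock`, are hypothetical names introduced here; fetching the bytes for the CID is left to the caller, and the actual signature check is delegated to whatever key implementation is plugged in.

```typescript
interface Decoder<T> {
  decode: (bytes: Uint8Array) => T
}

interface PayloadEncoder<T> {
  encode: (data: T) => Uint8Array
}

interface Verifier {
  verify: (payload: Uint8Array, signature: Uint8Array) => boolean
}

// The decoded IPLD object is assumed to carry the signature in a
// `signature` field, as in `decoded.signature` above.
interface WithSignature {
  signature: Uint8Array
}

function verifyBlock<T extends WithSignature>(
  blockBytes: Uint8Array,
  blockCodec: Decoder<T>,
  payloadCodec: PayloadEncoder<T>,
  key: Verifier
): boolean {
  // bytes -> IPLD object, using the IPLD codec named by the CID
  const decoded = blockCodec.decode(blockBytes)
  // IPLD object -> the exact bytes that were signed (e.g. a SIWE message);
  // the payload codec is expected to ignore the `signature` field
  const signedBytes = payloadCodec.encode(decoded)
  // check the detached signature over those bytes
  return key.verify(signedBytes, decoded.signature)
}
```

In this framing the `payload_encoding` field of the varsig would tell the verifier which `payloadCodec` to use.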

Irakli Gozalishvili · Answer 16 · Wed Nov 30 2022 01:44:37 GMT+0800 (China Standard Time)

@oed oh yea sorry I forgot to add actual signature into verify, because they're separate in JWT cases as well

Joel Thorstensson · Answer 17 · Wed Nov 30 2022 01:47:50 GMT+0800 (China Standard Time)

@Gozala In JWTs they are not separate?
A JWT should be a string like this:
<base64url-protected-header>.<base64url-payload>.<base64url-signature>

Irakli Gozalishvili · Answer 18 · Wed Nov 30 2022 02:08:57 GMT+0800 (China Standard Time)

@Gozala In JWTs they are not separate?
A JWT should be a string like this:
<base64url-protected-header>.<base64url-payload>.<base64url-signature>

I mean it is, but you still pass first two segments as a payload and third as signature.
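
That split can be sketched as follows. In the JWS compact serialization, the bytes that get signed are the ASCII of the first two segments joined by a dot, and the third segment is the base64url-encoded signature. The helper names `asciiBytes`, `base64urlDecode`, and `splitJwt` are hypothetical, and no signature verification is performed here.

```typescript
const B64URL =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"

// Decodes unpadded base64url text into bytes.
function base64urlDecode(text: string): Uint8Array {
  const out: number[] = []
  let buffer = 0
  let bits = 0
  for (let i = 0; i < text.length; i++) {
    const v = B64URL.indexOf(text.charAt(i))
    if (v < 0) throw new Error(`invalid base64url character at index ${i}`)
    buffer = (buffer << 6) | v
    bits += 6
    if (bits >= 8) {
      bits -= 8
      out.push((buffer >> bits) & 0xff)
      buffer &= (1 << bits) - 1 // keep only the bits not yet emitted
    }
  }
  return Uint8Array.from(out)
}

// A compact JWT is ASCII, so each character maps to one byte.
function asciiBytes(text: string): Uint8Array {
  const out = new Uint8Array(text.length)
  for (let i = 0; i < text.length; i++) out[i] = text.charCodeAt(i)
  return out
}

// Splits <header>.<payload>.<signature> into what a verifier consumes.
function splitJwt(jwt: string): {
  signingInput: Uint8Array
  signature: Uint8Array
} {
  const parts = jwt.split(".")
  if (parts.length !== 3) {
    throw new Error("expected three dot-separated segments")
  }
  const [header, payload, signature] = parts as [string, string, string]
  return {
    signingInput: asciiBytes(`${header}.${payload}`),
    signature: base64urlDecode(signature),
  }
}
```

A library that accepts the entire JWT string would perform this same split internally before checking the signature.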

Joel Thorstensson · Answer 19 · Wed Nov 30 2022 04:06:16 GMT+0800 (China Standard Time)

Pretty sure it differs per implementation. In most that I've seen, you just pass the entire JWT string.

True for both of these:

Joel Thorstensson · Answer 20 · Wed Nov 30 2022 18:48:50 GMT+0800 (China Standard Time)

Trying to figure out how to represent a DagJOSE (JWS) as a varsig.

We have,

<varint sig_alg_code><varint payload_encoding><varint sig_size><bytes sig_output>

Naive approach would be:

sig_alg_code: 0xd0ed
payload_encoding: 0x85

However, this doesn't really cut it since dag-jose only says how to go from a JWS-string to dag-jose bytes, not some arbitrary structure to bytes.

So it seems like we will need to register a new payload_encoding for every possible payload we have?

For example we would need to define:

a SIWE codec that is only usable for the way we represent SIWE + signature on ipld
a UCAN codec that is only usable for the way we represent UCANs in ipld

This means that we also need a specific codec for invocations as well?

Maybe I'm missing something here?