Double-signing prevention (MVP for launch)

Question

tarcieri opened this issue 6 years ago

This is a tracking issue for KMS double-signing prevention.

Launch goal: attempt to prevent validator bugs or data loss from causing the validator to double sign by making the KMS aware of the current block height.

Longer-term goal: provide defensive capabilities / survive compromise of validator hosts. See #115 for discussion of post-launch double signing improvements.

Previous discussion:

Current Status

The main blocker for implementing this right now is #111 - every key in the KMS needs to be tagged with a tendermint::chain::Id, so the KMS can look up the tendermint::block::Height for that particular chain.
The signatory-ledger-cosval signer can track the current block height (i.e. it passes the current block height to the Cosmos app running on the Ledger hardware device, which persists the previously signed block height) and will refuse to double-sign.

Launch Plan

Add support to the KMS for a user-configurable subcommand for obtaining the current block height. This can be used to "bootstrap" a block height value when a KMS process is started. From there, the KMS can track the last block it signed.

For example, the KMS could call out to a shell script which hits the /status RPC endpoint for a validator's sentries, piping the output through e.g. jq to extract "latest_block_height" and sorting the results, taking the highest value. An example script can be included in the KMS repo which people can customize to their needs.
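As a rough illustration of such a hook, here is a minimal Python sketch rather than the shell/jq script described above. It assumes each sentry's /status response carries the height at result.sync_info.latest_block_height (encoded as either a string or a number); the function names and the idea of passing sentry base URLs as a list are hypothetical, not part of the KMS.

```python
import json
import sys
import urllib.request


def height_from_status(status_json: str) -> int:
    """Extract latest_block_height from one /status response body."""
    doc = json.loads(status_json)
    # Assumption: the height lives at result.sync_info.latest_block_height
    # and may be encoded as a string or as a number.
    return int(doc["result"]["sync_info"]["latest_block_height"])


def highest_height(status_bodies) -> int:
    """Return the highest height reported among the response bodies."""
    heights = [height_from_status(body) for body in status_bodies]
    if not heights:
        raise ValueError("no sentry returned a usable /status response")
    return max(heights)


def fetch_status(base_url: str, timeout: float = 3.0) -> str:
    """Fetch the raw /status body from one sentry (hypothetical base URL)."""
    with urllib.request.urlopen(f"{base_url}/status", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def bootstrap_height(sentry_urls) -> int:
    """Query every sentry, skip unreachable ones, and take the maximum."""
    bodies = []
    for url in sentry_urls:
        try:
            bodies.append(fetch_status(url))
        except OSError as err:  # urllib's URLError is a subclass of OSError
            print(f"warning: {url}: {err}", file=sys.stderr)
    return highest_height(bodies)
```

A hook script built on this would call bootstrap_height with the operator's sentry URLs and print the result for the KMS to read. Taking the maximum errs on the side of refusing to sign old heights if one sentry lags behind.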

This should allow validators to choose whatever mechanism they like for providing the KMS with the current block height, and implement e.g. storing the current block height in external databases as proposed in #11.

Longer-Term Plan

See #115.

Juan Leni · Answer 1 · Thu Oct 04 2018 01:49:15 GMT+0800 (China Standard Time)

Just a short clarification: signatory-ledger-cosval is agnostic to these concepts; it just passes messages to the Ledger device. It is actually the Ledger app/firmware that tracks the current block height and round.

The device only signs blocks in incremental order. To define the initial/current block height, the block height of the first KMS request after the device is plugged in is used as a reference. Setting this initial value requires manual user confirmation on the Ledger device.

A similar approach could be used by signatory providers. A good idea would be to create a signatory provider wrapper with this functionality.

Tony Arcieri · Answer 2 · Thu Oct 04 2018 02:09:49 GMT+0800 (China Standard Time)

The device only signs blocks in incremental order.

@jleni does it just preserve monotonicity, or does it require each signed block immediately follow the previous one?

I am wondering about things like failover.

Juan Leni · Answer 3 · Thu Oct 04 2018 02:21:02 GMT+0800 (China Standard Time)

This behavior is according to the specs. It signs in monotonic order; the heights do not need to be sequential.
For instance:

n, n+1, n+3 (all are signed)
n, n+2, n+1, n+3 (only n+1 is rejected)
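The two sequences above can be reproduced with a small sketch of a monotonic height check. This is an illustration only, not the Ledger app's code: it tracks height alone (the device also tracks the round), it accepts the first request as the reference without the manual confirmation step, and the HeightTracker and replay names are hypothetical.

```python
class HeightTracker:
    """Tracks the last signed height and rejects non-increasing requests."""

    def __init__(self):
        # Unset until the first request, which becomes the reference height.
        self.last_height = None

    def try_sign(self, height: int) -> bool:
        """Return True if the height may be signed, False if it is rejected."""
        if self.last_height is not None and height <= self.last_height:
            return False  # same or lower height: would risk a double sign
        self.last_height = height
        return True


def replay(heights):
    """Feed a sequence of heights to a fresh tracker and collect the results."""
    tracker = HeightTracker()
    return [tracker.try_sign(h) for h in heights]


n = 10
print(replay([n, n + 1, n + 3]))         # gaps are allowed
print(replay([n, n + 2, n + 1, n + 3]))  # only the n+1 request is rejected
```

Because only the ordering matters, a gap in heights (for example after a failover) does not block later signing, while any request at or below the last signed height is refused.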

Adding failover/HA to the KMS is an interesting follow-up. It might actually need KMS/signatory arbitration plus something like Raft (consensus) to handle these cases.

Martin Dyring-Andersen · Answer 4 · Fri Nov 23 2018 16:55:19 GMT+0800 (China Standard Time)

My understanding - please correct me if wrong - is that HSM2 double signing prevention will be implemented by tracking the last signed height, which is persisted in one of the slots of the device.

KMS will then need to update this slot before signing each block, and should ideally read the data back to ensure it was stored correctly.

To make this as robust as possible, if the update/read cycle fails, KMS should complain loudly but continue operating in a degraded state. It could still prevent double signing by using locally cached data, and I guess(?) the HSM2 might still continue signing.

Apologies if this seems trivial, but I thought it was worth stressing since I couldn't find any MTBF data on the HSM2. Validator failure due to write wear on the HSM would be the worst.

TL;DR - the failure characteristics of the underlying devices (YubiHSM2, Ledger etc.) should be carefully considered. They might not be designed for write intensive operations, including double signing prevention, and could wear out.

Tony Arcieri · Answer 5 · Sat Nov 24 2018 03:45:17 GMT+0800 (China Standard Time)

@mdyring that's a good point. I will investigate that before I go forward with this approach.

Juan Leni · Answer 6 · Sat Nov 24 2018 04:05:42 GMT+0800 (China Standard Time)

I can answer for the Ledger Nano S. Yes, NVRAM is rated at 500k erase/write cycles. Actually, it is a bit more complicated due to write amplification, as pages are aligned at 64-byte boundaries. Anyway, we had these limitations in mind. Block numbers/rounds are tracked in RAM, not NVRAM, to avoid this issue. When the device is plugged in, it will skip the first votes and ask the user for confirmation to align with current values. I hope this answers your question.

Martin Dyring-Andersen · Answer 7 · Sat Nov 24 2018 18:27:19 GMT+0800 (China Standard Time)

Block number/rounds are tracked in RAM, not nvram to avoid this issue. When the device is plugged will skip first votes and request the user for confirmation to align with current values. I hope this answers your question.

That seems like a sensible solution. In case KMS needs to be restarted, would this require physical access to the ledger device for confirmation? Ideally it should be possible to restart/update software remotely.

Alternatively, KMS could signal a "clean shutdown" to the device which can then write to nvram and use that information on next start? (this could be useful in cases where a server needs to be power cycled)

Tony Arcieri · Answer 8 · Wed Dec 12 2018 23:44:15 GMT+0800 (China Standard Time)

We talked with Yubico about wear on the flash. One of their reps suggested it wouldn't be an issue, although it's something I'm loath to risk without a precise MTBF. The last thing we want is a bunch of validators dying at roughly the same time because they all wore out their flash.

Adrian Brink · Answer 9 · Thu Feb 14 2019 22:25:48 GMT+0800 (China Standard Time)

Alternatively, KMS could signal a "clean shutdown" to the device which can then write to nvram and use that information on next start? (this could be useful in cases where a server needs to be power cycled)

This case won't work with the Ledger, since a power cycle will restart the Ledger. In order to unlock the Ledger you have to physically enter your PIN, which means you have to be at the datacenter anyway. That's why persistent storage for height/round is not as important with the Ledger.

Tony Arcieri · Answer 10 · Thu Feb 28 2019 01:09:48 GMT+0800 (China Standard Time)

I'm just about ready to (finally) start work on this. Here is a tentative plan:

1. Add a configuration section to tmkms.toml for the Tendermint blockchain networks the KMS is operating on. This can include the Bech32 prefixes used by that network (addressing #178, although it's still unclear if all Tendermint networks will adopt Bech32).
2. Make the chain ID information configured for keys first-class (it's presently a string). This will tie signing keys directly to specific chains (addressing #111, which is something of a showstopper security issue).
3. Persist state files for each network containing e.g. the block height. The exact details of this are still TBD.

There was some earlier discussion of persisting this information in e.g. YubiHSM2's opaque data. Due to concerns about write wear, I don't think this is a good idea.

Another alternative would be introducing some kind of embedded database, e.g. sled, LevelDB, or LMDB. That seems like a complicated change to introduce right now, and also something where the other potential / future persistence needs of the KMS need to be considered.

The existing priv_validator_state.json files provide at least something of a known quantity people are familiar with. I would suggest implementing a similar approach (but with at least one of these files per Tendermint network/chain). This is a KISS solution that can potentially be replaced by an embedded (or non-embedded) backend database, but only once we're actually ready to cross that bridge.
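To make the per-chain state file idea concrete, here is a minimal Python sketch. It is not the format the KMS ended up using: the file naming scheme, the single "height" field, and the function names are all assumptions for illustration, and it presumes chain IDs are safe to use in file names.

```python
import json
import os
import tempfile


def state_path(state_dir: str, chain_id: str) -> str:
    # Hypothetical naming scheme: one JSON state file per chain ID.
    return os.path.join(state_dir, f"{chain_id}_state.json")


def load_height(state_dir: str, chain_id: str) -> int:
    """Return the last recorded height for a chain, or 0 if none exists."""
    try:
        with open(state_path(state_dir, chain_id)) as f:
            return int(json.load(f)["height"])
    except FileNotFoundError:
        return 0


def record_height(state_dir: str, chain_id: str, height: int) -> None:
    """Persist a new height before signing; refuse non-increasing heights."""
    if height <= load_height(state_dir, chain_id):
        raise ValueError(f"refusing to sign {chain_id} at height {height}")
    # Write to a temporary file and rename it over the old one, so a crash
    # mid-write leaves the previous state file intact.
    fd, tmp = tempfile.mkstemp(dir=state_dir)
    with os.fdopen(fd, "w") as f:
        json.dump({"height": str(height)}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, state_path(state_dir, chain_id))
```

A signer would call record_height for the chain in question and sign only if it returns without raising. Keeping one file per chain means the heights of different networks can never be confused with one another.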

Juan Leni · Answer 11 · Thu Feb 28 2019 02:32:01 GMT+0800 (China Standard Time)

With respect to 2: can we have something like this? #177 (comment)

keys = [{ id = "gaia-6000", pubkey="123....", key = 1 }]
keys = [{ id = "gaia-7000", pubkey="123....", key = 2 }]
keys = [{ id = "gaia-9000", pubkey="456....", key = 1 }]

This would be more secure and would allow supporting several devices connected to the same KMS.

Tony Arcieri · Answer 12 · Sat Mar 02 2019 12:01:11 GMT+0800 (China Standard Time)

@jleni I'm not sure that syntax is valid TOML... it seems like you want:

[[keys]]
id = "gaia-6000"
pubkey="123...."
key = 1

[[keys]]
id = "gaia-7000"
pubkey="123...."
key = 2

?

I'd agree the config syntax needs changes, but the main thing that needs to be solved, particularly in the context of this issue, is expressing an m:n mapping between Tendermint networks/chains and keys.

With your proposed syntax, I think we'd need at least:

{ key = 1, chains = ["gaia-6000", "gaia-7000"], ... }
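As a sketch of how such an m:n mapping might be consulted before signing, the following Python models the parsed config as plain dictionaries. The entries, the second key, and the function names are hypothetical; this is not the tmkms config schema.

```python
# Hypothetical parsed config: each key lists the chains it may sign for.
KEYS = [
    {"key": 1, "chains": ["gaia-6000", "gaia-7000"]},
    {"key": 2, "chains": ["gaia-7000"]},
]


def keys_by_chain(entries):
    """Invert the key -> chains lists into a chain -> set of keys index."""
    index = {}
    for entry in entries:
        for chain in entry["chains"]:
            index.setdefault(chain, set()).add(entry["key"])
    return index


def may_sign(index, chain_id, key) -> bool:
    """A key may sign only for chains it is explicitly mapped to."""
    return key in index.get(chain_id, set())


index = keys_by_chain(KEYS)
print(may_sign(index, "gaia-6000", 1))  # key 1 is mapped to gaia-6000
print(may_sign(index, "gaia-6000", 2))  # key 2 is not
```

Building the index once at startup keeps the per-request check a simple set lookup, and an unknown chain ID is rejected by default.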

Juan Leni · Answer 13 · Sat Mar 02 2019 17:48:10 GMT+0800 (China Standard Time)

Yes sorry! :) I was just trying to explain the high level idea.

Martin Dyring-Andersen · Answer 14 · Thu Mar 07 2019 03:40:28 GMT+0800 (China Standard Time)

Regarding support for multiple chains, it would be very nice if a single tmkms instance could support multiple (tendermint based, running compatible version, etc.) projects, such as IRIS, Cosmos, IOV, etc.

Could it be something as simple as making the HRP of Bech32 addresses configurable?

Tony Arcieri · Answer 15 · Thu Mar 07 2019 03:45:15 GMT+0800 (China Standard Time)

@mdyring that's definitely the plan. See #178

Tony Arcieri · Answer 16 · Mon Mar 11 2019 05:23:18 GMT+0800 (China Standard Time)

Double signing tracking was added in #193 (thanks @zaki!) and the rest of this plan (i.e. hook support) was implemented in #205.

It will be released in 0.5.0. I plan on having 0.5.0-beta1 out later today.