tendermint / tmkms

Key Management service for Tendermint Validator nodes

Double-signing prevention (MVP for launch)

tarcieri opened this issue · comments

This is a tracking issue for KMS double-signing prevention.

Launch goal: attempt to prevent validator bugs or data loss from causing the validator to double sign by making the KMS aware of the current block height.

Longer-term goal: provide defensive capabilities / survive compromise of validator hosts. See #115 for discussion of post-launch double signing improvements.

Previous discussion:

Current Status

  • The main blocker for implementing this right now is #111 - every key in the KMS needs to be tagged with a tendermint::chain::Id, so the KMS can look up the tendermint::block::Height for that particular chain.
  • signatory-ledger-cosval signer can track current block height (i.e. it passes the current block height to the Cosmos app running on the Ledger hardware device which persists the previous signed block height), and will refuse to double-sign.

Launch Plan

Add support to the KMS for a user-configurable subcommand for obtaining the current block height. This can be used to "bootstrap" a block height value when a KMS process is started. From there, the KMS can track the last block it signed.
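A minimal sketch of that bootstrap step, assuming the configured subcommand prints a single integer height on stdout. The function name `bootstrap_height` and the stand-in command are hypothetical; the actual KMS hook interface is not specified in this issue.

```python
import subprocess
import sys

# Stand-in for a user-configured command (hypothetical). A real deployment
# would point this at a script that queries the validator's sentries.
EXAMPLE_COMMAND = [sys.executable, "-c", "print(12345)"]


def bootstrap_height(command, timeout=10.0):
    """Run the configured command and parse its stdout as a block height."""
    result = subprocess.run(
        command,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # a non-zero exit status raises CalledProcessError
    )
    height = int(result.stdout.strip())  # raises ValueError on non-numeric output
    if height < 0:
        raise ValueError(f"negative block height: {height}")
    return height


if __name__ == "__main__":
    print(bootstrap_height(EXAMPLE_COMMAND))
```

Failing hard on a non-zero exit status or unparseable output seems safer here than guessing a height, since a wrong starting value would defeat the check.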

For example, the KMS could call out to a shell script which hits the /status RPC endpoint for a validator's sentries, piping the output through e.g. jq to extract "latest_block_height" and sorting the results, taking the highest value. An example script can be included in the KMS repo which people can customize to their needs.
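As an illustration of the same idea without shell and jq, here is a rough Python sketch. It assumes the /status response nests the value at `result.sync_info.latest_block_height` and that it may be encoded as a string; that layout, and the helper names, are assumptions rather than something stated in this issue.

```python
import json
import urllib.request


def fetch_status(base_url, timeout=5.0):
    """Fetch and decode the /status response from one sentry's RPC endpoint."""
    url = base_url.rstrip("/") + "/status"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def latest_height(status):
    """Extract "latest_block_height" from a decoded /status response.

    The nesting under result.sync_info is an assumption about the layout.
    """
    return int(status["result"]["sync_info"]["latest_block_height"])


def highest_height(statuses):
    """Return the highest latest_block_height across all sentry responses."""
    heights = [latest_height(status) for status in statuses]
    if not heights:
        raise ValueError("no sentry status responses available")
    return max(heights)
```

A caller would build `statuses` with something like `[fetch_status(url) for url in sentry_urls]`, where `sentry_urls` is whatever list of sentry RPC addresses the validator operator maintains.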

This should allow validators to choose whatever mechanism they like for providing the KMS with the current block height, and implement e.g. storing the current block height in external databases as proposed in #11.

Longer-Term Plan

See #115.

Just a short clarification: signatory-ledger-cosval is agnostic to these concepts, it will just pass messages to the ledger device. It is actually the ledger app/firmware that will track current block height and round.

The device only signs blocks in incremental order. To define the initial/current block height, the block height of the first KMS request after the device is plugged in is used as a reference. Setting this initial value requires manual user confirmation on the Ledger device.

A similar approach could be used by signatory providers. A good idea would be to create a signatory provider wrapper with this functionality.

The device only signs blocks in incremental order.

@jleni does it just preserve monotonicity, or does it require each signed block immediately follow the previous one?

I am wondering about things like failover.

This behavior follows the spec. The device signs in monotonic order; heights do not need to be sequential.
For instance:

  • n, n+1, n+3 (all are signed)
  • n, n+2, n+1, n+3 (only n+1 is rejected)
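The rule in these two examples can be sketched as a small guard that accepts any height strictly greater than the last one it signed. This is only an illustration of the monotonic check; the class name is hypothetical, and it tracks height alone, whereas the Ledger app is described above as tracking round as well.

```python
class MonotonicHeightGuard:
    """Accept a height only if it is strictly greater than the last signed one."""

    def __init__(self):
        self.last_signed = None  # no reference height until the first request

    def try_sign(self, height):
        """Return True and record the height if signing is allowed, else False."""
        if self.last_signed is not None and height <= self.last_signed:
            return False  # same or older height: refuse to sign
        self.last_signed = height
        return True


def run(heights):
    """Feed a sequence of heights to a fresh guard and collect the decisions."""
    guard = MonotonicHeightGuard()
    return [guard.try_sign(height) for height in heights]
```

With n = 10, `run([10, 11, 13])` accepts every height, while `run([10, 12, 11, 13])` rejects only 11.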

Adding failover/HA to the KMS is an interesting follow-up. It might actually need KMS/signatory arbitration plus something like Raft (consensus) to handle these cases.

My understanding - please correct me if wrong - is that HSM2 double signing prevention will be implemented by tracking the last signed height, which is persisted in one of the slots of the device.

KMS will then need to update this slot before signing each block, and should ideally read the data back to ensure it was stored correctly.

To make this as robust as possible, if the update/read cycle fails, KMS should complain loudly, but continue operating in a degraded state. It could still prevent double signing by using locally cached data, and I guess(?) the HSM2 might still continue signing.
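To make the proposal concrete, here is a rough sketch of the update/read-back cycle with a degraded fallback. The store interface (`write`/`read`), the class names, and the use of `OSError` for device failures are all hypothetical; nothing here reflects an actual YubiHSM2 API.

```python
import logging

log = logging.getLogger("kms.height")


class InMemoryStore:
    """Stand-in for a persistent slot on the device (hypothetical interface)."""

    def __init__(self):
        self.value = None

    def write(self, height):
        self.value = height

    def read(self):
        return self.value


class VerifiedHeightTracker:
    """Persist the height before signing, verify it, and fall back to a cache."""

    def __init__(self, store):
        self.store = store
        self.cached_height = None  # local copy used when the store misbehaves
        self.degraded = False

    def allow_signing(self, height):
        """Return True if the block at this height may be signed."""
        if self.cached_height is not None and height <= self.cached_height:
            return False  # the cache alone is enough to refuse a double sign
        self.cached_height = height
        try:
            self.store.write(height)
            if self.store.read() != height:
                raise OSError("read-back mismatch")
            self.degraded = False
        except OSError as exc:
            # Complain loudly, but keep operating on the cached value.
            log.error("height persistence failed, running degraded: %s", exc)
            self.degraded = True
        return True
```

In the degraded state the check survives only as long as the process does, so a restart would still need some other source for the last signed height.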

Apologies if this seems trivial, but I thought it was worth stressing since I couldn't find any MTBF data on the HSM2. Validator failure due to write wear on the HSM would be the worst.

TL;DR - the failure characteristics of the underlying devices (YubiHSM2, Ledger etc.) should be carefully considered. They might not be designed for write intensive operations, including double signing prevention, and could wear out.

@mdyring that's a good point. I will investigate that before I go forward with this approach.

Block number/rounds are tracked in RAM, not NVRAM, to avoid this issue. When the device is plugged in, it will skip the first votes and ask the user for confirmation to align with the current values. I hope this answers your question.

That seems like a sensible solution. In case KMS needs to be restarted, would this require physical access to the ledger device for confirmation? Ideally it should be possible to restart/update software remotely.

Alternatively, KMS could signal a "clean shutdown" to the device which can then write to nvram and use that information on next start? (this could be useful in cases where a server needs to be power cycled)

We talked with Yubico about wear on the flash. One of their reps suggested it wouldn't be an issue, although it's something I'm loath to risk without a precise MTBF. The last thing we want is a bunch of validators dying because they all wore out their flash at roughly the same time.

Alternatively, KMS could signal a "clean shutdown" to the device which can then write to nvram and use that information on next start? (this could be useful in cases where a server needs to be power cycled)

This case won't work with the Ledger, since a power-cycle will restart the Ledger. In order to unlock the ledger you have to physically enter your PIN, which means you have to be at the datacenter anyway. That's why persistent storage for height/round is not as important with the ledger.

I'm just about ready to (finally) start work on this. Here is a tentative plan:

  1. Add a configuration section to tmkms.toml for Tendermint blockchain networks the KMS is operating on. This can include the Bech32 prefixes used by that network (addressing #178, although it's still unclear if all Tendermint networks will adopt Bech32)
  2. Make chain ID information configured for keys first-class (it's presently a string). This will tie signing keys directly to specific chains (addressing #111, which is something of a showstopper security issue)
  3. Persist state files for each network containing e.g. the block height. The exact details of this are still TBD.

There was some earlier discussion of persisting this information in e.g. YubiHSM2's opaque data. Due to concerns about write wear, I don't think this is a good idea.

Another alternative would be introducing some kind of embedded database, e.g. sled, LevelDB, or LMDB. That seems like a complicated change to introduce right now, and also something where the other potential / future persistence needs of the KMS need to be considered.

The existing priv_validator_state.json files provide at least something of a known quantity people are familiar with. I would suggest implementing a similar approach (but with at least one of these files per Tendermint network/chain). This is a KISS solution that can potentially be replaced by an embedded (or non-embedded) backend database, but only once we're actually ready to cross that bridge.
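As a rough sketch of what one state file per chain could look like, the snippet below writes a small JSON file per chain ID and replaces it atomically. The file naming scheme and the field names are hypothetical, since the exact details are still TBD, and this is not the priv_validator_state.json format.

```python
import json
import os
import tempfile
from pathlib import Path


def state_path(state_dir, chain_id):
    """One state file per chain; the naming scheme is an assumption."""
    return Path(state_dir) / f"{chain_id}_state.json"


def save_height(state_dir, chain_id, height):
    """Write the state to a temp file, sync it, then atomically replace."""
    path = state_path(state_dir, chain_id)
    fd, tmp_name = tempfile.mkstemp(dir=str(state_dir), suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump({"chain_id": chain_id, "height": height}, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes reach the disk
        os.replace(tmp_name, path)  # readers never observe a partial file
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_height(state_dir, chain_id):
    """Return the persisted height for a chain, or None if no file exists."""
    path = state_path(state_dir, chain_id)
    if not path.exists():
        return None
    with open(path) as state_file:
        return int(json.load(state_file)["height"])
```

Writing to a temporary file and renaming it avoids leaving a half-written state file behind if the process dies mid-write, which matters more here than in most config handling because a corrupt file would remove the double-signing check.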

With respect to 2:
Can we have something like this? #177 (comment)

keys = [{ id = "gaia-6000", pubkey="123....", key = 1 }]
keys = [{ id = "gaia-7000", pubkey="123....", key = 2 }]
keys = [{ id = "gaia-9000", pubkey="456....", key = 1 }]

This would be more secure and would allow supporting several devices connected to the same KMS.

@jleni I'm not sure that syntax is valid TOML... it seems like you want:

[[keys]]
id = "gaia-6000"
pubkey="123...."
key = 1

[[keys]]
id = "gaia-7000"
pubkey="123...."
key = 2

?

I'd agree the config syntax needs changes, but the main thing that needs to be solved, particularly in the context of this issue, is expressing an m:n mapping between Tendermint networks/chains and keys.

With your proposed syntax, I think we'd need at least:

{ key = 1, chains = ["gaia-6000", "gaia-7000"], ... }
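To illustrate the m:n mapping, here is a small sketch that indexes key entries of that shape by chain and checks whether a key is allowed to sign for a given chain. The entries are modeled as plain Python dicts; the field names follow the fragment above, while the helper names and the second key entry are made up for the example.

```python
from collections import defaultdict

# Example key entries (hypothetical values, same shape as the fragment above).
KEYS = [
    {"key": 1, "chains": ["gaia-6000", "gaia-7000"]},
    {"key": 2, "chains": ["gaia-7000"]},
]


def keys_by_chain(keys):
    """Build a chain ID -> set of key IDs index from the key entries."""
    index = defaultdict(set)
    for entry in keys:
        for chain_id in entry["chains"]:
            index[chain_id].add(entry["key"])
    return dict(index)


def may_sign(index, chain_id, key_id):
    """A key may sign only for chains it is explicitly configured for."""
    return key_id in index.get(chain_id, set())
```

Here key 1 serves two chains and chain gaia-7000 is served by two keys, which is the many-to-many case a flat one-chain-per-key syntax cannot express.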

Regarding support for multiple chains, it would be very nice if a single tmkms instance could support multiple (tendermint based, running compatible version, etc.) projects, such as IRIS, Cosmos, IOV, etc.

Could it be something as simple as making the HRP of bech32 addresses configurable?

@mdyring that's definitely the plan. See #178

Double signing tracking was added in #193 (thanks @zaki!) and the rest of this plan (i.e. hook support) was implemented in #205.

It will be released in 0.5.0. I plan on having 0.5.0-beta1 out later today.