cosmos / cosmos-sdk

:chains: A Framework for Building High Value Public Blockchains :sparkles:

Home Page:https://cosmos.network/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Consensus failure after validator is slashed

kidinamoto01 opened this issue · comments

My validator (Add: 295C0821D6D2EC71772E86773CD7F46F072CB764) is supposed to got slashed, but somehow it still send out the pre-vote messages on the same height, then the network has a consensus failure.

consensus state:

{
  "jsonrpc": "2.0",
  "id": "",
  "result": {
    "round_state": {
      "height/round/step": "60522/0/8",
      "start_time": "2018-06-10T20:25:16.272777362+08:00",
      "proposal_block_hash": "1609FBFF1CFC8AC12253C521DBE75D4D074BC0A8",
      "locked_block_hash": "1609FBFF1CFC8AC12253C521DBE75D4D074BC0A8",
      "valid_block_hash": "",
      "height_vote_set": [
        {
          "round": 0,
          "prevotes": [
            "Vote{0:09D7D7DFF2F5 60522/00/1(Prevote) 1609FBFF1CFC /3F0AB105005E.../ @ 2018-06-10T12:13:39.110Z}",
            "Vote{1:295C0821D6D2 60522/00/1(Prevote) 1609FBFF1CFC /5756233AEB09.../ @ 2018-06-10T12:25:15.676Z}",
            "Vote{2:38E4E59F5A12 60522/00/1(Prevote) 1609FBFF1CFC /BF493716F056.../ @ 2018-06-10T12:13:38.974Z}",
            "Vote{3:489222C1FA4C 60522/00/1(Prevote) 1609FBFF1CFC /239C2C75FD97.../ @ 2018-06-10T12:13:39.160Z}",
            "Vote{4:8C4644F001AF 60522/00/1(Prevote) 1609FBFF1CFC /5A20C1D8D21F.../ @ 2018-06-10T12:13:39.082Z}",
            "Vote{5:942BC57C8B90 60522/00/1(Prevote) 1609FBFF1CFC /2AE8B2FA50CC.../ @ 2018-06-10T12:13:38.772Z}",
            "Vote{6:94F711AE2B54 60522/00/1(Prevote) 1609FBFF1CFC /FEEAA88700AA.../ @ 2018-06-10T12:13:39.114Z}",
            "Vote{7:A1BDEA32E5B9 60522/00/1(Prevote) 1609FBFF1CFC /B5834B3C65F3.../ @ 2018-06-10T12:13:39.113Z}",
            "Vote{8:BF03F34B2A17 60522/00/1(Prevote) 1609FBFF1CFC /BF00979293FE.../ @ 2018-06-10T12:13:39.188Z}",
            "Vote{9:C59779BF393E 60522/00/1(Prevote) 1609FBFF1CFC /7A41CF06DFA7.../ @ 2018-06-10T12:13:39.066Z}",
            "Vote{10:D4DA32C97268 60522/00/1(Prevote) 1609FBFF1CFC /CD61F18401FB.../ @ 2018-06-10T12:13:37.764Z}",
            "Vote{11:D6DFA0D120BA 60522/00/1(Prevote) 1609FBFF1CFC /5655D73B6502.../ @ 2018-06-10T12:13:39.087Z}",
            "Vote{12:E647CEB57BC6 60522/00/1(Prevote) 1609FBFF1CFC /D3886CFA3886.../ @ 2018-06-10T12:13:39.131Z}"
          ],
          "prevotes_bit_array": "BA{13:xxxxxxxxxxxxx} 12114/12114 = 1.00",
          "precommits": [
            "Vote{0:09D7D7DFF2F5 60522/00/2(Precommit) 1609FBFF1CFC /CDBB93CCBC01.../ @ 2018-06-10T12:13:39.434Z}",
            "Vote{1:295C0821D6D2 60522/00/2(Precommit) 1609FBFF1CFC /9E130BFC03D9.../ @ 2018-06-10T12:25:15.681Z}",
            "nil-Vote",
            "Vote{3:489222C1FA4C 60522/00/2(Precommit) 1609FBFF1CFC /D26AC74E26C9.../ @ 2018-06-10T12:13:39.411Z}",
            "Vote{4:8C4644F001AF 60522/00/2(Precommit) 1609FBFF1CFC /9881D80BA1D8.../ @ 2018-06-10T12:13:39.382Z}",
            "nil-Vote",
            "Vote{6:94F711AE2B54 60522/00/2(Precommit) 1609FBFF1CFC /A7DBBCB2DDFB.../ @ 2018-06-10T12:13:39.367Z}",
            "Vote{7:A1BDEA32E5B9 60522/00/2(Precommit) 1609FBFF1CFC /132B436A9790.../ @ 2018-06-10T12:13:39.394Z}",
            "nil-Vote",
            "Vote{9:C59779BF393E 60522/00/2(Precommit) 1609FBFF1CFC /C77FF65C9699.../ @ 2018-06-10T12:13:39.406Z}",
            "Vote{10:D4DA32C97268 60522/00/2(Precommit) 1609FBFF1CFC /859EC03A0A03.../ @ 2018-06-10T12:13:38.026Z}",
            "Vote{11:D6DFA0D120BA 60522/00/2(Precommit) 1609FBFF1CFC /D1A6D6F6810F.../ @ 2018-06-10T12:13:39.443Z}",
            "nil-Vote"
          ],
          "precommits_bit_array": "BA{13:xx_xx_xx_xxx_} 8111/12114 = 0.67"
        },
        {
          "round": 1,
          "prevotes": [
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote"
          ],
          "prevotes_bit_array": "BA{13:_____________} 0/12114 = 0.00",
          "precommits": [
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote",
            "nil-Vote"
          ],
          "precommits_bit_array": "BA{13:_____________} 0/12114 = 0.00"
        }
      ]
    }
  }
}

related error message:

I[06-10|13:42:11.017] Absent validator 295C0821D6D2EC71772E86773CD7F46F072CB764 at height 60522 module=x/slashing
I[06-10|13:42:11.017] Validator 295C0821D6D2EC71772E86773CD7F46F072CB764 past min height of 100 and below signed blocks threshold of 50 module=x/slashing
panic: validator record not found for address: [211 220 15 245 159 124 59 84 139 122 250 54 85 97 184 127 208 32 138 248]


goroutine 1 [running]:
github.com/cosmos/cosmos-sdk/x/stake.Keeper.updateBondedValidators(0xed0c60, 0xc4209ad880, 0xc4209625a0, 0xed0c60, 0xc4209ad860, 0xedb2c0, 0xc420a54dc0, 0xc4209625a0, 0x4, 0xed7280, ...)

After an extensive debugging session, we were able to track the problem to a bug in the staking data model.

We are making some of the debugging tooling available in a new gaiadebug command (#1232).

Note the address here:

panic: validator record not found for address: [211 220 15 245 159 124 59 84 139 122 250 54 85 97 184 127 208 32 138 248]

Corresponds to D3DC0FF59F7C3B548B7AFA365561B87FD0208AF8. We refer to this address here as trouble

Summary

We observed the following sequence of events in the blockchain:

  • Height 46064 - trouble declares candidacy
  • Height 46067 - trouble gets auto-unbonded (due to #1199)
  • Height 53832 - trouble unbonds

At this point, trouble's validator record is successfully removed from the store but a secondary record gets left behind. This violates an invariant that is supposed to be preserved.

  • Height 60522 - an independent validator get's auto-unbonded

At this point, we loop through that set of secondary records and for each one attempt to load the primary record. We include a sanity check to ensure that the primary record exists. But since the primary record got deleted while the secondary one did not, the sanity check failed and we paniced with validator record not found for address.

The reason the secondary record did not get deleted is as follows:

  • The secondary record indexes validators by the power, in terms of their bonded shares

  • On gaia-6001, inflationary steak is directly bonded, increasing the exchange rate of atoms to shares

  • This exchange rate is used in computing the power for the secondary index

  • Since the exchange rate is changing due to inflation, eventually the power is incorrectly computed

  • Height 50511 - the exchange rate changes sufficiently to change the computed power for the trouble validator

Since the computed power changes, when trouble finally unbonds at height 53832, we compute the wrong key for the secondary record and therefore fail to remove it.

At height 60522, when the other validator unbonds and we loop through the secondary records, we find this stale record fails the sanity check, so we panic.

Please note we are taking an explicit "FAIL CLOSED" approach to the software - we include liberal sanity checks in the code to ensure certain invariants aren't being violated. This ensures that we discover bugs sooner than later and halt the blockchain to fix them, rather than the chain continuing to run and be potentially exploited.

Conclusion

Independent of this bug, the staking specification was updated to not have inflationary coins automatically bonded. This would remove the relevance of the exchange rate and actually prevent this kind of bug.

We will make an immediate release that has some minor fixes and also removed bonded inflation for the time being, in order to allow a new testnet to restart. Note this release will be breaking and will require a new chain-id and restarting from the genesis block.

In the near-term, we will release a new version of the staking module that reflects the new design.

Additionally, we are working on randomized testing to catch more bugs sooner.

We are also beginning a careful code review in an effort to improve the structure of the code.

Thanks @ebuchman for the root cause analysis, that was very helpful. Will this be linked to in release notes? If not I think this should be better documented by having a dedicated place for it, release notes would be a good place for that.

It's already been added to the STATUS.md: https://github.com/cosmos/cosmos-sdk/blob/master/cmd/gaia/testnets/STATUS.md#june-13-2018-230-est---published-postmortem-of-gaia-6001-failure

And yes, we will also link to it from the new release notes.

Think that's all sufficient? Or should we copy it for instance into a markdown file in the testnets/gaia-6001 folder ?

I think just linking to this somewhere intuitive or that people would normally check between releases (e.g., release notes) would allow validators to understand what got into a new release and what was wrong with a previous release or why something was removed/added. This is quite sufficient, at least for me, thanks.

Fixed in v0.19.0 on master. See https://github.com/cosmos/cosmos-sdk/tree/master/cmd/gaia/testnets for connecting to the new gaia-6002.

Thanks all!