Onyx-Protocol / Onyx

Home Page:https://Onyx.org

net/raft: panic joining existing cluster

jbowens opened this issue

Steps I followed:

  1. Start LISTEN=localhost:1999 cored.
  2. Run corectl init to initialize the raft cluster.
  3. Run corectl config-generator to configure Chain Core as a generator.
  4. Run corectl allow-address localhost:1998 to prepare for a second cored instance.
  5. Start LISTEN=localhost:1998 CHAIN_CORE_HOME=~/.chaincore2 cored
  6. Run CORE_URL=https://localhost:1998 corectl join
  7. Stop the localhost:1999 cored.
  8. Restart the localhost:1998 cored.

Stack trace:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x17a6200]

goroutine 23 [running]:
chain/vendor/github.com/coreos/etcd/raft.(*raft).appendEntry(0xc4202980f0, 0xc420075c70, 0x1, 0x1)
    /Users/jackson/src/chain/vendor/github.com/coreos/etcd/raft/raft.go:520 +0x220
chain/vendor/github.com/coreos/etcd/raft.(*raft).becomeLeader(0xc4202980f0)
    /Users/jackson/src/chain/vendor/github.com/coreos/etcd/raft/raft.go:620 +0x431
chain/vendor/github.com/coreos/etcd/raft.(*raft).campaign(0xc4202980f0, 0x1bb1ba7, 0x10)
    /Users/jackson/src/chain/vendor/github.com/coreos/etcd/raft/raft.go:643 +0x959
chain/vendor/github.com/coreos/etcd/raft.(*raft).Step(0xc4202980f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /Users/jackson/src/chain/vendor/github.com/coreos/etcd/raft/raft.go:753 +0x1510
chain/vendor/github.com/coreos/etcd/raft.(*node).run(0xc420156fc0, 0xc4202980f0)
    /Users/jackson/src/chain/vendor/github.com/coreos/etcd/raft/node.go:323 +0x64f
created by chain/vendor/github.com/coreos/etcd/raft.RestartNode
    /Users/jackson/src/chain/vendor/github.com/coreos/etcd/raft/node.go:223 +0x390

Here's a zip of the Chain Core data directory. The same panic is hit on every restart.
chaincorehome-repro.zip

It looks like the replicated snapshot was taken before the new node was added to the node list, and there are zero entries in the WAL after that snapshot.

Recovering from snapshot: index 9, term 3
Snapshot ConfState: raftpb.ConfState{Nodes:[]uint64{0x1}, XXX_unrecognized:[]uint8(nil)}
Appending 0 entries not in snapshot.
raft: INFO: 33 became follower at term 0
raft: INFO: newRaft 33 [peers: [1], term: 0, commit: 9, applied: 9, lastindex: 9, lastterm: 3]
raft: INFO: 33 is starting a new election at term 0
raft: INFO: 33 became candidate at term 1
raft: INFO: 33 received MsgVoteResp from 33 at term 1
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x179b0b0]
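
The log above suggests why a lone node can both win the election and then crash: with peers [1] the quorum is a single vote, yet node 33 has no entry for itself in the peer list. Below is a minimal Go sketch of that failure mode. It assumes raft keeps per-peer progress in a map keyed by node ID and that appendEntry updates the node's own entry; the `node` and `progress` types are hypothetical stand-ins, not the vendored etcd code.

```go
package main

import "fmt"

// progress is a hypothetical stand-in for raft's per-peer replication state.
type progress struct{ match uint64 }

// maybeUpdate advances match; it dereferences p, so a nil p panics.
func (p *progress) maybeUpdate(n uint64) {
	if p.match < n {
		p.match = n
	}
}

// node is a hypothetical, stripped-down raft node.
type node struct {
	id  uint64
	prs map[uint64]*progress // assumed: progress keyed by node ID
}

// newNode builds a node whose progress map contains only the given peers,
// mirroring a ConfState restored from a snapshot.
func newNode(id uint64, peers []uint64) *node {
	n := &node{id: id, prs: make(map[uint64]*progress)}
	for _, p := range peers {
		n.prs[p] = &progress{}
	}
	return n
}

// quorum is the number of votes needed to win an election.
func (n *node) quorum() int { return len(n.prs)/2 + 1 }

// appendEntry updates the node's own progress. It assumes the node is in
// its own peer map; the panic is recovered here only so the sketch can
// report it instead of crashing.
func (n *node) appendEntry(lastIndex uint64) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	n.prs[n.id].maybeUpdate(lastIndex) // nil pointer if n.id is not in prs
	return nil
}

func main() {
	// Node 33 (ID as printed in the log) restored with ConfState Nodes: [1].
	n := newNode(0x33, []uint64{0x1})
	fmt.Println("quorum:", n.quorum())
	fmt.Println("self tracked:", n.prs[n.id] != nil)
	fmt.Println("appendEntry:", n.appendEntry(10))
}
```

In this sketch the node's own vote satisfies the quorum, so it proceeds to become leader, and the first append then dereferences a missing entry. If the snapshot's node list had included the new node, the lookup would succeed.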

Judging by etcdserver's code, I think we should be using StartNode in Join, not RestartNode.

https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L299-L323
https://github.com/coreos/etcd/blob/master/etcdserver/raft.go#L381-L420

But that might be a meaningless distinction if we would just pass in nil for peers anyway.
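
For illustration, here is a rough sketch of the kind of startup decision the linked etcdserver code appears to make, based on whether a WAL already exists and whether the node is forming a new cluster. This is a reading of that code, not a copy of it: `bootMode`, `chooseBoot`, and the mode names are hypothetical, and which raft constructor each branch ends up calling is deliberately left out.

```go
package main

import "fmt"

// bootMode is a hypothetical label for how a node should come up.
type bootMode int

const (
	bootNewCluster   bootMode = iota // no WAL, forming a new cluster: initial peer list is known
	bootJoinExisting                 // no WAL, joining an existing cluster: no peer list passed (nil)
	bootRestart                      // WAL present: recover state from snapshot and WAL
)

// String makes the modes readable when printed.
func (m bootMode) String() string {
	switch m {
	case bootNewCluster:
		return "new-cluster"
	case bootJoinExisting:
		return "join-existing"
	case bootRestart:
		return "restart"
	}
	return "unknown"
}

// chooseBoot picks a mode from two facts about the node (assumed inputs).
func chooseBoot(haveWAL, newCluster bool) bootMode {
	switch {
	case haveWAL:
		return bootRestart
	case newCluster:
		return bootNewCluster
	default:
		return bootJoinExisting
	}
}

func main() {
	fmt.Println(chooseBoot(false, true))  // first node, after init
	fmt.Println(chooseBoot(false, false)) // second node, during join
	fmt.Println(chooseBoot(true, false))  // second node, restarted later
}
```

Under this reading, the restart in step 8 takes the WAL branch regardless of how the join itself started the node, so the contents of the replicated snapshot still matter.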

Fixed in #1332 and #1335.