net/raft: panic joining existing cluster
jbowens opened this issue · comments
Steps I followed:
- Start `LISTEN=localhost:1999 cored`.
- Run `corectl init` to initialize the raft cluster.
- Run `corectl config-generator` to configure Chain Core as a generator.
- Run `corectl allow-address localhost:1998` to prepare for a second cored instance.
- Start `LISTEN=localhost:1998 CHAIN_CORE_HOME=~/.chaincore2 cored`.
- Run `CORE_URL=https://localhost:1998 corectl join`.
- Stop the localhost:1999 cored.
- Restart the localhost:1998 cored.
Stack trace:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x17a6200]
goroutine 23 [running]:
chain/vendor/github.com/coreos/etcd/raft.(*raft).appendEntry(0xc4202980f0, 0xc420075c70, 0x1, 0x1)
/Users/jackson/src/chain/vendor/github.com/coreos/etcd/raft/raft.go:520 +0x220
chain/vendor/github.com/coreos/etcd/raft.(*raft).becomeLeader(0xc4202980f0)
/Users/jackson/src/chain/vendor/github.com/coreos/etcd/raft/raft.go:620 +0x431
chain/vendor/github.com/coreos/etcd/raft.(*raft).campaign(0xc4202980f0, 0x1bb1ba7, 0x10)
/Users/jackson/src/chain/vendor/github.com/coreos/etcd/raft/raft.go:643 +0x959
chain/vendor/github.com/coreos/etcd/raft.(*raft).Step(0xc4202980f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/Users/jackson/src/chain/vendor/github.com/coreos/etcd/raft/raft.go:753 +0x1510
chain/vendor/github.com/coreos/etcd/raft.(*node).run(0xc420156fc0, 0xc4202980f0)
/Users/jackson/src/chain/vendor/github.com/coreos/etcd/raft/node.go:323 +0x64f
created by chain/vendor/github.com/coreos/etcd/raft.RestartNode
/Users/jackson/src/chain/vendor/github.com/coreos/etcd/raft/node.go:223 +0x390
Here's a zip of the Chain Core data directory. The same panic is hit on every restart.
chaincorehome-repro.zip
Looks like the replicated snapshot was taken before the new node was added to the node list, and there are also zero entries in the WAL after that snapshot, so nothing on disk ever adds the new node to the configuration.
Recovering from snapshot: index 9, term 3
Snapshot ConfState: raftpb.ConfState{Nodes:[]uint64{0x1}, XXX_unrecognized:[]uint8(nil)}
Appending 0 entries not in snapshot.
raft: INFO: 33 became follower at term 0
raft: INFO: newRaft 33 [peers: [1], term: 0, commit: 9, applied: 9, lastindex: 9, lastterm: 3]
raft: INFO: 33 is starting a new election at term 0
raft: INFO: 33 became candidate at term 1
raft: INFO: 33 received MsgVoteResp from 33 at term 1
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x179b0b0]
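One way to read the log above: the snapshot's ConfState lists only node 1, so node 33 comes back with a peer set that does not include itself, wins the election on its own vote, and then dereferences a per-node entry for itself that was never created. Here is a minimal, self-contained Go sketch of that failure shape. It is not the etcd code; `progress`, `buildProgress`, and `selfTracked` are hypothetical names, and the ID `0x33` assumes the raft log prints node IDs in hex.

```go
package main

import "fmt"

// progress stands in for the per-peer replication state a leader tracks.
// (Hypothetical type for illustration; not etcd's actual struct.)
type progress struct{ match uint64 }

// buildProgress creates one entry per node listed in the snapshot's ConfState.
func buildProgress(confNodes []uint64) map[uint64]*progress {
	prs := make(map[uint64]*progress, len(confNodes))
	for _, id := range confNodes {
		prs[id] = &progress{}
	}
	return prs
}

// selfTracked reports whether the local node has its own progress entry.
func selfTracked(prs map[uint64]*progress, self uint64) bool {
	_, ok := prs[self]
	return ok
}

func main() {
	// ConfState from the log above: Nodes:[]uint64{0x1}.
	prs := buildProgress([]uint64{0x1})
	const self = 0x33 // assumed hex reading of "33" in the raft log

	if !selfTracked(prs, self) {
		// Without this guard, prs[self].match = ... would be a nil pointer dereference.
		fmt.Printf("node %x is missing from the snapshot's node list\n", self)
		return
	}
	prs[self].match = 9
}
```

A guard like this would only turn the panic into a clearer error; it would not fix the missing configuration entry.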
Judging by etcdserver's code, I think we should be using StartNode in Join, not RestartNode.
https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L299-L323
https://github.com/coreos/etcd/blob/master/etcdserver/raft.go#L381-L420
But that might be a meaningless distinction if we would just pass in nil for peers anyway.
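For reference, my reading of the linked etcdserver code is that it picks the entry point from two facts: whether a local WAL already exists, and whether the node is bootstrapping a new cluster. A rough Go sketch of that decision follows. The function and mode names are hypothetical, and the mapping is an assumption based on that reading, not a copy of etcd's logic.

```go
package main

import "fmt"

// startMode names which kind of raft startup a node would perform.
// (Hypothetical labels for illustration.)
type startMode string

const (
	modeFreshBootstrap startMode = "fresh start, with initial peers"
	modeFreshJoin      startMode = "fresh start, joining an existing cluster"
	modeRestart        startMode = "restart from local WAL and snapshot"
)

// chooseStartMode picks a startup mode from local state.
// Assumption: a node with no WAL has never run, so it should not "restart".
func chooseStartMode(haveWAL, newCluster bool) startMode {
	switch {
	case haveWAL:
		return modeRestart
	case newCluster:
		return modeFreshBootstrap
	default:
		return modeFreshJoin
	}
}

func main() {
	// The second cored in the steps above: no prior WAL, joining an existing cluster.
	fmt.Println(chooseStartMode(false, false))
}
```

Under this reading, our Join path corresponds to the "fresh start, joining" case, while the later process restart that panics is the "restart" case.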