etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system

Home Page:https://etcd.io

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Server runtime error when execute Watch command

marouane-kech opened this issue · comments

I try to test the wath command with ( curl -L http://127.0.0.1:4001/v2/keys/message?wait=true ) and I do the same test with python-client. I got the same result:
the etcd instance crash with exeption :

panic serving 127.0.0.1:32907: runtime error: invalid memory address or nil pointer dereference
goroutine 758 [running]:
net/http.func·007()
/usr/local/go/src/pkg/net/http/server.go:1022 +0x9f
sync/atomic.AddUint64()
/usr/local/go/src/pkg/sync/atomic/asm_386.s:69 +0xc
github.com/coreos/etcd/store.(_watcherHub).watch(0x1868b4c0, 0x20186080, 0x8, 0x0, 0x23c, ...)
/home/tprojadmin/etcd/src/github.com/coreos/etcd/store/watcher_hub.go:71 +0x165
github.com/coreos/etcd/store.(_store).Watch(0x1868e480, 0x20186080, 0x8, 0x82a0400, 0x0, ...)
/home/tprojadmin/etcd/src/github.com/coreos/etcd/store/store.go:295 +0x151
github.com/coreos/etcd/server/v2.GetHandler(0x1e53be20, 0x1d2a3c80, 0x1d254310, 0x1ec206f0, 0x18691540, ...)
/home/tprojadmin/etcd/src/github.com/coreos/etcd/server/v2/get_handler.go:49 +0x455
github.com/coreos/etcd/server.func·003(0x1e53be20, 0x1d2a3c80, 0x1d254310, 0x0, 0x84be0e0, ...)
/home/tprojadmin/etcd/src/github.com/coreos/etcd/server/server.go:141 +0x66
github.com/coreos/etcd/server.func·004(0x1e53be20, 0x1d2a3c80, 0x1d254310)
/home/tprojadmin/etcd/src/github.com/coreos/etcd/server/server.go:162 +0x23b
net/http.HandlerFunc.ServeHTTP(0x1868bda0, 0x1e53be20, 0x1d2a3c80, 0x1d254310)
/usr/local/go/src/pkg/net/http/server.go:1149 +0x3c
github.com/gorilla/mux.(_Router).ServeHTTP(0x18666540, 0x1e53be20, 0x1d2a3c80, 0x1d254310)
/home/tprojadmin/etcd/src/github.com/gorilla/mux/mux.go:90 +0x193
net/http.serverHandler.ServeHTTP(0x18691540, 0x1e53be20, 0x1d2a3c80, 0x1d254310)
/usr/local/go/src/pkg/net/http/server.go:1517 +0x121
net/http.(_conn).serve(0x1f3f5190)
/usr/local/go/src/pkg/net/http/server.go:1096 +0x653
created by net/http.(_Server).Serve
/usr/local/go/src/pkg/net/http/server.go:1564 +0x242
^Ctprojadmin@tproj-devt-04:~/etcd$ ./etcd -data-dir node0 -name node0
[etcd] Dec 2 15:41:49.280 INFO | etcd server [name node0, listen on 127.0.0.1:4001, advertised url http://127.0.0.1:4001]
[etcd] Dec 2 15:42:26.081 INFO | URLs: / node0 (http://127.0.0.1:7001,http://127.0.0.1:7002,http://127.0.0.1:7003)
[etcd] Dec 2 15:42:26.082 WARNING | the entire cluster is down! this peer will restart the cluster.
[etcd] Dec 2 15:42:26.082 INFO | raft server [name node0, listen on 127.0.0.1:7001, advertised url http://127.0.0.1:7001]
2013/12/02 15:43:32 http: panic serving 127.0.0.1:55893: runtime error: invalid memory address or nil pointer dereference
goroutine 781 [running]:
net/http.func·007()
/usr/local/go/src/pkg/net/http/server.go:1022 +0x9f
sync/atomic.AddUint64()
/usr/local/go/src/pkg/sync/atomic/asm_386.s:69 +0xc
github.com/coreos/etcd/store.(_watcherHub).watch(0x1868b4c0, 0x20eb8030, 0x8, 0x0, 0x23c, ...)
/home/tprojadmin/etcd/src/github.com/coreos/etcd/store/watcher_hub.go:71 +0x165
github.com/coreos/etcd/store.(_store).Watch(0x1868e480, 0x20eb8030, 0x8, 0x82a0400, 0x0, ...)
/home/tprojadmin/etcd/src/github.com/coreos/etcd/store/store.go:295 +0x151
github.com/coreos/etcd/server/v2.GetHandler(0x1dc65640, 0x1b1138c0, 0x1aa0dd90, 0x1e48fe10, 0x18691540, ...)
/home/tprojadmin/etcd/src/github.com/coreos/etcd/server/v2/get_handler.go:49 +0x455
github.com/coreos/etcd/server.func·003(0x1dc65640, 0x1b1138c0, 0x1aa0dd90, 0x0, 0x84be0e0, ...)
/home/tprojadmin/etcd/src/github.com/coreos/etcd/server/server.go:141 +0x66
github.com/coreos/etcd/server.func·004(0x1dc65640, 0x1b1138c0, 0x1aa0dd90)
/home/tprojadmin/etcd/src/github.com/coreos/etcd/server/server.go:162 +0x23b
net/http.HandlerFunc.ServeHTTP(0x1868bda0, 0x1dc65640, 0x1b1138c0, 0x1aa0dd90)
/usr/local/go/src/pkg/net/http/server.go:1149 +0x3c
github.com/gorilla/mux.(_Router).ServeHTTP(0x18666540, 0x1dc65640, 0x1b1138c0, 0x1aa0dd90)
/home/tprojadmin/etcd/src/github.com/gorilla/mux/mux.go:90 +0x193
net/http.serverHandler.ServeHTTP(0x18691540, 0x1dc65640, 0x1b1138c0, 0x1aa0dd90)
/usr/local/go/src/pkg/net/http/server.go:1517 +0x121
net/http.(_conn).serve(0x1f3d91e0)
/usr/local/go/src/pkg/net/http/server.go:1096 +0x653
created by net/http.(_Server).Serve
/usr/local/go/src/pkg/net/http/server.go:1564 +0x242

Some one have a solution please ?

@marouane-kech Hmm... Are you on a 32bit machine? If it is, then seems like a "bug" in golang. I cannot reproduce your problem on my current machine. But I will find a 32bit machine to give a try.

We have a struct like

type watcherHub struct {
    watchers     map[string]*list.List
    count        int64 // current number of watchers.
    EventHistory *EventHistory
}

When you access count via 64bit operation, it panics.
I guess the problem is that the struct is not 64bit aligned on your machine.
Can you try to move count to the first like

type watcherHub struct {
    count        int64 // current number of watchers.
    watchers     map[string]*list.List
    EventHistory *EventHistory
}

@marouane-kech Oh. I found a similar issue on golang forum. https://code.google.com/p/go/issues/detail?id=6404. I am going to fix it now.

@xiangli-cmu Is this already fixed in go1.2 maybe?

On Mon, Dec 2, 2013 at 8:13 AM, Xiang Li notifications@github.com wrote:

@marouane-kech https://github.com/marouane-kech Oh. I found a similar
issue on golang forum. https://code.google.com/p/go/issues/detail?id=6404.
I am going to fix it now.


Reply to this email directly or view it on GitHubhttps://github.com//issues/358#issuecomment-29630859
.

@xiangli-cmu Yes I have a 32bit machine, I do the change like you said in watcherHub struct (file watcher_hub.go) but I still have the same problem.
do you find a solution to this yet ? How can I fix it ?

Thank you.

@marouane-kech I will check it on a 32bit machine.

Good morning @xiangli-cmu Have you some news about watcher function on a 32bit machine.
Thank you.

I'm hitting this one as well. That's one of the most amazing bugs in golang I've seem so far, and I see lots of those recently.

It's possible to reorder some fields accessed by atomic where that's required. A possible fix could be seen at farcaller@1257e1c. If this is an acceptable solution for etcd, I can clean that up and send a PR.

@farcaller Is there an upstream bug for this alignment problem?

This is to remind you of this issue. We are hit by this as well. Upstream says
"On both ARM and x86-32, it is the caller's responsibility to arrange for 64-bit alignment of 64-bit words accessed atomically." and has the issue status set to "Unfortunate".
I would consider this as a bug in etcd on 32bit ARM and x86-32.
To fix it one would have to make sure that all 64bit values that must be changed atomically are 8byte aligned. farcaller's patch might achieve this.

commented

Also hitting this bug on i386. As @farcaller's patch is not yet included in current master, is there any hope this is going to be fixed / worked around in etcd soon?

@jsteemann @neunhoef @farcaller Since you have the hardware can one of you figure out the alignment of the structs that makes this work for you and send a patch? I am fine taking a patch as long as it doesn't have any functional changes. I will also have to rely on you all to test releases so we don't regress.

Thanks.

Based on my discussion at go-nuts, there are no plans to make atomic safer. So far, the safest way is:

  • make sure that any integer accessible by atomic is first in the struct;
  • if more than one such integer is required, they must be ordered from longest (in number of bits) to shortest; alternatively, always use int64 for atomics.

@farcaller I like your patch!
I think this is the best way to fix it for now.
For third party lib, we may need to send separate PR to upstream one.
But we could merge the watch hub one if it works. :)

commented

@unihorn @farcaller: I have tried farcaller's patch (farcaller@1257e1c) with ETCD 0.3 on a 32 bit Ubuntu and it worked fine.
With patch, the crashes when using watches do not seem to occur anymore.
I have also tried the patch on 64 bit Linuxes, and it did not do any harm there either.

I'd love to see farcaller's merged being merged into ETCD.

The patch covers three repos, and I think it could only be a workaround for watch command.
I have grep int64 in the repo, and found out tens of lines. If I want to use this solution fully, I have to make changes on about five structs, which break their original sequences. And other changes should be applied to third_party repo.
Is this acceptable? @xiangli-cmu @philips @bmizerany

This is a big change since it is going to touch a ton in third_party, which we don't own and don't have a good way to know what that is going to affect, I think.

commented

If this is issue is not going to be fixed, I think there should at least be some statement in ETCD's README.md like ETCD works on 64 bit operating systems only. Support for 32 bit operating systems may be added in the future.
Just to save others from trouble.

What do you think?

etcd does not support 32bit systems. We have updated our README to state the fact. Thanks.