test killed after 10min on travis with docker mongo
dvic opened this issue
Hi,
Every once in a while our Mongo suite gets killed on TravisCI. We run Go 1.10 and use Docker for our test suites. Our Postgres and Neo4j test suites run just fine with this setup, but with mgo and Mongo we're having these issues.
Stacktrace information can be found below. Any idea why this is happening?
+go test -v -race -coverprofile=coverage.out -covermode=atomic ./...
=== RUN TestMongoSuiteWithoutCredentials
2018/03/04 13:50:01 CREATING NEW POOL
2018/03/04 13:50:01 POOL CREATED <nil>
2018/03/04 13:50:01 RUNNING MONGO CONTAINER
2018/03/04 13:50:11 MONGO CONTAINER CREATED <nil>
2018/03/04 13:50:11 BEFORE testConnect
2018/03/04 13:50:11 START DialWithTimeout
2018/03/04 13:50:11 MONGO URL = mongodb://localhost:32768
SIGQUIT: quit
PC=0x474643 m=0 sigcode=0
goroutine 31 [syscall]:
runtime.notetsleepg(0x12be9e0, 0x37e09133e, 0x16)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/runtime/lock_futex.go:227 +0x42 fp=0xc420052760 sp=0xc420052730 pc=0x422022
runtime.timerproc(0x12be9c0)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/runtime/time.go:261 +0x2f9 fp=0xc4200527d8 sp=0xc420052760 pc=0x461889
runtime.goexit()
/home/travis/.gimme/versions/go1.10.linux.amd64/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc4200527e0 sp=0xc4200527d8 pc=0x472bd1
created by runtime.(*timersBucket).addtimerLocked
/home/travis/.gimme/versions/go1.10.linux.amd64/src/runtime/time.go:160 +0x107
goroutine 1 [chan receive]:
testing.(*T).Run(0xc42021c000, 0xd2d7a8, 0x20, 0xd41b80, 0xc4201e5c00)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/testing/testing.go:825 +0x597
testing.runTests.func1(0xc42021c000)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/testing/testing.go:1063 +0xa5
testing.tRunner(0xc42021c000, 0xc4201e5d48)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/testing/testing.go:777 +0x16e
testing.runTests(0xc4201378e0, 0x127b3e0, 0x1, 0x1, 0xc420160800)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/testing/testing.go:1061 +0x4e2
testing.(*M).Run(0xc420160800, 0x0)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/testing/testing.go:978 +0x2ce
main.main()
_testmain.go:90 +0x325
goroutine 19 [syscall]:
os/signal.signal_recv(0x472bd1)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/runtime/sigqueue.go:139 +0xa6
os/signal.loop()
/home/travis/.gimme/versions/go1.10.linux.amd64/src/os/signal/signal_unix.go:22 +0x30
created by os/signal.init.0
/home/travis/.gimme/versions/go1.10.linux.amd64/src/os/signal/signal_unix.go:28 +0x4f
goroutine 20 [semacquire]:
sync.runtime_notifyListWait(0xc42023a6e8, 0xc400000000)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/runtime/sema.go:510 +0x11a
sync.(*Cond).Wait(0xc42023a6d8)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/sync/cond.go:56 +0x8e
github.com/globalsign/mgo.(*mongoCluster).AcquireSocket(0xc42023a6c0, 0x0, 0xc420240a01, 0x6fc23ac00, 0x6fc23ac00, 0x0, 0x0, 0x0, 0x1000, 0x1c5b320, ...)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:644 +0xff
github.com/globalsign/mgo.(*Session).acquireSocket(0xc4202409c0, 0xb9e201, 0x0, 0x0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:4853 +0x271
github.com/globalsign/mgo.(*Database).Run(0xc42017bc20, 0xc2ee40, 0xda60b0, 0x0, 0x0, 0x0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:799 +0x5e
github.com/globalsign/mgo.(*Session).Run(0xc4202409c0, 0xc2ee40, 0xda60b0, 0x0, 0x0, 0xcf84e0, 0xc42023a6c0)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:2270 +0xba
github.com/globalsign/mgo.(*Session).Ping(0xc4202409c0, 0xc42023a6c0, 0x6fc23ac00)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:2299 +0x5d
github.com/globalsign/mgo.DialWithInfo(0xc4202c0000, 0x17, 0xc4202c0000, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:563 +0x566
github.com/globalsign/mgo.DialWithTimeout(0xc420026d20, 0x17, 0x6fc23ac00, 0x0, 0xc420167780, 0xc4200b0120)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:304 +0xc3
mongo_test.(*suite).testConnect(0xc42017bf48, 0xc42021c0f0)
/home/travis/build/qdentity/qdentity/go/src/mongo/mongo_test.go:36 +0xc8
mongo_test.TestMongoSuiteWithoutCredentials(0xc42021c0f0)
/home/travis/build/qdentity/qdentity/go/src/mongo/mongo_test.go:22 +0x187
testing.tRunner(0xc42021c0f0, 0xd41b80)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/testing/testing.go:777 +0x16e
created by testing.(*T).Run
/home/travis/.gimme/versions/go1.10.linux.amd64/src/testing/testing.go:824 +0x565
goroutine 30 [semacquire]:
sync.runtime_notifyListWait(0xc42023a6e8, 0xc400000001)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/runtime/sema.go:510 +0x11a
sync.(*Cond).Wait(0xc42023a6d8)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/sync/cond.go:56 +0x8e
github.com/globalsign/mgo.(*mongoCluster).AcquireSocket(0xc42023a6c0, 0x1, 0xc420240b01, 0x2540be400, 0x2540be400, 0x0, 0x0, 0x0, 0x1000, 0xc420082700, ...)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:644 +0xff
github.com/globalsign/mgo.(*Session).acquireSocket(0xc420240b60, 0xc5f001, 0x0, 0x0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:4853 +0x271
github.com/globalsign/mgo.(*Database).Run(0xc4200779b8, 0xc5f0c0, 0xc42000d200, 0xc10ec0, 0xc420232630, 0x0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:799 +0x5e
github.com/globalsign/mgo.(*Session).Run(0xc420240b60, 0xc5f0c0, 0xc42000d200, 0xc10ec0, 0xc420232630, 0x0, 0x1)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:2270 +0xba
github.com/globalsign/mgo.(*mongoCluster).isMaster(0xc42023a6c0, 0xc4202c20f0, 0xc420232630, 0xc4202c20f0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:182 +0x258
github.com/globalsign/mgo.(*mongoCluster).syncServer(0xc42023a6c0, 0xc4202c00e0, 0xd, 0xc42001ed20, 0xc4202c00e0, 0xc42023a6c0, 0xc440000000, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:231 +0x434
github.com/globalsign/mgo.(*mongoCluster).syncServersIteration.func1.1(0xc420292060, 0xc420026d2a, 0xd, 0xc420292070, 0xc420026d00, 0xc4202867b0, 0xc42023a6c0, 0xc4202867e0, 0xc420286810, 0x0, ...)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:553 +0x1fb
created by github.com/globalsign/mgo.(*mongoCluster).syncServersIteration.func1
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:525 +0x175
goroutine 11 [semacquire]:
sync.runtime_Semacquire(0xc42029206c)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc420292060)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/sync/waitgroup.go:129 +0xb3
github.com/globalsign/mgo.(*mongoCluster).syncServersIteration(0xc42023a6c0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:582 +0x4c5
github.com/globalsign/mgo.(*mongoCluster).syncServersLoop(0xc42023a6c0)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:390 +0x17c
created by github.com/globalsign/mgo.newCluster
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:81 +0x2e3
goroutine 12 [sleep]:
time.Sleep(0x37e11d600)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/runtime/time.go:102 +0x146
github.com/globalsign/mgo.(*mongoServer).pinger(0xc4202c00e0, 0x479801)
/home/travis/gopath/src/github.com/globalsign/mgo/server.go:314 +0x7ad
created by github.com/globalsign/mgo.newServer
/home/travis/gopath/src/github.com/globalsign/mgo/server.go:89 +0x24b
goroutine 34 [IO wait]:
internal/poll.runtime_pollWait(0x7f50f3494f00, 0x72, 0x128aff0)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/runtime/netpoll.go:173 +0x5e
internal/poll.(*pollDesc).wait(0xc420234e18, 0x72, 0xda9f00, 0x128aff0, 0xffffffffffffffff)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/internal/poll/fd_poll_runtime.go:85 +0xe5
internal/poll.(*pollDesc).waitRead(0xc420234e18, 0xc420028800, 0x24, 0x24)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/internal/poll/fd_poll_runtime.go:90 +0x4b
internal/poll.(*FD).Read(0xc420234e00, 0xc420028840, 0x24, 0x24, 0x0, 0x0, 0x0)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/internal/poll/fd_unix.go:157 +0x22a
net.(*netFD).Read(0xc420234e00, 0xc420028840, 0x24, 0x24, 0x4ab9ed, 0xc420234e00, 0x0)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/net/fd_unix.go:202 +0x66
net.(*conn).Read(0xc42000e0c8, 0xc420028840, 0x24, 0x24, 0x0, 0xc4202c24b0, 0xc420062dc0)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/net/net.go:176 +0x85
github.com/globalsign/mgo.fill(0xdb3660, 0xc42000e0c8, 0xc420028840, 0x24, 0x24, 0x0, 0x11)
/home/travis/gopath/src/github.com/globalsign/mgo/socket.go:567 +0x64
github.com/globalsign/mgo.(*mongoSocket).readLoop(0xc4202c24b0)
/home/travis/gopath/src/github.com/globalsign/mgo/socket.go:583 +0x15b
created by github.com/globalsign/mgo.newSocket
/home/travis/gopath/src/github.com/globalsign/mgo/socket.go:197 +0x341
rax 0xfffffffffffffffc
rbx 0x12bb3a0
rcx 0x474643
rdx 0x0
rdi 0x12be9e0
rsi 0x0
rbp 0xc4200526e8
rsp 0xc420052698
r8 0x0
r9 0x0
r10 0xc4200526d8
r11 0x202
r12 0xc420079c80
r13 0x12bb3a0
r14 0xc420001500
r15 0x1a354620
rip 0x474643
rflags 0x202
cs 0x33
fs 0x0
gs 0x0
*** Test killed with quit: ran too long (10m0s).
FAIL mongo 600.006s
Could it be a problem with the `-race` flag? We removed the `-race` flag and up to this point the tests have stopped failing.
I got a similar (but not identical) deadlock & backtrace when running `TestConnectCloseConcurrency`. I think the main source of this problem is these two stacks:
goroutine 30 [semacquire]:
sync.runtime_notifyListWait(0xc42023a6e8, 0xc400000001)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/runtime/sema.go:510 +0x11a
sync.(*Cond).Wait(0xc42023a6d8)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/sync/cond.go:56 +0x8e
github.com/globalsign/mgo.(*mongoCluster).AcquireSocket(0xc42023a6c0, 0x1, 0xc420240b01, 0x2540be400, 0x2540be400, 0x0, 0x0, 0x0, 0x1000, 0xc420082700, ...)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:644 +0xff
github.com/globalsign/mgo.(*Session).acquireSocket(0xc420240b60, 0xc5f001, 0x0, 0x0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:4853 +0x271
github.com/globalsign/mgo.(*Database).Run(0xc4200779b8, 0xc5f0c0, 0xc42000d200, 0xc10ec0, 0xc420232630, 0x0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:799 +0x5e
github.com/globalsign/mgo.(*Session).Run(0xc420240b60, 0xc5f0c0, 0xc42000d200, 0xc10ec0, 0xc420232630, 0x0, 0x1)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:2270 +0xba
github.com/globalsign/mgo.(*mongoCluster).isMaster(0xc42023a6c0, 0xc4202c20f0, 0xc420232630, 0xc4202c20f0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:182 +0x258
github.com/globalsign/mgo.(*mongoCluster).syncServer(0xc42023a6c0, 0xc4202c00e0, 0xd, 0xc42001ed20, 0xc4202c00e0, 0xc42023a6c0, 0xc440000000, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:231 +0x434
github.com/globalsign/mgo.(*mongoCluster).syncServersIteration.func1.1(0xc420292060, 0xc420026d2a, 0xd, 0xc420292070, 0xc420026d00, 0xc4202867b0, 0xc42023a6c0, 0xc4202867e0, 0xc420286810, 0x0, ...)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:553 +0x1fb
created by github.com/globalsign/mgo.(*mongoCluster).syncServersIteration.func1
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:525 +0x175
and
goroutine 11 [semacquire]:
sync.runtime_Semacquire(0xc42029206c)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc420292060)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/sync/waitgroup.go:129 +0xb3
github.com/globalsign/mgo.(*mongoCluster).syncServersIteration(0xc42023a6c0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:582 +0x4c5
github.com/globalsign/mgo.(*mongoCluster).syncServersLoop(0xc42023a6c0)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:390 +0x17c
created by github.com/globalsign/mgo.newCluster
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:81 +0x2e3
As near as I can tell....

- Goroutine 11 runs `syncServersLoop`, which loops every few hundred ms and checks the topology of the cluster. `syncServersLoop` calls `syncServersIteration` to do its actual work on every pump of the loop. `syncServersIteration` spawns a new goroutine 30 and blocks goroutine 11 waiting for 30 on a `sync.WaitGroup`.
- The anonymous function in `syncServersIteration` calls `cluster.syncServer()` to probe the server and add it to the `cluster.masters` and `cluster.servers` slices. `cluster.syncServer` explicitly opens a socket to this particular server with a call to `server.AcquireSocket` (as opposed to opening a socket to any server in the cluster). `cluster.syncServer` then calls `server.isMaster()` with this socket, to ask if the server is a replset master. `isMaster` creates a new session and explicitly assigns the passed-in socket to it. It prepares a command and then attempts to execute it with `session.Run`.
- This eventually falls into `Database.Run()`, which calls `session.acquireSocket()`. `acquireSocket()` should be a no-op, since the `isMaster` call a few frames above explicitly set the socket with `setSocket`. However, it apparently fails the checks `s.masterSocket != nil && s.masterSocket.dead == nil` and `s.slaveSocket != nil && s.slaveSocket.dead == nil && s.slaveOk && slaveOk && (s.masterSocket == nil || s.consistency != PrimaryPreferred && s.consistency != Monotonic)`, and thus falls into `s.cluster().AcquireSocket()`. THIS, I believe, is the bug: the code higher up the stack is trying to call `isMaster` on a particular server, but this is going to get a connection to any arbitrary server matching the tags.
- `AcquireSocket` looks for a server in its understanding of the topology by checking `cluster.masters.Len()` and `cluster.servers.Len()`. However, the cluster discovery hasn't actually run yet: `syncServersIteration` (further up our call stack in this goroutine) is supposed to populate those collections with a call to `cluster.addServer()`, but it needs to finish its call to `syncServer`/`isMaster` first.
- Since the cluster topology isn't populated yet, `AcquireSocket` attempts to poke the `syncServers` loop on goroutine 11 by calling `cluster.syncServers`, which just writes to a channel. (This is actually a total no-op, because both sides of the channel are read/written non-blocking and the data is just a signal, but that is a different bug and not the actual issue.) `AcquireSocket` then waits on the condition variable `cluster.serverSynced.Wait()`.
- BUT, that condition variable is broadcast from three places: `syncServersLoop`, which is not iterating at the moment because goroutine 11 is blocked on the waitgroup in `syncServersIteration`; and `addServer` and `syncServer`, both of which are only called from `syncServersIteration`, which is exactly what we are blocked in on goroutine 30.
- Thus, we have a deadlock.
phew. That was fun.
I'm pretty sure the bug is that `isMaster` is using `session.setSocket` to ensure that the command issued with `Run` is run against the right server, but if something is wrong with the socket, instead of passing an error up to `isMaster`, `Run` calls `acquireSocket`, which just attempts to make a new socket to any random server in the cluster. The deadlock is not a code path that should ever be made to work, I think.
Thoughts?
Hi @dvic and @KJTsanaktsidis
First off - @dvic thanks for the solid report, and @KJTsanaktsidis thanks for diving deeper into mgo than is good for your sanity!
We'll take a look at this - we've never seen any deadlocks ourselves, but the possibility is definitely there; there's an amazing amount of interplay with the locks (as @KJTsanaktsidis can clearly attest!). Do either of you have any reproducing code we can look at?
Dom
I’ll have a look and see if I can find a solid reproduction next week - maybe a “mongo” server that accepts then closes all connections might trigger this code path?
@domodwyer I think I've managed to provide a repro in #121 - the test in the first commit fails about 20% of the time when I run it with `go test -check.v -check.f "S.TestNoDeadlockOnClose" -timeout 25s` on my machine.
Hi @dvic
We're going to merge #121 into development ASAP (thanks to @KJTsanaktsidis !) and cut a hotfix to master once it's tested. In the meantime would you be able to run your tests using the development mgo branch to check if it resolves this issue?
Dom
Hi @domodwyer, sure no problem. Thanks! Will try it now and get back to you.
No problem, for now I just used https://github.com/zendesk/mgo/tree/fix_dial_deadlock directly, TravisCI is running.. 🤞
Good news: I ran the test suite three times now, each passed without problems 👍 I'll keep them running just to be sure and I can also run it a few times on the dev branch once you're ready.
@domodwyer Tests keep passing, #121 definitely seems to solve the problem (for me at least). Let me know if you want me to perform additional test runs on the dev branch.
This is great news - thanks @dvic for reporting and @KJTsanaktsidis for such a comprehensive analysis and fix! Open source communities are alive and well! 👍
I will close this after the hotfix - thanks a lot!
Dom
Really happy to help - having this library be actively maintained helps everyone!
Hi @dvic, @KJTsanaktsidis
Sorry for disappearing, I was out the country! It looks like this has been fixed (thanks!) but with a direct push to development so this didn't close (I'll also find out how that happened - it should be PR only) so closing now.
I will cut a hotfix release after a test run - thanks again!
Dom