etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system

Home Page: https://etcd.io

test: TestIssue2746

xiang90 opened this issue

=== RUN   TestIssue2746
--- FAIL: TestIssue2746 (1.67s)
    cluster_test.go:360: #1: watch on http://127.0.0.1:20114 error: client: etcd cluster is unavailable or misconfigured

Not able to reproduce... Will try more...

Still reproducible (in less than 1% of runs) with the latest version (d32113a) on my machine (Xeon E3, 4 cores).

@AkihiroSuda

Can you type-assert that error to *client.ClusterError and print out its detail? (https://github.com/coreos/etcd/blob/master/client/cluster_error.go#L19-L33)
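Something like this would do (a minimal sketch, not a final patch; err, membs, and t are whatever the test already has in scope):

// Sketch: surface the per-endpoint errors hidden behind the generic
// "etcd cluster is unavailable or misconfigured" message by asserting
// the error to *client.ClusterError and printing its Detail().
if err != nil {
	if cerr, ok := err.(*client.ClusterError); ok {
		t.Fatalf("create on %s error: %v (detail: %s)", membs[0].URL(), err, cerr.Detail())
	}
	t.Fatalf("create on %s error: %v", membs[0].URL(), err)
}

Using the comma-ok form keeps the test from panicking if the error happens not to be a ClusterError.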

I got this ClusterError.

--- FAIL: TestIssue2746 (6.36s)
        cluster_test.go:351: create on http://127.0.0.1:20950 error: client: etcd cluster is unavailable or misconfigured(detail: error #0: read tcp 127.0.0.1:49676->127.0.0.1:20950: i/o timeout

Note that this error is raised from a slightly different point than the original one (the Create call at cluster_test.go:351 rather than the watch at cluster_test.go:360).

diff --git a/integration/cluster_test.go b/integration/cluster_test.go
index 4d7e9e0..c1be43d 100644
--- a/integration/cluster_test.go
+++ b/integration/cluster_test.go
@@ -347,7 +347,8 @@ func clusterMustProgress(t *testing.T, membs []*member) {
        key := fmt.Sprintf("foo%d", rand.Int())
        resp, err := kapi.Create(ctx, "/"+key, "bar")
        if err != nil {
-               t.Fatalf("create on %s error: %v", membs[0].URL(), err)
+               cerr := err.(*client.ClusterError)
+               t.Fatalf("create on %s error: %v(detail: %s)", membs[0].URL(), err, cerr.Detail())
        }
        cancel()

@@ -357,7 +358,9 @@ func clusterMustProgress(t *testing.T, membs []*member) {
                mkapi := client.NewKeysAPI(mcc)
                mctx, mcancel := context.WithTimeout(context.Background(), requestTimeout)
                if _, err := mkapi.Watcher(key, &client.WatcherOptions{AfterIndex: resp.Node.ModifiedIndex - 1}).Next(mctx); err != nil {
-                       t.Fatalf("#%d: watch on %s error: %v", i, u, err)
+                       cerr := err.(*client.ClusterError)
+                       t.Fatalf("#%d: watch on %s error: %v(detail: %s)", i, u, err, cerr.Detail())
+
                }
                mcancel()
        }

@heyitsanthony Can you take this over? I cannot reproduce this on my local machine :(. Thanks!

ETCD_ELECTION_TIMEOUT_TICKS wasn't set on Semaphore (unlike Travis), so a new election was being triggered, which caused the lost leader to drop messages. I tried to reproduce with the election timeout ticks set to 600 and it seemed to work OK. Updated Semaphore and marking this as closed.
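For reference, a rough sketch of how that kind of env override can be read by a test harness (the helper name is illustrative, not the actual integration code; only the ETCD_ELECTION_TIMEOUT_TICKS variable comes from the comment above):

package integration

import (
	"os"
	"strconv"
)

// electionTicksFromEnv is an illustrative helper: it returns the value of
// ETCD_ELECTION_TIMEOUT_TICKS when it is set to a valid integer, and falls
// back to the given default otherwise (e.g. when the CI config omits it).
func electionTicksFromEnv(def int) int {
	if v := os.Getenv("ETCD_ELECTION_TIMEOUT_TICKS"); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}

With something like that in place, exporting ETCD_ELECTION_TIMEOUT_TICKS=600 in the Semaphore build settings (as already done for Travis) keeps a slow CI host from triggering spurious elections during the test.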