etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system

Home Page:https://etcd.io

Revision decreasing after panic during compaction

serathius opened this issue · comments

What happened?

Failure in https://github.com/etcd-io/etcd/actions/runs/8659974818

What did you expect to happen?

The revision does not decrease.

How can we reproduce it (as minimally and precisely as possible)?

Follow https://github.com/etcd-io/etcd/tree/main/tests/robustness#re-evaluate-existing-report to validate report from https://github.com/etcd-io/etcd/actions/runs/8659974818

Run TestRobustnessExploratory/Kubernetes/LowTraffic/ClusterOfSize1 with failpoint compactBeforeSetFinishedCompact=panic()

Anything else we need to know?

TODO:

  • Try to reproduce the issue
  • Try to reproduce the issue on amd64

Hmm, still no repro. Are we so unlucky that we hit a bit flip or broken machine?

I was not able to reproduce it on my Mac (arm64) in 500 runs with the failpoint compactBeforeSetFinishedCompact=panic().

I was not able to download the log.

The reproduction attempts on a Linux amd64 machine were also unsuccessful.

We need a persisted audit log to easily replay the traffic. Turning on LogUnaryInterceptor could be a good start.

I was not able to download the log.

What do you mean? I was able to download main-arm64.zip from https://github.com/etcd-io/etcd/actions/runs/8659974818 without any problem.

I think we need to assume that this was a hardware issue. One last thing to confirm: could someone check the bbolt file from the report? It would be a good sanity check that the revision decrease really happened.

@serathius I was able to download it and have shared it with Chao (it took a few retries; spotty networking).

Somehow I cannot download it from the GitHub UI. I retried multiple times.

dev-dsk-chaochn-2c-a26acd76 % unzip logs_22693293820.zip
Archive:  logs_22693293820.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of logs_22693293820.zip or
        logs_22693293820.zip.zip, and cannot find logs_22693293820.zip.ZIP, period.

@serathius Just to confirm, is this what you were looking for?

I did not observe the revision number decreasing, but there are some gaps between revisions.

scheduledCompactRev was set to 301, as expected.

dev-dsk-chaochn-2c-a26acd76 % etcd-dump-db iterate-bucket testdata/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/server-TestRobustnessExploratoryKubernetesLowTrafficClusterOfSize1-test-0/member/snap/db key --decode=true
rev={main:470 sub:0}, value=[key "/registry/pods/default/HcN6F" | val "343" | created 470 | mod 470 | ver 1]
rev={main:469 sub:0}, value=[key "/registry/pods/default/gq7WI" | val "340" | created 132 | mod 469 | ver 3]
rev={main:468 sub:0}, value=[key "/registry/pods/default/Q4wBI" | val "337" | created 65 | mod 468 | ver 6]
rev={main:467 sub:0}, value=[key "/registry/pods/default/UhvUo" | val "333" | created 57 | mod 467 | ver 4]
rev={main:466 sub:0}, value=[key "/registry/pods/default/nrz0G" | val "332" | created 101 | mod 466 | ver 5]
rev={main:465 sub:0}, value=[key "/registry/pods/default/Y9W7o" | val "330" | created 66 | mod 465 | ver 6]
rev={main:464 sub:0}, value=[key "/registry/pods/default/wzDjI" | val "328" | created 177 | mod 464 | ver 4]
rev={main:463 sub:0}, value=[key "/registry/pods/default/oitZs" | val "327" | created 152 | mod 463 | ver 3]
rev={main:462 sub:0}, value=[key "/registry/pods/default/4JCLw" | val "326" | created 42 | mod 462 | ver 6]
rev={main:461 sub:0}, value=[key "/registry/pods/default/OQVJw" | val "325" | created 36 | mod 461 | ver 6]
rev={main:460 sub:0}, value=[key "/registry/pods/default/loies" | val "324" | created 185 | mod 460 | ver 8]
rev={main:459 sub:0}, value=[key "/registry/pods/default/HhKkd" | val "323" | created 112 | mod 459 | ver 5]
rev={main:458 sub:0}, value=[key "/registry/pods/default/mFTeh" | val "322" | created 458 | mod 458 | ver 1]
rev={main:457 sub:0}, value=[key "/registry/pods/default/w3EyG" | val "321" | created 43 | mod 457 | ver 5]
rev={main:456 sub:0}, value=[key "/registry/pods/default/w3EyG" | val "320" | created 43 | mod 456 | ver 4]
rev={main:455 sub:0}, value=[key "/registry/pods/default/Y9W7o" | val "319" | created 66 | mod 455 | ver 5]
rev={main:454 sub:0}, value=[key "/registry/pods/default/Q4wBI" | val "318" | created 65 | mod 454 | ver 5]
rev={main:453 sub:0}, value=[key "/registry/pods/default/OQVJw" | val "317" | created 36 | mod 453 | ver 5]
rev={main:452 sub:0}, value=[key "/registry/pods/default/HhKkd" | val "316" | created 112 | mod 452 | ver 4]
rev={main:451 sub:0}, value=[key "/registry/pods/default/Kkeuu" | val "315" | created 438 | mod 451 | ver 4]
rev={main:450 sub:0}, value=[key "/registry/pods/default/EpsWz" | val "314" | created 79 | mod 450 | ver 5]
rev={main:449 sub:0}, value=[key "/registry/pods/default/OQVJw" | val "313" | created 36 | mod 449 | ver 4]
rev={main:448 sub:0}, value=[key "/registry/pods/default/5u0i1" | val "312" | created 119 | mod 448 | ver 5]
rev={main:447 sub:0}, value=[key "/registry/pods/default/gq7WI" | val "311" | created 132 | mod 447 | ver 2]
rev={main:446 sub:0}, value=[key "/registry/pods/default/Kkeuu" | val "310" | created 438 | mod 446 | ver 3]
rev={main:445 sub:0}, value=[key "/registry/pods/default/loies" | val "309" | created 185 | mod 445 | ver 7]
rev={main:444 sub:0}, value=[key "/registry/pods/default/nrz0G" | val "308" | created 101 | mod 444 | ver 4]
rev={main:443 sub:0}, value=[key "/registry/pods/default/4JCLw" | val "307" | created 42 | mod 443 | ver 5]
rev={main:442 sub:0}, value=[key "/registry/pods/default/5u0i1" | val "306" | created 119 | mod 442 | ver 4]
rev={main:441 sub:0}, value=[key "/registry/pods/default/Kkeuu" | val "305" | created 438 | mod 441 | ver 2]
rev={main:440 sub:0}, value=[key "/registry/pods/default/loies" | val "304" | created 185 | mod 440 | ver 6]
rev={main:439 sub:0}, value=[key "/registry/pods/default/cGyPD" | val "303" | created 52 | mod 439 | ver 4]
rev={main:438 sub:0}, value=[key "/registry/pods/default/Kkeuu" | val "302" | created 438 | mod 438 | ver 1]
rev={main:437 sub:0}, value=[key "/registry/pods/default/EpsWz" | val "301" | created 79 | mod 437 | ver 4]
rev={main:436 sub:0}, value=[key "/registry/pods/default/oitZs" | val "300" | created 152 | mod 436 | ver 2]
rev={main:435 sub:0}, value=[key "/registry/pods/default/nrz0G" | val "299" | created 101 | mod 435 | ver 3]
rev={main:434 sub:0}, value=[key "/registry/pods/default/OQVJw" | val "298" | created 36 | mod 434 | ver 3]
rev={main:433 sub:0}, value=[key "/registry/pods/default/loies" | val "297" | created 185 | mod 433 | ver 5]
rev={main:432 sub:0}, value=[key "/registry/pods/default/Q4wBI" | val "296" | created 65 | mod 432 | ver 4]
rev={main:431 sub:0}, value=[key "/registry/pods/default/cGyPD" | val "295" | created 52 | mod 431 | ver 3]
rev={main:430 sub:0}, value=[key "/registry/pods/default/4JCLw" | val "294" | created 42 | mod 430 | ver 4]
rev={main:429 sub:0}, value=[key "/registry/pods/default/b4liI" | val "" | created 0 | mod 0 | ver 0]
rev={main:428 sub:0}, value=[key "/registry/pods/default/RDgx0" | val "" | created 0 | mod 0 | ver 0]
rev={main:427 sub:0}, value=[key "/registry/pods/default/pTNFe" | val "" | created 0 | mod 0 | ver 0]
rev={main:426 sub:0}, value=[key "/registry/pods/default/aQ8HK" | val "" | created 0 | mod 0 | ver 0]
rev={main:425 sub:0}, value=[key "/registry/pods/default/1bLqv" | val "" | created 0 | mod 0 | ver 0]
rev={main:424 sub:0}, value=[key "/registry/pods/default/DCo9K" | val "" | created 0 | mod 0 | ver 0]
rev={main:423 sub:0}, value=[key "/registry/pods/default/lmO6M" | val "" | created 0 | mod 0 | ver 0]
rev={main:422 sub:0}, value=[key "/registry/pods/default/ydYO1" | val "" | created 0 | mod 0 | ver 0]
rev={main:421 sub:0}, value=[key "/registry/pods/default/w3EyG" | val "293" | created 43 | mod 421 | ver 3]
rev={main:420 sub:0}, value=[key "/registry/pods/default/ldPwh" | val "" | created 0 | mod 0 | ver 0]
rev={main:419 sub:0}, value=[key "/registry/pods/default/63R6m" | val "" | created 0 | mod 0 | ver 0]
rev={main:418 sub:0}, value=[key "/registry/pods/default/M9uIW" | val "" | created 0 | mod 0 | ver 0]
rev={main:417 sub:0}, value=[key "/registry/pods/default/5Dtb4" | val "" | created 0 | mod 0 | ver 0]
rev={main:416 sub:0}, value=[key "/registry/pods/default/gm4gQ" | val "" | created 0 | mod 0 | ver 0]
rev={main:415 sub:0}, value=[key "/registry/pods/default/Hz305" | val "" | created 0 | mod 0 | ver 0]
rev={main:414 sub:0}, value=[key "/registry/pods/default/MWrBJ" | val "" | created 0 | mod 0 | ver 0]
rev={main:413 sub:0}, value=[key "/registry/pods/default/TYdSY" | val "" | created 0 | mod 0 | ver 0]
rev={main:412 sub:0}, value=[key "/registry/pods/default/M9uIW" | val "292" | created 77 | mod 412 | ver 3]
rev={main:411 sub:0}, value=[key "/registry/pods/default/jy2Ea" | val "" | created 0 | mod 0 | ver 0]
rev={main:410 sub:0}, value=[key "/registry/pods/default/loies" | val "291" | created 185 | mod 410 | ver 4]
rev={main:409 sub:0}, value=[key "/registry/pods/default/zTZIE" | val "" | created 0 | mod 0 | ver 0]
rev={main:408 sub:0}, value=[key "/registry/pods/default/UeZqt" | val "" | created 0 | mod 0 | ver 0]
rev={main:407 sub:0}, value=[key "/registry/pods/default/bjk4r" | val "" | created 0 | mod 0 | ver 0]
rev={main:406 sub:0}, value=[key "/registry/pods/default/65uJD" | val "" | created 0 | mod 0 | ver 0]
rev={main:405 sub:0}, value=[key "/registry/pods/default/1u7se" | val "" | created 0 | mod 0 | ver 0]
rev={main:404 sub:0}, value=[key "/registry/pods/default/PlsfG" | val "" | created 0 | mod 0 | ver 0]
rev={main:403 sub:0}, value=[key "/registry/pods/default/CwjdS" | val "" | created 0 | mod 0 | ver 0]
rev={main:402 sub:0}, value=[key "/registry/pods/default/PFh2o" | val "" | created 0 | mod 0 | ver 0]
rev={main:401 sub:0}, value=[key "/registry/pods/default/Q9JZ7" | val "" | created 0 | mod 0 | ver 0]
rev={main:400 sub:0}, value=[key "/registry/pods/default/31cCM" | val "" | created 0 | mod 0 | ver 0]
rev={main:399 sub:0}, value=[key "/registry/pods/default/SF9kd" | val "" | created 0 | mod 0 | ver 0]
rev={main:398 sub:0}, value=[key "/registry/pods/default/jwsRC" | val "" | created 0 | mod 0 | ver 0]
rev={main:397 sub:0}, value=[key "/registry/pods/default/rU6I5" | val "" | created 0 | mod 0 | ver 0]
rev={main:396 sub:0}, value=[key "/registry/pods/default/ytwZZ" | val "" | created 0 | mod 0 | ver 0]
rev={main:395 sub:0}, value=[key "/registry/pods/default/obin7" | val "" | created 0 | mod 0 | ver 0]
rev={main:394 sub:0}, value=[key "/registry/pods/default/7a7Gt" | val "" | created 0 | mod 0 | ver 0]
rev={main:393 sub:0}, value=[key "/registry/pods/default/E7Zuy" | val "" | created 0 | mod 0 | ver 0]
rev={main:392 sub:0}, value=[key "/registry/pods/default/tzdxB" | val "" | created 0 | mod 0 | ver 0]
rev={main:391 sub:0}, value=[key "/registry/pods/default/eJsR5" | val "" | created 0 | mod 0 | ver 0]
rev={main:390 sub:0}, value=[key "/registry/pods/default/iUgY8" | val "" | created 0 | mod 0 | ver 0]
rev={main:389 sub:0}, value=[key "/registry/pods/default/MiZ8L" | val "" | created 0 | mod 0 | ver 0]
rev={main:388 sub:0}, value=[key "/registry/pods/default/ABFnT" | val "" | created 0 | mod 0 | ver 0]
rev={main:387 sub:0}, value=[key "/registry/pods/default/aI3l7" | val "" | created 0 | mod 0 | ver 0]
rev={main:386 sub:0}, value=[key "/registry/pods/default/loies" | val "290" | created 185 | mod 386 | ver 3]
rev={main:385 sub:0}, value=[key "/registry/pods/default/lthiH" | val "" | created 0 | mod 0 | ver 0]
rev={main:384 sub:0}, value=[key "/registry/pods/default/cGyPD" | val "289" | created 52 | mod 384 | ver 2]
rev={main:383 sub:0}, value=[key "/registry/pods/default/ABFnT" | val "288" | created 107 | mod 383 | ver 2]
rev={main:382 sub:0}, value=[key "/registry/pods/default/wzDjI" | val "287" | created 177 | mod 382 | ver 3]
rev={main:381 sub:0}, value=[key "/registry/pods/default/EpsWz" | val "286" | created 79 | mod 381 | ver 3]
rev={main:380 sub:0}, value=[key "/registry/pods/default/jy2Ea" | val "285" | created 90 | mod 380 | ver 3]
rev={main:379 sub:0}, value=[key "/registry/pods/default/5k2u5" | val "" | created 0 | mod 0 | ver 0]
rev={main:378 sub:0}, value=[key "/registry/pods/default/b4liI" | val "283" | created 91 | mod 378 | ver 4]
rev={main:377 sub:0}, value=[key "/registry/pods/default/Y9W7o" | val "282" | created 66 | mod 377 | ver 4]
rev={main:376 sub:0}, value=[key "/registry/pods/default/1bLqv" | val "281" | created 98 | mod 376 | ver 3]
rev={main:375 sub:0}, value=[key "/registry/pods/default/rU6I5" | val "280" | created 82 | mod 375 | ver 3]
rev={main:374 sub:0}, value=[key "/registry/pods/default/UhvUo" | val "279" | created 57 | mod 374 | ver 3]
rev={main:373 sub:0}, value=[key "/registry/pods/default/63R6m" | val "278" | created 93 | mod 373 | ver 3]
rev={main:372 sub:0}, value=[key "/registry/pods/default/F9XSE" | val "" | created 0 | mod 0 | ver 0]
rev={main:371 sub:0}, value=[key "/registry/pods/default/4JCLw" | val "277" | created 42 | mod 371 | ver 3]
rev={main:370 sub:0}, value=[key "/registry/pods/default/31cCM" | val "276" | created 370 | mod 370 | ver 1]
rev={main:369 sub:0}, value=[key "/registry/pods/default/lthiH" | val "275" | created 178 | mod 369 | ver 3]
rev={main:368 sub:0}, value=[key "/registry/pods/default/PlsfG" | val "274" | created 327 | mod 368 | ver 2]
rev={main:367 sub:0}, value=[key "/registry/pods/default/TYdSY" | val "273" | created 27 | mod 367 | ver 3]
rev={main:366 sub:0}, value=[key "/registry/pods/default/EpsWz" | val "272" | created 79 | mod 366 | ver 2]
rev={main:365 sub:0}, value=[key "/registry/pods/default/65uJD" | val "271" | created 136 | mod 365 | ver 6]
rev={main:364 sub:0}, value=[key "/registry/pods/default/jQJaP" | val "" | created 0 | mod 0 | ver 0]
rev={main:363 sub:0}, value=[key "/registry/pods/default/5u0i1" | val "269" | created 119 | mod 363 | ver 3]
rev={main:362 sub:0}, value=[key "/registry/pods/default/nrz0G" | val "268" | created 101 | mod 362 | ver 2]
rev={main:361 sub:0}, value=[key "/registry/pods/default/Y9W7o" | val "267" | created 66 | mod 361 | ver 3]
rev={main:360 sub:0}, value=[key "/registry/pods/default/gm4gQ" | val "266" | created 33 | mod 360 | ver 3]
rev={main:359 sub:0}, value=[key "/registry/pods/default/obin7" | val "265" | created 141 | mod 359 | ver 3]
rev={main:358 sub:0}, value=[key "/registry/pods/default/xtr4a" | val "" | created 0 | mod 0 | ver 0]
rev={main:357 sub:0}, value=[key "/registry/pods/default/DCo9K" | val "263" | created 189 | mod 357 | ver 3]
rev={main:356 sub:0}, value=[key "/registry/pods/default/MWrBJ" | val "262" | created 129 | mod 356 | ver 2]
rev={main:355 sub:0}, value=[key "/registry/pods/default/Q4wBI" | val "261" | created 65 | mod 355 | ver 3]
rev={main:354 sub:0}, value=[key "/registry/pods/default/Q9JZ7" | val "259" | created 16 | mod 354 | ver 3]
rev={main:353 sub:0}, value=[key "/registry/pods/default/jwsRC" | val "260" | created 121 | mod 353 | ver 4]
rev={main:352 sub:0}, value=[key "/registry/pods/default/F9XSE" | val "258" | created 20 | mod 352 | ver 2]
rev={main:351 sub:0}, value=[key "/registry/pods/default/nyn3R" | val "" | created 0 | mod 0 | ver 0]
rev={main:350 sub:0}, value=[key "/registry/pods/default/1u7se" | val "257" | created 191 | mod 350 | ver 3]
rev={main:349 sub:0}, value=[key "/registry/pods/default/b4liI" | val "256" | created 91 | mod 349 | ver 3]
rev={main:348 sub:0}, value=[key "/registry/pods/default/bjk4r" | val "255" | created 116 | mod 348 | ver 3]
rev={main:347 sub:0}, value=[key "/registry/pods/default/HhKkd" | val "254" | created 112 | mod 347 | ver 3]
rev={main:346 sub:0}, value=[key "/registry/pods/default/4JCLw" | val "253" | created 42 | mod 346 | ver 2]
rev={main:345 sub:0}, value=[key "/registry/pods/default/65uJD" | val "252" | created 136 | mod 345 | ver 5]
rev={main:344 sub:0}, value=[key "/registry/pods/default/RDgx0" | val "251" | created 28 | mod 344 | ver 2]
rev={main:343 sub:0}, value=[key "/registry/pods/default/oNdAT" | val "" | created 0 | mod 0 | ver 0]
rev={main:342 sub:0}, value=[key "/registry/pods/default/Q4wBI" | val "249" | created 65 | mod 342 | ver 2]
rev={main:341 sub:0}, value=[key "/registry/pods/default/Q9JZ7" | val "248" | created 16 | mod 341 | ver 2]
rev={main:340 sub:0}, value=[key "/registry/pods/default/TYdSY" | val "247" | created 27 | mod 340 | ver 2]
rev={main:339 sub:0}, value=[key "/registry/pods/default/E7Zuy" | val "246" | created 140 | mod 339 | ver 3]
rev={main:338 sub:0}, value=[key "/registry/pods/default/jy2Ea" | val "245" | created 90 | mod 338 | ver 2]
rev={main:337 sub:0}, value=[key "/registry/pods/default/1u7se" | val "244" | created 191 | mod 337 | ver 2]
rev={main:336 sub:0}, value=[key "/registry/pods/default/7vXeA" | val "" | created 0 | mod 0 | ver 0]
rev={main:335 sub:0}, value=[key "/registry/pods/default/jwsRC" | val "243" | created 121 | mod 335 | ver 3]
rev={main:334 sub:0}, value=[key "/registry/pods/default/OQVJw" | val "242" | created 36 | mod 334 | ver 2]
rev={main:333 sub:0}, value=[key "/registry/pods/default/iUgY8" | val "240" | created 63 | mod 333 | ver 2]
rev={main:332 sub:0}, value=[key "/registry/pods/default/65uJD" | val "241" | created 136 | mod 332 | ver 4]
rev={main:331 sub:0}, value=[key "/registry/pods/default/bjk4r" | val "239" | created 116 | mod 331 | ver 2]
rev={main:330 sub:0}, value=[key "/registry/pods/default/ldPwh" | val "238" | created 106 | mod 330 | ver 2]
rev={main:329 sub:0}, value=[key "/registry/pods/default/tzdxB" | val "237" | created 47 | mod 329 | ver 2]
rev={main:328 sub:0}, value=[key "/registry/pods/default/WVd2B" | val "" | created 0 | mod 0 | ver 0]
rev={main:327 sub:0}, value=[key "/registry/pods/default/PlsfG" | val "236" | created 327 | mod 327 | ver 1]
rev={main:326 sub:0}, value=[key "/registry/pods/default/loies" | val "234" | created 185 | mod 326 | ver 2]
rev={main:325 sub:0}, value=[key "/registry/pods/default/w3EyG" | val "233" | created 43 | mod 325 | ver 2]
rev={main:324 sub:0}, value=[key "/registry/pods/default/DCo9K" | val "231" | created 189 | mod 324 | ver 2]
rev={main:323 sub:0}, value=[key "/registry/pods/default/UhvUo" | val "230" | created 57 | mod 323 | ver 2]
rev={main:322 sub:0}, value=[key "/registry/pods/default/JtjI4" | val "" | created 0 | mod 0 | ver 0]
rev={main:321 sub:0}, value=[key "/registry/pods/default/Y9W7o" | val "229" | created 66 | mod 321 | ver 2]
rev={main:320 sub:0}, value=[key "/registry/pods/default/1bLqv" | val "228" | created 98 | mod 320 | ver 2]
rev={main:319 sub:0}, value=[key "/registry/pods/default/b4liI" | val "227" | created 91 | mod 319 | ver 2]
rev={main:318 sub:0}, value=[key "/registry/pods/default/zTZIE" | val "226" | created 138 | mod 318 | ver 2]
rev={main:317 sub:0}, value=[key "/registry/pods/default/63R6m" | val "225" | created 93 | mod 317 | ver 2]
rev={main:316 sub:0}, value=[key "/registry/pods/default/eJsR5" | val "224" | created 78 | mod 316 | ver 2]
rev={main:315 sub:0}, value=[key "/registry/pods/default/h8XO2" | val "" | created 0 | mod 0 | ver 0]
rev={main:314 sub:0}, value=[key "/registry/pods/default/aI3l7" | val "223" | created 256 | mod 314 | ver 2]
rev={main:313 sub:0}, value=[key "/registry/pods/default/lthiH" | val "222" | created 178 | mod 313 | ver 2]
rev={main:312 sub:0}, value=[key "/registry/pods/default/gm4gQ" | val "221" | created 33 | mod 312 | ver 2]
rev={main:311 sub:0}, value=[key "/registry/pods/default/wzDjI" | val "219" | created 177 | mod 311 | ver 2]
rev={main:310 sub:0}, value=[key "/registry/pods/default/sEkjd" | val "" | created 0 | mod 0 | ver 0]
rev={main:309 sub:0}, value=[key "/registry/pods/default/65uJD" | val "218" | created 136 | mod 309 | ver 3]
rev={main:308 sub:0}, value=[key "/registry/pods/default/aQ8HK" | val "217" | created 83 | mod 308 | ver 2]
rev={main:307 sub:0}, value=[key "/registry/pods/default/5u0i1" | val "216" | created 119 | mod 307 | ver 2]
rev={main:306 sub:0}, value=[key "/registry/pods/default/5Dtb4" | val "215" | created 251 | mod 306 | ver 3]
rev={main:305 sub:0}, value=[key "/registry/pods/default/rU6I5" | val "213" | created 82 | mod 305 | ver 2]
rev={main:304 sub:0}, value=[key "/registry/pods/default/jwsRC" | val "212" | created 121 | mod 304 | ver 2]
rev={main:303 sub:0}, value=[key "/registry/pods/default/M9uIW" | val "210" | created 77 | mod 303 | ver 2]
rev={main:302 sub:0}, value=[key "/registry/pods/default/PFh2o" | val "211" | created 55 | mod 302 | ver 2]
rev={main:301 sub:0}, value=[key "/registry/pods/default/HhKkd" | val "207" | created 112 | mod 301 | ver 2]
rev={main:300 sub:0}, value=[key "/registry/pods/default/E7Zuy" | val "209" | created 140 | mod 300 | ver 2]
rev={main:299 sub:0}, value=[key "/registry/pods/default/5Dtb4" | val "208" | created 251 | mod 299 | ver 2]
rev={main:298 sub:0}, value=[key "/registry/pods/default/vZJa5" | val "" | created 0 | mod 0 | ver 0]
rev={main:297 sub:0}, value=[key "/registry/pods/default/obin7" | val "200" | created 141 | mod 297 | ver 2]
rev={main:262 sub:0}, value=[key "/registry/pods/default/65uJD" | val "199" | created 136 | mod 262 | ver 2]
rev={main:256 sub:0}, value=[key "/registry/pods/default/aI3l7" | val "198" | created 256 | mod 256 | ver 1]
rev={main:251 sub:0}, value=[key "/registry/pods/default/5Dtb4" | val "197" | created 251 | mod 251 | ver 1]
rev={main:191 sub:0}, value=[key "/registry/pods/default/1u7se" | val "196" | created 191 | mod 191 | ver 1]
rev={main:190 sub:0}, value=[key "/registry/pods/default/7vXeA" | val "195" | created 190 | mod 190 | ver 1]
rev={main:189 sub:0}, value=[key "/registry/pods/default/DCo9K" | val "194" | created 189 | mod 189 | ver 1]
rev={main:187 sub:0}, value=[key "/registry/pods/default/oNdAT" | val "193" | created 187 | mod 187 | ver 1]
rev={main:185 sub:0}, value=[key "/registry/pods/default/loies" | val "189" | created 185 | mod 185 | ver 1]
rev={main:180 sub:0}, value=[key "/registry/pods/default/lmO6M" | val "183" | created 180 | mod 180 | ver 1]
rev={main:179 sub:0}, value=[key "/registry/pods/default/h8XO2" | val "182" | created 179 | mod 179 | ver 1]
rev={main:178 sub:0}, value=[key "/registry/pods/default/lthiH" | val "181" | created 178 | mod 178 | ver 1]
rev={main:177 sub:0}, value=[key "/registry/pods/default/wzDjI" | val "180" | created 177 | mod 177 | ver 1]
rev={main:171 sub:0}, value=[key "/registry/pods/default/Hz305" | val "173" | created 171 | mod 171 | ver 1]
rev={main:165 sub:0}, value=[key "/registry/pods/default/UeZqt" | val "167" | created 165 | mod 165 | ver 1]
rev={main:152 sub:0}, value=[key "/registry/pods/default/oitZs" | val "153" | created 152 | mod 152 | ver 1]
rev={main:143 sub:0}, value=[key "/registry/pods/default/sEkjd" | val "144" | created 143 | mod 143 | ver 1]
rev={main:140 sub:0}, value=[key "/registry/pods/default/E7Zuy" | val "140" | created 140 | mod 140 | ver 1]
rev={main:139 sub:0}, value=[key "/registry/pods/default/7a7Gt" | val "139" | created 139 | mod 139 | ver 1]
rev={main:138 sub:0}, value=[key "/registry/pods/default/zTZIE" | val "138" | created 138 | mod 138 | ver 1]
rev={main:132 sub:0}, value=[key "/registry/pods/default/gq7WI" | val "131" | created 132 | mod 132 | ver 1]
rev={main:130 sub:0}, value=[key "/registry/pods/default/ydYO1" | val "128" | created 130 | mod 130 | ver 1]
rev={main:129 sub:0}, value=[key "/registry/pods/default/MWrBJ" | val "127" | created 129 | mod 129 | ver 1]
rev={main:125 sub:0}, value=[key "/registry/pods/default/xtr4a" | val "117" | created 125 | mod 125 | ver 1]
rev={main:121 sub:0}, value=[key "/registry/pods/default/jwsRC" | val "123" | created 121 | mod 121 | ver 1]
rev={main:119 sub:0}, value=[key "/registry/pods/default/5u0i1" | val "122" | created 119 | mod 119 | ver 1]
rev={main:116 sub:0}, value=[key "/registry/pods/default/bjk4r" | val "114" | created 116 | mod 116 | ver 1]
rev={main:114 sub:0}, value=[key "/registry/pods/default/jQJaP" | val "112" | created 114 | mod 114 | ver 1]
rev={main:112 sub:0}, value=[key "/registry/pods/default/HhKkd" | val "110" | created 112 | mod 112 | ver 1]
rev={main:107 sub:0}, value=[key "/registry/pods/default/ABFnT" | val "105" | created 107 | mod 107 | ver 1]
rev={main:106 sub:0}, value=[key "/registry/pods/default/ldPwh" | val "104" | created 106 | mod 106 | ver 1]
rev={main:102 sub:0}, value=[key "/registry/pods/default/SF9kd" | val "99" | created 102 | mod 102 | ver 1]
rev={main:101 sub:0}, value=[key "/registry/pods/default/nrz0G" | val "98" | created 101 | mod 101 | ver 1]
rev={main:100 sub:0}, value=[key "/registry/pods/default/MiZ8L" | val "97" | created 100 | mod 100 | ver 1]
rev={main:98 sub:0}, value=[key "/registry/pods/default/1bLqv" | val "95" | created 98 | mod 98 | ver 1]
rev={main:94 sub:0}, value=[key "/registry/pods/default/vZJa5" | val "91" | created 94 | mod 94 | ver 1]
rev={main:93 sub:0}, value=[key "/registry/pods/default/63R6m" | val "86" | created 93 | mod 93 | ver 1]
rev={main:91 sub:0}, value=[key "/registry/pods/default/b4liI" | val "87" | created 91 | mod 91 | ver 1]
rev={main:90 sub:0}, value=[key "/registry/pods/default/jy2Ea" | val "89" | created 90 | mod 90 | ver 1]
rev={main:87 sub:0}, value=[key "/registry/pods/default/JtjI4" | val "84" | created 87 | mod 87 | ver 1]
rev={main:83 sub:0}, value=[key "/registry/pods/default/aQ8HK" | val "80" | created 83 | mod 83 | ver 1]
rev={main:82 sub:0}, value=[key "/registry/pods/default/rU6I5" | val "79" | created 82 | mod 82 | ver 1]
rev={main:79 sub:0}, value=[key "/registry/pods/default/EpsWz" | val "76" | created 79 | mod 79 | ver 1]
rev={main:78 sub:0}, value=[key "/registry/pods/default/eJsR5" | val "75" | created 78 | mod 78 | ver 1]
rev={main:77 sub:0}, value=[key "/registry/pods/default/M9uIW" | val "73" | created 77 | mod 77 | ver 1]
rev={main:75 sub:0}, value=[key "/registry/pods/default/pTNFe" | val "71" | created 75 | mod 75 | ver 1]
rev={main:72 sub:0}, value=[key "/registry/pods/default/CwjdS" | val "68" | created 72 | mod 72 | ver 1]
rev={main:66 sub:0}, value=[key "/registry/pods/default/Y9W7o" | val "62" | created 66 | mod 66 | ver 1]
rev={main:65 sub:0}, value=[key "/registry/pods/default/Q4wBI" | val "61" | created 65 | mod 65 | ver 1]
rev={main:63 sub:0}, value=[key "/registry/pods/default/iUgY8" | val "59" | created 63 | mod 63 | ver 1]
rev={main:57 sub:0}, value=[key "/registry/pods/default/UhvUo" | val "54" | created 57 | mod 57 | ver 1]
rev={main:55 sub:0}, value=[key "/registry/pods/default/PFh2o" | val "52" | created 55 | mod 55 | ver 1]
rev={main:52 sub:0}, value=[key "/registry/pods/default/cGyPD" | val "50" | created 52 | mod 52 | ver 1]
rev={main:47 sub:0}, value=[key "/registry/pods/default/tzdxB" | val "45" | created 47 | mod 47 | ver 1]
rev={main:43 sub:0}, value=[key "/registry/pods/default/w3EyG" | val "42" | created 43 | mod 43 | ver 1]
rev={main:42 sub:0}, value=[key "/registry/pods/default/4JCLw" | val "41" | created 42 | mod 42 | ver 1]
rev={main:40 sub:0}, value=[key "/registry/pods/default/nyn3R" | val "39" | created 14 | mod 40 | ver 2]
rev={main:38 sub:0}, value=[key "/registry/pods/default/5k2u5" | val "16" | created 38 | mod 38 | ver 1]
rev={main:37 sub:0}, value=[key "/registry/pods/default/ytwZZ" | val "37" | created 37 | mod 37 | ver 1]
rev={main:36 sub:0}, value=[key "/registry/pods/default/OQVJw" | val "36" | created 36 | mod 36 | ver 1]
rev={main:33 sub:0}, value=[key "/registry/pods/default/gm4gQ" | val "33" | created 33 | mod 33 | ver 1]
rev={main:28 sub:0}, value=[key "/registry/pods/default/RDgx0" | val "29" | created 28 | mod 28 | ver 1]
rev={main:27 sub:0}, value=[key "/registry/pods/default/TYdSY" | val "28" | created 27 | mod 27 | ver 1]
rev={main:23 sub:0}, value=[key "/registry/pods/default/WVd2B" | val "24" | created 23 | mod 23 | ver 1]
rev={main:20 sub:0}, value=[key "/registry/pods/default/F9XSE" | val "21" | created 20 | mod 20 | ver 1]
rev={main:16 sub:0}, value=[key "/registry/pods/default/Q9JZ7" | val "17" | created 16 | mod 16 | ver 1]
dev-dsk-chaochn-2c-a26acd76 % etcd-dump-db iterate-bucket testdata/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/server-TestRobustnessExploratoryKubernetesLowTrafficClusterOfSize1-test-0/member/snap/db meta --decode=true
key="term", value=3
key="storageVersion", value="3.6.0"
key="scheduledCompactRev", value={301 0}
key="consistent_index", value=539
key="confState", value="{\"voters\":[14578408409545168728],\"auto_leave\":false}"
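To check a dump like the one above mechanically rather than by eye, a small script can parse the `iterate-bucket` output and report gaps and deletions. This is a minimal sketch in Python and not part of the etcd tooling: the regular expression is written against the line format shown above, and the helper names `parse_dump` and `find_gaps` are hypothetical. It assumes that entries decoded as `created 0 | mod 0 | ver 0` are deletions, which is consistent with the watch events reported for revision 298.

```python
import re

# Matches one decoded line of the Key bucket dump, in the format shown above.
LINE_RE = re.compile(
    r'rev=\{main:(\d+) sub:(\d+)\}, value=\[key "([^"]*)" \| val "([^"]*)" '
    r'\| created (\d+) \| mod (\d+) \| ver (\d+)\]'
)


def parse_dump(text):
    """Return a list of (main_rev, key, is_tombstone) tuples, in input order."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip shell prompts and blank lines
        main, _sub, key, _val, created, mod, ver = m.groups()
        # Assumption: a deletion is decoded with created/mod/ver all equal to 0.
        tombstone = (created, mod, ver) == ("0", "0", "0")
        entries.append((int(main), key, tombstone))
    return entries


def find_gaps(entries):
    """Return (low, high) pairs of stored revisions that are not consecutive."""
    revs = sorted({rev for rev, _key, _tomb in entries})
    return [(a, b) for a, b in zip(revs, revs[1:]) if b - a > 1]


# Five lines copied from the dump above.
SAMPLE = '''\
rev={main:299 sub:0}, value=[key "/registry/pods/default/5Dtb4" | val "208" | created 251 | mod 299 | ver 2]
rev={main:298 sub:0}, value=[key "/registry/pods/default/vZJa5" | val "" | created 0 | mod 0 | ver 0]
rev={main:297 sub:0}, value=[key "/registry/pods/default/obin7" | val "200" | created 141 | mod 297 | ver 2]
rev={main:262 sub:0}, value=[key "/registry/pods/default/65uJD" | val "199" | created 136 | mod 262 | ver 2]
rev={main:256 sub:0}, value=[key "/registry/pods/default/aI3l7" | val "198" | created 256 | mod 256 | ver 1]
'''

entries = parse_dump(SAMPLE)
gaps = find_gaps(entries)
tombstones = [rev for rev, _key, tomb in entries if tomb]
print("highest stored revision:", max(rev for rev, _key, _tomb in entries))
print("gaps:", gaps)
print("tombstones:", tombstones)
```

Gaps would be expected wherever compaction has removed superseded revisions, so a gap by itself does not show that the revision decreased.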

Spoke too soon: multiple laptops, different browsers, same issue (zip corrupted).

Hmm, last 4 operations before crash were deletes. Defrag removes revisions in Key bucket for keys that were deleted. Etcd infers the last revision based on Key bucket. After restart etcd went back as far as the last put operation. I have a bad feeling about this. cc @ahrtr

I will try to increase the number of deletes to reproduce a similar case. Looking at the results from a couple of runs, deletes are pretty rare in Kubernetes traffic; possibly the request type picker is broken.
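The concern above can be written down as a toy model. This is an illustrative sketch only and does not reproduce etcd's compaction, defragmentation, or restore code: the bucket is a plain dict, and `restored_revision` and `drop_tombstones` are hypothetical names. The operation kinds for revisions 296 to 301 are taken from the requests recorded in this report.

```python
def restored_revision(key_bucket):
    """Toy rule: after a restart, the current revision is the highest one stored."""
    return max(key_bucket)


def drop_tombstones(key_bucket, keep_latest):
    """Remove delete entries; optionally keep the newest entry even if it is a delete."""
    latest = max(key_bucket)
    return {
        rev: kind
        for rev, kind in key_bucket.items()
        if kind == "put" or (keep_latest and rev == latest)
    }


# revision -> operation kind, as recorded by the clients before the crash.
recorded = {
    296: "delete",
    297: "put",
    298: "delete",
    299: "delete",
    300: "delete",
    301: "delete",
}

# If every tombstone were dropped, the inferred revision would fall back
# to the last put.
without_latest = restored_revision(drop_tombstones(recorded, keep_latest=False))

# If the newest revision is always kept, the inferred revision is unchanged.
with_latest = restored_revision(drop_tombstones(recorded, keep_latest=True))

print("all tombstones dropped:", without_latest)
print("latest revision kept:", with_latest)
```

In this model the inferred revision only moves backwards when the newest tombstone is removed, so whether the newest revision survives compaction is the property worth checking in the real code.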

Hmm, last 4 operations before crash were deletes. Defrag removes revisions in Key bucket for keys that were deleted. Etcd infers the last revision based on Key bucket. After restart etcd went back as far as the last put operation. I have a bad feeling about this. cc @ahrtr

Trying to interpret the statement, the difference between the db file key bucket layout and the recorded response is due to compact & defrag after resuming from the panic, right?

But it does not make sense that the deletion returns a response with revision 298 for key /registry/pods/default/o95Cz while in the db file 298 corresponds to /registry/pods/default/vZJa5. Either the db file is wrong or etcd responded to the client with an incorrect revision number.

297 is correct and it is reflected as put in the db file.

rev={main:301 sub:0}, value=[key "/registry/pods/default/HhKkd" | val "207" | created 112 | mod 301 | ver 2]
rev={main:300 sub:0}, value=[key "/registry/pods/default/E7Zuy" | val "209" | created 140 | mod 300 | ver 2]
rev={main:299 sub:0}, value=[key "/registry/pods/default/5Dtb4" | val "208" | created 251 | mod 299 | ver 2]
rev={main:298 sub:0}, value=[key "/registry/pods/default/vZJa5" | val "" | created 0 | mod 0 | ver 0]
rev={main:297 sub:0}, value=[key "/registry/pods/default/obin7" | val "200" | created 141 | mod 297 | ver 2]
if(mod_rev(/registry/pods/default/NYpLu)==30).
  then(delete("/registry/pods/default/NYpLu")).
  else(get("/registry/pods/default/NYpLu")) -> success(deleted: 1), rev: 296

if(mod_rev(/registry/pods/default/obin7)==141).
  then(put("/registry/pods/default/obin7", "200")).
  else(get("/registry/pods/default/obin7")) -> success(ok), rev: 297

if(mod_rev(/registry/pods/default/o95Cz)==133).
  then(delete("/registry/pods/default/o95Cz")).
  else(get("/registry/pods/default/o95Cz")) -> success(deleted: 1), rev: 298

if(mod_rev(/registry/pods/default/OWSx2)==137).
  then(delete("/registry/pods/default/OWSx2")).
  else(get("/registry/pods/default/OWSx2")) -> success(deleted: 1), rev: 299

if(mod_rev(/registry/pods/default/V8bKj)==53).
  then(delete("/registry/pods/default/V8bKj")).
  else(get("/registry/pods/default/V8bKj")) -> success(deleted: 1), rev: 300

if(mod_rev(/registry/pods/default/Z68rD)==156).
  then(delete("/registry/pods/default/Z68rD")).
  else(get("/registry/pods/default/Z68rD")) -> success(deleted: 1), rev: 301

Compact -> `scheduledCompactRev` was set to `301`

Hmm, the last 4 operations before the crash were deletes. Defrag removes revisions in the Key bucket for keys that were deleted. Etcd infers the last revision based on the Key bucket. After restart, etcd went back as far as the last put operation. I have a bad feeling about this. cc @ahrtr

The workflow doesn't have any problem. Compact + defragmentation won't remove the last revision, even if it's a tombstone revision.

But it does not make sense that the deletion returns a response with revision 298 for key /registry/pods/default/o95Cz while in the db file 298 corresponds to /registry/pods/default/vZJa5. Either the db file is wrong or etcd responds to the client with an incorrect revision number.

The bbolt db file should be correct.

It looks like the report isn't consistent with regard to revision 298. So is it a bug in the test itself?

  • In client-1/watch.json
{"Events":[{"Type":"delete-operation","Key":"/registry/pods/default/o95Cz","Value":{"Value":"","Hash":0},"Revision":298,"IsCreate":false,"PrevValue":null}],"IsProgressNotify":false,"Revision":298,"Time":919365280},
  • In client-3/watch.json
{"Request":{"Key":"/registry/pods/","Revision":298,"WithPrefix":true,"WithProgressNotify":true,"WithPrevKV":true},"Responses":[{"Events":[{"Type":"delete-operation","Key":"/registry/pods/default/vZJa5","Value":{"Value":"","Hash":0},"Revision":298,"IsCreate":false,"PrevValue":{"Value":{"Value":"91","Hash":0},"ModRevision":94}}],"IsProgressNotify":false,"Revision":298,"Time":2073785080}
  • In client-4/watch.json
"Request":{"Key":"/registry/pods/","Revision":298,"WithPrefix":true,"WithProgressNotify":true,"WithPrevKV":true},"Responses":[{"Events":[{"Type":"delete-operation","Key":"/registry/pods/default/vZJa5","Value":{"Value":"","Hash":0},"Revision":298,"IsCreate":false,"PrevValue":{"Value":{"Value":"91","Hash":0},"ModRevision":94}}

It looks like the report isn't consistent with regard to revision 298. So is it a bug in the test itself?

The report is just a dump of the requests and events observed by each client. Different events for the same revision in watch mean that different watch requests observed different results. This is expected if there was an issue with the KV store, so we don't even validate watch, as its expected results will be broken.
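The kind of mismatch seen at revision 298 can be found mechanically by grouping the events from every client's watch.json by revision and flagging revisions where clients disagree. Below is a minimal sketch in Go; the `WatchEvent` type and `conflictingRevisions` function are hypothetical names for illustration and are not part of the robustness test framework. It assumes one event per revision, which holds for the single-operation transactions in this report.

```go
package main

import (
	"fmt"
	"sort"
)

// WatchEvent is a simplified, hypothetical view of one event in a client's watch.json.
type WatchEvent struct {
	Type     string
	Key      string
	Revision int64
}

// conflictingRevisions returns, in ascending order, the revisions at which
// clients observed different events. It assumes one event per revision.
func conflictingRevisions(byClient map[string][]WatchEvent) []int64 {
	seen := map[int64]WatchEvent{}
	conflicts := map[int64]bool{}
	for _, events := range byClient {
		for _, ev := range events {
			prev, ok := seen[ev.Revision]
			if !ok {
				seen[ev.Revision] = ev
				continue
			}
			if prev.Type != ev.Type || prev.Key != ev.Key {
				conflicts[ev.Revision] = true
			}
		}
	}
	out := make([]int64, 0, len(conflicts))
	for rev := range conflicts {
		out = append(out, rev)
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func main() {
	// Events copied from the client-1, client-3 and client-4 excerpts above.
	report := map[string][]WatchEvent{
		"client-1": {{"delete-operation", "/registry/pods/default/o95Cz", 298}},
		"client-3": {{"delete-operation", "/registry/pods/default/vZJa5", 298}},
		"client-4": {{"delete-operation", "/registry/pods/default/vZJa5", 298}},
	}
	fmt.Println(conflictingRevisions(report)) // prints [298]
}
```

A check like this only tells you that clients disagree; deciding which client saw the "right" event still requires comparing against the WAL or the db file.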

The dump from the WAL logs confirms that the inconsistency really happened. It includes operations that were observed by clients with the same revision.

term         index      type    data
   2           334      norm    header:<ID:15517309547693860485 > txn:<compare:<target:MOD key:"/registry/pods/default/NYpLu" mod_revision:30 > success:<request_delete_range:<key:"/registry/pods/default/NYpLu" > > failure:<request_range:<key:"/registry/pods/default/NYpLu" > > > 
   2           335      norm    header:<ID:15517309547693860486 > txn:<compare:<target:MOD key:"/registry/pods/default/obin7" mod_revision:141 > success:<request_put:<key:"/registry/pods/default/obin7" value:"200" > > failure:<request_range:<key:"/registry/pods/default/obin7" > > > 
   2           336      norm    header:<ID:15517309547693860487 > txn:<compare:<target:MOD key:"/registry/pods/default/o95Cz" mod_revision:133 > success:<request_delete_range:<key:"/registry/pods/default/o95Cz" > > failure:<request_range:<key:"/registry/pods/default/o95Cz" > > > 
   2           337      norm    header:<ID:15517309547693860491 > txn:<compare:<target:MOD key:"/registry/pods/default/OWSx2" mod_revision:137 > success:<request_delete_range:<key:"/registry/pods/default/OWSx2" > > failure:<request_range:<key:"/registry/pods/default/OWSx2" > > > 
   2           338      norm    header:<ID:15517309547693860495 > txn:<compare:<target:MOD key:"/registry/pods/default/WjVep" mod_revision:144 > success:<request_delete_range:<key:"/registry/pods/default/WjVep" > > failure:<request_range:<key:"/registry/pods/default/WjVep" > > > 
   2           339      norm    header:<ID:15517309547693860496 > txn:<compare:<target:MOD key:"/registry/pods/default/geATU" mod_revision:71 > success:<request_delete_range:<key:"/registry/pods/default/geATU" > > failure:<request_range:<key:"/registry/pods/default/geATU" > > > 
   2           340      norm    header:<ID:15517309547693860499 > txn:<compare:<target:MOD key:"/registry/pods/default/FFgJw" mod_revision:127 > success:<request_delete_range:<key:"/registry/pods/default/FFgJw" > > failure:<request_range:<key:"/registry/pods/default/FFgJw" > > > 
   2           341      norm    header:<ID:15517309547693860500 > txn:<compare:<target:MOD key:"/registry/pods/default/SP2m4" mod_revision:95 > success:<request_delete_range:<key:"/registry/pods/default/SP2m4" > > failure:<request_range:<key:"/registry/pods/default/SP2m4" > > > 
   2           342      norm    header:<ID:15517309547693860501 > txn:<compare:<target:MOD key:"/registry/pods/default/V8bKj" mod_revision:53 > success:<request_delete_range:<key:"/registry/pods/default/V8bKj" > > failure:<request_range:<key:"/registry/pods/default/V8bKj" > > > 
   2           343      norm    header:<ID:15517309547693860502 > txn:<compare:<target:MOD key:"/registry/pods/default/Z68rD" mod_revision:156 > success:<request_delete_range:<key:"/registry/pods/default/Z68rD" > > failure:<request_range:<key:"/registry/pods/default/Z68rD" > > > 
   2           344      norm    header:<ID:15517309547693860503 > txn:<compare:<target:MOD key:"/registry/pods/default/EE7tw" mod_revision:92 > success:<request_delete_range:<key:"/registry/pods/default/EE7tw" > > failure:<request_range:<key:"/registry/pods/default/EE7tw" > > > 
   2           345      norm    header:<ID:15517309547693860506 > compaction:<revision:301 > 
   2           346      norm    header:<ID:15517309547693860511 > txn:<compare:<target:MOD key:"/registry/pods/default/n3ZPy" mod_revision:163 > success:<request_delete_range:<key:"/registry/pods/default/n3ZPy" > > failure:<request_range:<key:"/registry/pods/default/n3ZPy" > > > 
   3           347      norm    
   3           348      norm    header:<ID:15517309547694342401 > cluster_member_attr_set:<member_ID:14578408409545168728 member_attributes:<name:"TestRobustnessExploratoryKubernetesLowTrafficClusterOfSize1-test-0" client_urls:"http://localhost:20000" > > 
   3           349      norm    header:<ID:15517309547694342481 > txn:<compare:<target:MOD key:"/registry/pods/default/vZJa5" mod_revision:94 > success:<request_delete_range:<key:"/registry/pods/default/vZJa5" > > failure:<request_range:<key:"/registry/pods/default/vZJa5" > > > 
   3           350      norm    header:<ID:15517309547694342483 > txn:<compare:<target:MOD key:"/registry/pods/default/5Dtb4" mod_revision:251 > success:<request_put:<key:"/registry/pods/default/5Dtb4" value:"208" > > failure:<request_range:<key:"/registry/pods/default/5Dtb4" > > > 
   3           351      norm    header:<ID:15517309547694342485 > txn:<compare:<target:MOD key:"/registry/pods/default/E7Zuy" mod_revision:140 > success:<request_put:<key:"/registry/pods/default/E7Zuy" value:"209" > > failure:<request_range:<key:"/registry/pods/default/E7Zuy" > > > 
   3           352      norm    header:<ID:15517309547694342486 > txn:<compare:<target:MOD key:"/registry/pods/default/E7Zuy" mod_revision:140 > success:<request_put:<key:"/registry/pods/default/E7Zuy" value:"206" > > failure:<request_range:<key:"/registry/pods/default/E7Zuy" > > > 
   3           353      norm    header:<ID:15517309547694342488 > txn:<compare:<target:MOD key:"/registry/pods/default/HhKkd" mod_revision:112 > success:<request_put:<key:"/registry/pods/default/HhKkd" value:"207" > > failure:<request_range:<key:"/registry/pods/default/HhKkd" > > > 

The question remains whether this inconsistency was caused by a hardware or a software issue. The number of deletes before the failpoint observed in the report exceeds the number in my reproduction.

The lack of reproduction could be because the issue only happens in rare cases of robustness traffic generation. So I think it's still worth investigating.

The dump from the WAL logs confirms that the inconsistency really happened.

To make it clearer, it's a one-node cluster. The bbolt db is consistent with the WAL data with regard to revision 298, and also consistent with the reports of clients 3 and 4 of the robustness test (I only checked the reports of clients 1, 3 and 4).

  • The key /registry/pods/default/vZJa5 was removed at revision 298.

The only inconsistency is in client 1's report.

  • The key /registry/pods/default/o95Cz was removed at revision 298

Theoretically, there are two possible reasons:

  • The watch has an issue when distributing the response.
  • The robustness test has an issue when generating the client report.

But if I understand it correctly, all the clients in the robustness test share the same gRPC watch stream, so the etcdserver sent the data (revision 298) only once to the client side, and it's just that the etcd client SDK dispatches the event to all logical sub-streams/watchers. So it's highly unlikely to be a watch issue.

FYI, reproduced it with #17782 after 17160.566s (8 vcores, 16GiB, arm64, 5.15.0-1060-azure).

    logger.go:146: 2024-04-14T15:40:19.770Z     INFO    started server. {"name": "TestRobustnessExploratoryKubernetesLowTrafficClusterOfSize1-test-0", "pid": 224554}
    logger.go:146: 2024-04-14T15:40:19.770Z     INFO    Verifying cluster health after failpoint        {"failpoint": "compactBeforeSetFinishedCompact=panic()"}
    logger.go:146: 2024-04-14T15:40:20.771Z     INFO    Finished injecting failures
    logger.go:146: 2024-04-14T15:40:21.852Z     INFO    Recorded operations     {"operations": 937, "successRate": 0.7257203842049093}
    logger.go:146: 2024-04-14T15:40:21.852Z     INFO    Traffic from successful requests        {"qps": 266.3974175431209, "operations": 680, "period": "2.552577297s"}
    logger.go:146: 2024-04-14T15:40:21.852Z     INFO    Finished simulating traffic     {"max-revision": 415}
    logger.go:146: 2024-04-14T15:40:21.853Z     INFO    killing server...       {"name": "TestRobustnessExploratoryKubernetesLowTrafficClusterOfSize1-test-0"}
    logger.go:146: 2024-04-14T15:40:21.853Z     INFO    stopping server...      {"name": "TestRobustnessExploratoryKubernetesLowTrafficClusterOfSize1-test-0"}
    logger.go:146: 2024-04-14T15:40:21.856Z     INFO    stopped server. {"name": "TestRobustnessExploratoryKubernetesLowTrafficClusterOfSize1-test-0"}
    logger.go:146: 2024-04-14T15:40:21.858Z     INFO    Validating linearizable operations      {"timeout": "5m0s"}
    logger.go:146: 2024-04-14T15:40:25.004Z     ERROR   Linearization failed    {"duration": "3.145713744s"}
    validate.go:36: Failed linearization, skipping further validation
    logger.go:146: 2024-04-14T15:40:25.004Z     INFO    Saving robustness test report   {"path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1"}
    logger.go:146: 2024-04-14T15:40:25.004Z     INFO    Saving member data dir  {"member": "TestRobustnessExploratoryKubernetesLowTrafficClusterOfSize1-test-0", "path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/server-TestRobustnessExploratoryKubernetesLowTrafficClusterOfSize1-test-0"}
    logger.go:146: 2024-04-14T15:40:25.004Z     INFO    Saving watch operations {"path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/client-1/watch.json"}
    logger.go:146: 2024-04-14T15:40:25.004Z     INFO    Saving operation history        {"path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/client-2/operations.json"}
    logger.go:146: 2024-04-14T15:40:25.005Z     INFO    Saving watch operations {"path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/client-3/watch.json"}
    logger.go:146: 2024-04-14T15:40:25.005Z     INFO    Saving operation history        {"path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/client-3/operations.json"}
    logger.go:146: 2024-04-14T15:40:25.006Z     INFO    Saving watch operations {"path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/client-4/watch.json"}
    logger.go:146: 2024-04-14T15:40:25.006Z     INFO    Saving operation history        {"path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/client-4/operations.json"}
    logger.go:146: 2024-04-14T15:40:25.007Z     INFO    Saving watch operations {"path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/client-5/watch.json"}
    logger.go:146: 2024-04-14T15:40:25.007Z     INFO    Saving operation history        {"path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/client-5/operations.json"}
    logger.go:146: 2024-04-14T15:40:25.008Z     INFO    Saving watch operations {"path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/client-6/watch.json"}
    logger.go:146: 2024-04-14T15:40:25.009Z     INFO    Saving operation history        {"path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/client-6/operations.json"}
    logger.go:146: 2024-04-14T15:40:25.009Z     INFO    Saving watch operations {"path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/client-7/watch.json"}
    logger.go:146: 2024-04-14T15:40:25.010Z     INFO    Saving operation history        {"path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/client-7/operations.json"}
    logger.go:146: 2024-04-14T15:40:25.010Z     INFO    Saving watch operations {"path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/client-8/watch.json"}
    logger.go:146: 2024-04-14T15:40:25.011Z     INFO    Saving operation history        {"path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/client-8/operations.json"}
    logger.go:146: 2024-04-14T15:40:25.011Z     INFO    Saving watch operations {"path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/client-9/watch.json"}
    logger.go:146: 2024-04-14T15:40:25.012Z     INFO    Saving operation history        {"path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/client-9/operations.json"}
    logger.go:146: 2024-04-14T15:40:25.012Z     INFO    Saving watch operations {"path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/client-10/watch.json"}
    logger.go:146: 2024-04-14T15:40:25.013Z     INFO    Saving operation history        {"path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/client-10/operations.json"}
    logger.go:146: 2024-04-14T15:40:25.013Z     INFO    Saving visualization    {"path": "/tmp/results/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/history.html"}
    logger.go:146: 2024-04-14T15:40:25.037Z     INFO    closing test cluster...
    logger.go:146: 2024-04-14T15:40:25.037Z     INFO    closing server...       {"name": "TestRobustnessExploratoryKubernetesLowTrafficClusterOfSize1-test-0"}
    logger.go:146: 2024-04-14T15:40:25.037Z     INFO    removing directory      {"data-dir": "/tmp/TestRobustnessExploratoryKubernetesLowTrafficClusterOfSize13457912620/001"}
    logger.go:146: 2024-04-14T15:40:25.037Z     INFO    closed test cluster.
--- FAIL: TestRobustnessExploratory (7.49s)
    --- FAIL: TestRobustnessExploratory/Kubernetes/LowTraffic/ClusterOfSize1 (7.48s)
FAIL
FAIL    go.etcd.io/etcd/tests/v3/robustness     17160.566s
FAIL
FAIL: (code:1):
  % (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' 'go.etcd.io/etcd/tests/v3/robustness' '-timeout=30m' '-v' '--count' '10000' '--timeout' '2000m' '--failfast' '--run' 'TestRobustnessExploratory/Kubernetes/LowTraffic/ClusterOfSize1')

Updated:

The revision decreased after the restart.
Checking it.

@ahrtr @serathius

Checked https://github.com/etcd-io/etcd/actions/runs/8659974818 and found "compaction encountered error","error":"mvcc: required revision is a future revision". I also see a similar error locally. It looks like some data was still cached before the crash. Hope it helps.

2024-04-12T10:04:33.4871783Z /home/runner/actions-runner/_work/etcd/etcd/bin/etcd (TestRobustnessExploratoryKubernetesLowTrafficClusterOfSize1-test-0) (29588): {"level":"warn","ts":"2024-04-12T10:04:33.474659Z","caller":"auth/store.go:1134","msg":"simple token is not cryptographically signed"}
2024-04-12T10:04:33.4875810Z /home/runner/actions-runner/_work/etcd/etcd/bin/etcd (TestRobustnessExploratoryKubernetesLowTrafficClusterOfSize1-test-0) (29588): {"level":"info","ts":"2024-04-12T10:04:33.475449Z","caller":"mvcc/kvstore.go:393","msg":"kvstore restored","current-rev":297}
2024-04-12T10:04:33.4880067Z /home/runner/actions-runner/_work/etcd/etcd/bin/etcd (TestRobustnessExploratoryKubernetesLowTrafficClusterOfSize1-test-0) (29588): {"level":"warn","ts":"2024-04-12T10:04:33.475519Z","caller":"mvcc/kvstore.go:397","msg":"compaction encountered error","error":"mvcc: required revision is a future revision"}
2024-04-12T10:04:33.4884515Z /home/runner/actions-runner/_work/etcd/etcd/bin/etcd (TestRobustnessExploratoryKubernetesLowTrafficClusterOfSize1-test-0) (29588): {"level":"info","ts":"2024-04-12T10:04:33.475569Z","caller":"mvcc/kvstore.go:400","msg":"resume scheduled compaction","scheduled-compact-revision":301}
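
For readers following along, the warning in these log lines is what one would expect when a compaction target is newer than the restored revision: the store came back at current-rev 297 while the scheduled compaction revision was 301. The sketch below is a toy model of that check, not the actual etcd code; the `store` type and `checkCompact` method are hypothetical names, and only the error text is taken from the log.

```go
package main

import (
	"errors"
	"fmt"
)

// errFutureRev carries the same message as the error in the log above.
var errFutureRev = errors.New("mvcc: required revision is a future revision")

// store is a toy stand-in for the MVCC store; only the field needed here is modeled.
type store struct {
	currentRev int64
}

// checkCompact rejects a compaction target that is newer than the store's
// current revision, which should never happen on a healthy store.
func (s *store) checkCompact(rev int64) error {
	if rev > s.currentRev {
		return errFutureRev
	}
	return nil
}

func main() {
	s := &store{currentRev: 297}     // "kvstore restored", current-rev 297
	fmt.Println(s.checkCompact(301)) // scheduled-compact-revision 301
}
```

Seen this way, the warning is a symptom rather than the cause: the scheduled compaction revision was valid when it was recorded, so the restored current revision must have moved backwards or was never persisted.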

Checked https://github.com/etcd-io/etcd/actions/runs/8659974818 and found "compaction encountered error","error":"mvcc: required revision is a future revision". I also see a similar error locally. It looks like some data was still cached before the crash. Hope it helps.

Thanks @fuweid for the finding. There are two possible reasons:

  • The latest revision hadn't been persisted to the bbolt db when etcd crashed. This isn't an issue, but we need to update the log to avoid confusion. See #17792
  • The revisions indeed decreased. It would be a critical issue if that's true. Added a verification. See #17791

We need to backport both PRs.

With #17815 this issue has been confirmed and reproduced on all supported release branches.

Root cause

Based on my discussion with @fuweid today, here is a summary of the root cause of this issue for anyone's reference.

  • When the latest revision is a tombstone revision, compacting the latest revision will remove it. This causes the revision to decrease in the bbolt data file.
  • If etcd crashes right before it persists the finished compaction revision, then the finished compact revision isn't persisted to bbolt.
    UnsafeSetFinishedCompact(tx, compactMainRev)
  • When etcd is started again, it loads all keys and gets the latest revision (the already decreased revision) from bbolt. Usually this isn't a problem, because etcd will correct the revision based on the finished compaction revision. That's exactly why I did not reproduce it previously when I raised the comment #17780 (comment). But if the finished compact revision hasn't been persisted, then there is no way for etcd to correct the revision.
    s.compactMainRev = finishedCompact

    if s.currentRev < s.compactMainRev {
        s.currentRev = s.compactMainRev
    }
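
The three steps above can be replayed with a small self-contained model. This is only a sketch under simplifying assumptions: the `db`, `record`, `compact`, `restore` and `newDB` names are hypothetical, the key bucket is a plain map keyed by main revision, and the data loosely follows the tail of the failing run (a put at 297 followed by deletes at 298 to 301). It is not etcd's real compaction code.

```go
package main

import "fmt"

// record is one entry in a toy model of the bbolt key bucket.
type record struct {
	key       string
	tombstone bool
}

// db models only what matters here: the key bucket indexed by main revision
// and the persisted finished-compaction revision.
type db struct {
	keys            map[int64]record
	finishedCompact int64
}

// compact drops, for every key, all revisions <= rev except the newest one,
// and drops that newest one too when it is a tombstone. When crashBeforeFinish
// is true it returns before recording the finished compaction revision,
// mimicking the compactBeforeSetFinishedCompact failpoint.
func (d *db) compact(rev int64, crashBeforeFinish bool) {
	newest := map[string]int64{}
	for r, rec := range d.keys {
		if r <= rev && r > newest[rec.key] {
			newest[rec.key] = r
		}
	}
	for r, rec := range d.keys {
		if r > rev {
			continue
		}
		if r < newest[rec.key] || rec.tombstone {
			delete(d.keys, r)
		}
	}
	if crashBeforeFinish {
		return
	}
	d.finishedCompact = rev
}

// restore takes the highest revision left in the key bucket, then raises it
// to the finished compaction revision if that is larger.
func (d *db) restore() int64 {
	var currentRev int64
	for r := range d.keys {
		if r > currentRev {
			currentRev = r
		}
	}
	if currentRev < d.finishedCompact {
		currentRev = d.finishedCompact
	}
	return currentRev
}

// newDB builds illustrative data: a put at 297, then tombstones up to 301.
func newDB() *db {
	return &db{keys: map[int64]record{
		141: {"obin7", false},
		297: {"obin7", false},
		298: {"o95Cz", true},
		299: {"OWSx2", true},
		300: {"V8bKj", true},
		301: {"Z68rD", true},
	}}
}

func main() {
	clean := newDB()
	clean.compact(301, false)
	fmt.Println("clean compaction, restored rev:", clean.restore()) // 301

	crashed := newDB()
	crashed.compact(301, true)
	fmt.Println("crash before finish, restored rev:", crashed.restore()) // 297
}
```

In the clean run the finished compaction revision masks the missing tombstones; in the crashed run nothing does, so the store comes back at 297, which matches the "kvstore restored" log line earlier in the thread.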

Versions affected

All versions (3.4.x, 3.5.x, main) have this issue.

For a single-node cluster, the symptom is that the revision decreases.

For a multi-node cluster, the symptom is not only that the revision decreases, but also that revisions are inconsistent across the etcd cluster.

Note that the key/value data is still consistent when this issue is reproduced.

Hard to reproduce

The good news is that this issue should be very hard to reproduce in a production environment, because it can only be reproduced when all of the following conditions are true:

  • Compact the latest revision;
  • The latest revision is a tombstone (a deletion);
  • etcd crashes after it removes the latest tombstone revision and before it persists the finishedCompactRevision. Obviously it's a very small window.

Solution

One proposed solution: #17815 (comment)

Another solution is updating currentRev using the scheduledCompactRevision on bootstrap. See
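
A minimal sketch of this second idea, assuming the scheduled compaction revision is persisted before compaction starts and can never exceed the revision that was current when it was scheduled. The `meta` type and `restoreCurrentRev` function are hypothetical names for illustration; this is not the patch that was merged.

```go
package main

import "fmt"

// meta is a hypothetical view of the persisted values that matter on bootstrap.
type meta struct {
	highestKeyRev    int64 // highest main revision still present in the key bucket
	finishedCompact  int64 // written only after a compaction completes
	scheduledCompact int64 // written before a compaction starts
}

// restoreCurrentRev picks the revision to resume from. Besides the finished
// compaction revision it also considers the scheduled one, so a crash between
// removing the last tombstone and persisting the finished revision no longer
// lets the revision move backwards.
func restoreCurrentRev(m meta) int64 {
	rev := m.highestKeyRev
	if rev < m.finishedCompact {
		rev = m.finishedCompact
	}
	if rev < m.scheduledCompact {
		rev = m.scheduledCompact
	}
	return rev
}

func main() {
	// Illustrative values: tombstones 298..301 were removed, and the crash
	// happened before the finished compaction revision was written.
	crashed := meta{highestKeyRev: 297, finishedCompact: 0, scheduledCompact: 301}
	fmt.Println(restoreCurrentRev(crashed)) // prints 301
}
```

The design choice here is that the scheduled revision is a safe lower bound: a compaction is only accepted for revisions that already exist, so raising currentRev to it cannot invent a revision that was never assigned.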

Workaround

Once it's reproduced, we can use bump-revision to manually bump the revision so that all etcd instances have a consistent revision.

With the issue confirmed, we need to do an impact assessment: check which versions it affects, how often it can happen, whether it can cause inconsistency in a multi-member cluster, and whether it can happen in Kubernetes.

The good news is that this issue should be very hard to reproduce in a production environment.

Can you provide some context on why it should be very hard to reproduce, so that everyone can follow? Is it just based on the small window of crash vulnerability due to the infrequency of compaction? What does that probability look like for Kubernetes?

The fact that the last revision needs to be a tombstone reduces the chances, but it would be good to confirm what percentage of operations are deletes.

The good news is that this issue should be very hard to reproduce in a production environment.

Can you provide some context on why it should be very hard to reproduce, so that everyone can follow?

Sorry for the confusion. I should have been clearer. Just updated my previous comment.

Sorry for the confusion. I should have been clearer. Just updated my previous comment.

Thanks, looks great!

Compact the latest revision;

This makes sense; compacting the last revision is not expected behavior for most users. Kubernetes' built-in compaction should almost never do that. "Almost" comes from cases where there were no writes at all for 5 minutes, which is unexpected because Kubernetes Node Lease and leader election write periodically.

@fuweid I am going to release etcd 3.4.32 together with @spzala tomorrow night, can you please backport #17815 to 3.5 and 3.4? Please feel free to let me know if you don't have bandwidth. Thanks

Hi @ahrtr

There are 3 pull requests, please take a look. Thanks

@fuweid Thanks. Please also update the changelog for 3.4 and 3.5.