ocaml-multicore / multicoretests

PBT testsuite and libraries for testing multicore OCaml

Home Page: https://ocaml-multicore.github.io/multicoretests/

'STM _ ref test parallel asymmetric' failure to trigger

jmid opened this issue

Since adding agree_prop_par_asym to STM_domain in #315 we have seen occasional cases of the _ ref test parallel asymmetric failing to trigger.

Here's a fresh case from yesterday, on Cygwin part2 running trunk (5.2.0):
https://github.com/ocaml-multicore/multicoretests/actions/runs/5201990235/jobs/9388312861

random seed: 513749813
generated error fail pass / total     time test name
[...]

[ ]    0    0    0    0 / 2000     0.0s STM int ref test parallel asymmetric
[ ] 1501    0    0 1501 / 2000   395.8s STM int ref test parallel asymmetric
[✗] 2000    0    0 2000 / 2000   410.2s STM int ref test parallel asymmetric

[...]

Previously this was observed:

  • in #346 on macOS 5.1 with int64 ref test parallel asymmetric
  • in #339 on Windows trunk int ref test parallel asymmetric
  • in #330 on Cygwin trunk and 5.1 with int ref test parallel asymmetric and int64 ref test parallel asymmetric

It would be nice to use statistics, as outlined in #362, to make this negative test trigger more reliably.
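As a rough illustration of the kind of estimate this could involve (my own assumption, not necessarily what #362 outlines): if each generated iteration independently triggers the failure with some probability `p`, then the negative test fails to trigger in `count` iterations with probability `(1 - p)^count`. The names `miss_probability` and `required_count` and the value of `p` below are hypothetical and only serve the example; `p` is an assumed input, not a measured rate.

```ocaml
(* Sketch under an independence assumption: every iteration triggers the
   failure with the same probability [p], independently of the others. *)

(* Probability that none of [count] iterations triggers the failure. *)
let miss_probability ~p ~count = (1. -. p) ** float_of_int count

(* Smallest iteration count for which the miss probability is at most [eps].
   Requires 0 < p < 1 and 0 < eps < 1. *)
let required_count ~p ~eps =
  int_of_float (ceil (log eps /. log (1. -. p)))

let () =
  (* [p] is a made-up per-iteration trigger probability, for illustration. *)
  let p = 0.001 in
  List.iter
    (fun count ->
       Printf.printf "count = %d: miss probability %.4f\n"
         count (miss_probability ~p ~count))
    [2000; 5000];
  Printf.printf "count needed for a 1%% miss rate: %d\n"
    (required_count ~p ~eps:0.01)
```

Under this model, raising the count only helps exponentially in `p * count`, so a very small `p` on some platform would still produce occasional misses even at a higher count.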

A fix may involve switching away from Semaphore.Binary, as @shym originally asked about in #315 (comment).

I saw this again on Cygwin part 2 5.1~alpha2
https://github.com/ocaml-multicore/multicoretests/actions/runs/5224077280/jobs/9437485264

random seed: 478928519
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM int ref test parallel
[ ]    0    0    0    0 / 1000     0.0s STM int ref test parallel (generating)
[ ]    6    0    0    6 / 1000    68.4s STM int ref test parallel (shrinking:    9.0003)
[ ]    6    0    0    6 / 1000   132.7s STM int ref test parallel (shrinking:   11.0002)
[ ]    6    0    0    6 / 1000   193.6s STM int ref test parallel (shrinking:   12.0002)
[ ]    6    0    0    6 / 1000   254.4s STM int ref test parallel (shrinking:   12.0009)
[ ]    6    0    0    6 / 1000   315.6s STM int ref test parallel (shrinking:   13.0008)
[ ]    6    0    0    6 / 1000   376.0s STM int ref test parallel (shrinking:   14.0008)
[✓]    7    0    1    6 / 1000   424.9s STM int ref test parallel

[ ]    0    0    0    0 / 1000     0.0s STM int64 ref test parallel
[ ]    0    0    0    0 / 1000    18.0s STM int64 ref test parallel (shrinking:    5.0002)
[ ]    0    0    0    0 / 1000    85.0s STM int64 ref test parallel (shrinking:   10.0005)
[ ]    0    0    0    0 / 1000   153.5s STM int64 ref test parallel (shrinking:   11.0007)
[✓]    1    0    1    0 / 1000   170.6s STM int64 ref test parallel

[ ]    0    0    0    0 / 2000     0.0s STM int ref test parallel asymmetric
[ ]  790    0    0  790 / 2000    42.9s STM int ref test parallel asymmetric
[ ] 1444    0    0 1444 / 2000   251.0s STM int ref test parallel asymmetric
[ ] 1961    0    0 1961 / 2000   324.8s STM int ref test parallel asymmetric
[✗] 2000    0    0 2000 / 2000   337.9s STM int ref test parallel asymmetric

[ ]    0    0    0    0 / 2000     0.0s STM int64 ref test parallel asymmetric
[ ]  419    0    0  419 / 2000    47.3s STM int64 ref test parallel asymmetric
[ ]  612    0    0  612 / 2000   108.1s STM int64 ref test parallel asymmetric
[ ]  647    0    0  647 / 2000   175.7s STM int64 ref test parallel asymmetric (shrinking:    0.0012)
[✓]  648    0    1  647 / 2000   179.2s STM int64 ref test parallel asymmetric

File "src/neg_tests/dune", line 19, characters 7-27:
19 |  (name stm_tests_domain_ref)
            ^^^^^^^^^^^^^^^^^^^^
(cd _build/default/src/neg_tests && ./stm_tests_domain_ref.exe --verbose)
Command exited with code 1.

--- Info -----------------------------------------------------------------------

Negative test STM int ref test parallel failed as expected (14 shrink steps):

                        |                
                        |                
             .---------------------.
             |                     |                
          Add 190                Decr               
            Get                                     

[...]

--- Failure --------------------------------------------------------------------

Test STM int ref test parallel asymmetric failed:

Negative test STM int ref test parallel asymmetric succeeded but was expected to fail

[...]

================================================================================
failure (1 tests failed, 0 tests errored, ran 4 tests)

Reopening as this is showing up again.
#368 switched away from Semaphore.Binary to using an int Atomic uniformly.
Unfortunately on macOS this change represents a regression:

  • On eeb3a55 (a direct commit with only src/README updates) we observed
    • macOS trunk: both STM {int,int64} ref test parallel asymmetric failed to trigger
    • macOS 5.1: STM int ref test parallel asymmetric failed to trigger
  • On #371 both macOS 5.1 and trunk failed to trigger STM int64 ref test parallel asymmetric

Ideally, we should try to understand why this pattern doesn't work consistently under macOS.

Even before switching from Semaphore.Binary to int Atomic we saw occasional failures to trigger under macOS.
As such, I don't think simply using Semaphore.Binary on macOS and int Atomic elsewhere is going to cut it... 🤔
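For readers following along, the two synchronization styles mentioned above can be sketched roughly as follows. This is a simplified illustration, not the actual `STM_domain` code: `run_asym_sem` and `run_asym_atomic` are hypothetical helpers, and the assumption is that the parent domain waits until the child domain signals that it is up before running its own commands. It needs OCaml 5, where `Domain`, `Atomic` and `Semaphore` are all part of the standard library.

```ocaml
(* Hypothetical helpers, not the multicoretests implementation. *)

(* Variant 1: the child releases a binary semaphore once it is running,
   and the parent blocks on it before starting its own work. *)
let run_asym_sem (parent : unit -> 'a) (child : unit -> 'b) : 'a * 'b =
  let child_up = Semaphore.Binary.make false in
  let d =
    Domain.spawn (fun () ->
        Semaphore.Binary.release child_up;
        child ())
  in
  Semaphore.Binary.acquire child_up;
  let a = parent () in
  let b = Domain.join d in
  (a, b)

(* Variant 2: the child sets an int Atomic and the parent spins on it. *)
let run_asym_atomic (parent : unit -> 'a) (child : unit -> 'b) : 'a * 'b =
  let child_up = Atomic.make 0 in
  let d =
    Domain.spawn (fun () ->
        Atomic.set child_up 1;
        child ())
  in
  while Atomic.get child_up = 0 do
    Domain.cpu_relax ()
  done;
  let a = parent () in
  let b = Domain.join d in
  (a, b)

let () =
  (* Unsynchronized increments on a shared ref: updates may be lost when the
     two domains overlap, which is the kind of misbehaviour a negative test
     hopes to observe. *)
  let r = ref 0 in
  let bump () = for _ = 1 to 1_000 do incr r done in
  let (), () = run_asym_atomic bump bump in
  Printf.printf "final value: %d (at most 2000)\n" !r
```

In both variants the child may already have finished by the time the parent starts, in which case nothing overlaps. How often that happens depends on the scheduler, which would be consistent with the platform-dependent behaviour reported here.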

With the merge of #371 into main this just happened again:

The merge into main of #376 triggered another one of these on macOS 5.1 (hopefully the last before #377)
https://github.com/ocaml-multicore/multicoretests/actions/runs/5558723998/jobs/10154146293

random seed: 518031986
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM int ref test parallel
[ ]    0    0    0    0 / 1000     0.0s STM int ref test parallel (generating)
[✓]    1    0    1    0 / 1000     3.3s STM int ref test parallel

[ ]    0    0    0    0 / 1000     0.0s STM int64 ref test parallel
[✓]    4    0    1    3 / 1000     1.8s STM int64 ref test parallel

[ ]    0    0    0    0 / 2000     0.0s STM int ref test parallel asymmetric
[✗] 2000    0    0 2000 / 2000    21.8s STM int ref test parallel asymmetric

[ ]    0    0    0    0 / 2000     0.0s STM int64 ref test parallel asymmetric
[✓]  475    0    1  474 / 2000    15.4s STM int64 ref test parallel asymmetric

...

Observed an occurrence of this again, on macOS 5.0.0, despite the bump to a count of 5000:
https://github.com/ocaml-multicore/multicoretests/actions/runs/5633978353/job/15263432136

random seed: 119145923
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 5000     0.0s STM int ref test parallel asymmetric
[ ]    0    0    0    0 / 5000     0.0s STM int ref test parallel asymmetric (generating)
[ ] 4588    0    0 4588 / 5000    60.0s STM int ref test parallel asymmetric
[✗] 5000    0    0 5000 / 5000    65.0s STM int ref test parallel asymmetric

As this is the only such observation so far, it may however be limited to 5.0.0.

Just saw this again on macOS 5.1.