Elpi fails on OCaml-multicore
moyodiallo opened this issue
Hi,
I found that the package doesn't work on OCaml multicore (ocaml-variants.4.12.0+domains): it fails with a segmentation fault:
[1] 2563302 segmentation fault (core dumped) ./_build/install/default/bin/elpi -test -I -I tests/sources/ ackermann.elpi
I also ran a test with the OCaml nnpchecker (ocaml-variants.4.12.0+nnpchecker or ocaml-option-nnpchecker.1), which reports out-of-heap pointers:
make build
./_build/install/default/bin/elpi -test -I _build/install/default/bin/../lib/elpi/ -I tests/sources/ ackermann.elpi
Parsing time: 0.001
Compilation time: 0.002
Out-of-heap pointer at 0x7fe85d590610 of value 0xffff8017a27027c2. Cannot read head.
Out-of-heap pointer at 0x7fe85d5905c0 of value 0xffff8017a2702102. Cannot read head.
Out-of-heap pointer at 0x7fe85d590570 of value 0xffff8017a26ff99a. Cannot read head.
Out-of-heap pointer at 0x7fe85d590520 of value 0xffff8017a27187c2. Cannot read head.
Out-of-heap pointer at 0x7fe85d5904d0 of value 0xffff8017a2718102. Cannot read head.
Out-of-heap pointer at 0x7fe85d5b4630 of value 0xffff8017a2554322. Cannot read head.
Out-of-heap pointer at 0x7fe85d5b45f8 of value 0xffff8017a2a55662. Cannot read head.
Out-of-heap pointer at 0x7fe85d5b45c0 of value 0xffff8017a2a554fa. Cannot read head.
...
This seems to come from the use of the Obj module, as is the case for some other code. Could you please review your usage of Obj?
This is important to consider for the upcoming OCaml 5.00: ocaml-health-check
I'm aware of the failure. There is one use of Obj which I'm getting (slowly) rid of in this branch: #118. I guess you have the right compiler set up, could you please run it on this branch?
Branch #118 also segfaults on these switches:
ocaml-base-compiler.4.12.0
ocaml-variants.4.12.0+domains
ocaml-option-nnpchecker.1
ok, thanks for testing.
any hints on what became unsafe?
It's about using the Obj module: Obj.magic makes the program unsafe.
This is a simple example:
type typeA = Name of string | Nat
let n1 = Name "fifo"
let n2 = Nat
let n = Obj.magic n2
let m = String.get n 0
---------------------------------------------
$ocamlopt test.ml
$./a.out
[2] 2616032 segmentation fault (core dumped) ./a.out
A program without Obj would be better because internal objects in the runtime are not stable either.
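For comparison, here is a minimal sketch of the same access written without Obj, reusing the typeA definition from the example above. The helper name first_char is hypothetical. Pattern matching forces the Nat case to be handled, so an immediate value can never be read as a string.

```ocaml
type typeA = Name of string | Nat

(* Return the first character of the name, if there is one.
   first_char is a hypothetical helper used only for illustration. *)
let first_char (v : typeA) : char option =
  match v with
  | Name s when String.length s > 0 -> Some (String.get s 0)
  | Name _ | Nat -> None

let () =
  assert (first_char (Name "fifo") = Some 'f');
  assert (first_char Nat = None)
```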
Sorry, I was not clear. I know what Obj is. I was asking what changed in the OCaml runtime that made previously "legit" uses of magic illegal.
I will investigate this. I don't recall having many uses of Obj anyway.
I think I found the culprit, but I don't know how to fix it or to replace this code:
Lines 614 to 712 in 332dbea
In particular this is the offending part:
Lines 644 to 655 in 332dbea
This data structure is a faster List.assq: it uses the address of a boxed value to build a non-authoritative search tree, and uses that tree instead of a linear scan. If the GC has moved the object to another address, we fall back to the linear scan and update the search tree.
I lack knowledge of the new GC; would you mind CCing knowledgeable people?
"A program without Obj would be better because internal objects in the runtime are not stable either"
What I meant is that values obtained through Obj.magic can easily be seen by the GC as naked pointers. OCaml+domains doesn't allow naked pointers, and the next OCaml version (5.00) will support domains by default.
@kayceesrk, @Engil: do you have anything to share here?
FTR, I'm happy to replace this data structure with something else. Even better if the table keys were weak (but I did read somewhere that ephemerons are not available in 5.0...)
@gares Ephemerons are available in 5.0. Some of the unsupportable functions in multicore have been removed. Please find the supported API here: https://github.com/ocaml-multicore/ocaml-multicore/blob/5.00/stdlib/ephemeron.mli. The removed functions have also been marked deprecated on trunk OCaml: https://github.com/ocaml/ocaml/blob/trunk/stdlib/ephemeron.mli.
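As an illustration, here is a minimal sketch of a weak-keyed table built with the Ephemeron.K1.Make functor from the standard library. The key type, its id field, and the fresh helper are hypothetical and not taken from Elpi. The sketch assumes each key can carry a unique integer to hash on.

```ocaml
(* Hypothetical key type: each key carries a unique id assigned at creation. *)
type key = { id : int; mutable payload : string option }

(* Weak hash table: a binding can be collected once its key is dead. *)
module WeakTbl = Ephemeron.K1.Make (struct
  type t = key
  let equal = ( == )              (* physical equality on keys *)
  let hash k = Hashtbl.hash k.id  (* hash the stable id, not the contents *)
end)

(* Hypothetical allocator handing out fresh ids. *)
let fresh =
  let counter = ref 0 in
  fun () -> incr counter; { id = !counter; payload = None }

let () =
  let tbl = WeakTbl.create 16 in
  let k = fresh () in
  WeakTbl.add tbl k "some data";
  assert (WeakTbl.find_opt tbl k = Some "some data");
  assert (not (WeakTbl.mem tbl (fresh ())))
```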
Thanks, I'll give ephemerons a try then (I believe they were not there when I coded this).
But what about the int cast trick I was doing here? I found that 5.0 uses more tag bits to represent domains, but I could not find why this code is now broken. I'm curious.
I tried to port the code to ephemerons, but I had forgotten that I can't, which is why I had this custom map.
The problem is that the boxed value I have cannot be hashed. In particular, it is something like term option ref, and the instances I need to put in a map are the ones where the ref points to None (it's a logic programming language: unification variables are mutable, they are born unassigned and eventually end up assigned).
So I can't possibly provide a decent hash function, since all ref cells I need to use as keys contain 0. The old code was using the address of the ref cell as a hash value, and the rest of the code was coping with the fact that the GC could move the cell (something that happens for sure, but not very frequently, so after all the lookup was quick on the average).
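Here is a small sketch of the hashing problem described above, using a simplified stand-in for the real type (the term definition here is hypothetical, not Elpi's). The generic Hashtbl.hash looks at contents, so two distinct unassigned cells hash identically. Only physical equality (==), which List.assq uses, tells them apart.

```ocaml
(* Hypothetical, simplified stand-in for Elpi's term type. *)
type term = Const of string

(* Two distinct, unassigned unification variables. *)
let v1 : term option ref = ref None
let v2 : term option ref = ref None

let () =
  (* Structural hashing cannot distinguish two unassigned variables. *)
  assert (Hashtbl.hash v1 = Hashtbl.hash v2);
  (* Physical equality can: they are distinct heap cells. *)
  assert (v1 != v2);
  (* List.assq compares keys with (==), so a linear scan still works. *)
  let env = [ (v1, "X"); (v2, "Y") ] in
  assert (List.assq v2 env = "Y")
```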
So I'm afraid I need to repair the old code. @kayceesrk could you shed some light on why the old code is broken? (the address_of thing seems the culprit).
After some thinking I found a way, see PR #127. The performance is a bit weird, so I need to investigate this PR further, but it is OKish (there is some effect on code paths which are not really touched by the patch).
One very weird thing is that the elpi parser (written in camlp5) now needs a lot more of stack space.
See 831b06d
This is not a blocking issue, since I want to ditch camlp5 eventually, but it looks very weird to me.
Is it a known "problem" of multicore? Does it ring a bell?
The increased stack space requirement doesn't ring a bell. It may be useful to open an issue on the OCaml GitHub repo when you get the chance. Thanks!