mirage / digestif

Simple hash algorithms in OCaml

Release runtime system in native implementation

andersfugmann opened this issue · comments

Would it be possible to change the library to release the runtime lock during update operations in the native version, so that users can offload update operations to other CPUs?

I can make a PR to call caml_release_runtime_system() and caml_acquire_runtime_system() from the update functions operating on bigarrays.
However, this would require that the memory location of ctx is not moved by the OCaml GC while the update function is running. One easy solution would be to change ctx from type By.t to Bi.t (bigarray). Alternatively, create a custom type that allocates the internal structure outside the OCaml heap.
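For illustration, here is a hedged sketch of what the OCaml side of such a binding could look like; the stub names and exact signatures below are hypothetical, not digestif's actual ones, and the lock itself would be released and reacquired inside the C stub:

```ocaml
(* Hypothetical external declarations sketching the two binding styles.
   Bigarray data lives outside the OCaml heap, so its pointer stays valid
   even while the runtime lock is released; a Bytes.t ctx may be moved by
   the GC, so the C stub would have to copy it before releasing the lock. *)

type ba =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Current style: the stub never re-enters the runtime, so it can be
   declared [@@noalloc] and never releases the lock. *)
external sha256_update_noalloc : Bytes.t -> ba -> int -> int -> unit
  = "caml_sha256_update_noalloc" [@@noalloc]

(* Proposed style: the stub calls caml_release_runtime_system () /
   caml_acquire_runtime_system () around the hashing loop, so the
   [@@noalloc] attribute must be dropped. *)
external sha256_update_release : Bytes.t -> ba -> int -> int -> unit
  = "caml_sha256_update_release"
```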

Hmmhmm, the question is a little bit hard. I'm a little bit afraid of changing ctx to a Bigstring.t while keeping the referential-transparency assumption (which is really needed, since it fixed a bug in ocaml-git). Indeed, copying ctx at every step (see the dup function) would not be efficient when ctx is a bigarray.

On the other hand, releasing the runtime lock should indeed be interesting for performance. I think we need a benchmark to see whether it makes a significant difference.

/cc @samoht, @hannesm, @cfcs and @hcarty

I'm not sure I understand what you mean about the referentially transparent assumption on ctx.

My idea behind converting to a bigarray is that the data is allocated outside the OCaml heap and will therefore not move, hence we can safely relinquish the runtime lock and call update_ba.

If we keep ctx as a string, we need to copy it outside the OCaml heap for every call to update_ba, and then copy it back.

I can run some tests to measure the performance when multiple threads calculate e.g. SHA256 through multiple calls to update_ba, and post the results as well.

> If we keep ctx as a string, we need to copy it outside the OCaml heap for every call to update_ba, and then copy it back.

This is exactly what we do to avoid side effects on ctx (we make a new one at every computation/call to an update function). dup explicitly copies ctx at the beginning of any update. Only in internal code (like the digesti_* functions) do we keep physically the same ctx across updates while iter runs, and then return the new ctx; the old one (the one given by the client) is never modified.

You can see some examples at this line.
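To make that copy-on-update discipline concrete, here is a minimal OCaml sketch under assumed names (dup, unsafe_feed and the functor are illustrative, not digestif's actual code):

```ocaml
(* The exposed [feed] never mutates the caller's ctx, so the interface
   stays pure even though the underlying update is in-place. *)
module Pure_feed (H : sig
  type ctx
  val dup : ctx -> ctx                      (* copy the internal state *)
  val unsafe_feed : ctx -> string -> unit   (* mutates ctx in place *)
end) = struct
  (* Returns a fresh ctx; the argument is left untouched. *)
  let feed ctx msg =
    let ctx = H.dup ctx in
    H.unsafe_feed ctx msg;
    ctx

  (* Internal, digesti-style pattern: keep mutating the same private copy
     while the iterator runs, then return it; the caller's ctx is intact. *)
  let feedi ctx iter =
    let ctx = H.dup ctx in
    iter (fun msg -> H.unsafe_feed ctx msg);
    ctx
end
```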

So, from your request (ctx is a bigarray), we have two possibilities:

  • continue to dup ctx at the beginning of any computation: we keep digestif's semantics, but this is not efficient for a bigarray
  • break the semantics and allocate ctx only once, keep it as long as we need it, and change it (by side effect) at every computation: note that this was digestif's semantics before #29

Outside the scope of digestif, #29 magically fixed a problem in ocaml-git (to be honest, I did not try to understand exactly why #29 fixed ocaml-git; I just said: hmmhmm, deeply side-effectful stuff, and I was too lazy to dig into it). I just added a test case where I got the error.

So I mostly followed a request from @hannesm and @cfcs, who pointed me to an update of mirleft/ocaml-nocrypto#125 (see this commit).

I'm a big fan of the interface introduced in #29, and I think that breaking that by reverting to the previous interface would be a mistake.

I have never used OS threads in OCaml programs, so I don't really know anything about the runtime lock. While the memory pointed to by a Bigarray won't move around if we release the runtime lock, how do you guarantee that it doesn't get free()'d and/or reused?

@andersfugmann can you describe your scenario and the problems you face with the current interface in a bit more depth?

I would, depending on the overhead involved, be happy to take advantage of a change like this.

Releasing the runtime lock would allow a program to calculate hashes from inside of a preemptive thread without blocking anything else. That means a cohttp-lwt server, for example, could calculate digests in a preemptive thread pool without (much) blocking of other concurrent requests.
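For illustration, a minimal sketch (assuming the lwt.unix library; sha256_of_bigstring is a placeholder, not a digestif function name) of how such offloading would look:

```ocaml
(* Hash a buffer in Lwt's preemptive thread pool.  This only pays off if the
   C stub releases the runtime lock while hashing; otherwise the detached
   thread still serialises with the main thread. *)
let digest_in_pool sha256_of_bigstring buf : _ Lwt.t =
  Lwt_preemptive.detach sha256_of_bigstring buf
```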

All the arguments are alive (part of the root set) while the function is being called, so they will not be freed during the call, even if the function releases the runtime lock (at least that's my understanding).

The problem I'm trying to address is that calling the update_ba functions currently does not release the runtime lock, meaning that at most one core can be utilized. I also never use system threads explicitly in OCaml programs, but both async and lwt provide APIs for executing functions in a separate thread so as not to block the main thread. This only makes sense if those functions can release the runtime lock. More concretely, I have an application that performs multiple concurrent uploads to AWS S3. The only CPU-consuming operation is SHA256 calculation, and the application at times reaches 100% CPU utilization, which hurts latency and throughput.

Wrt. ctx being copied or a bigarray: I misread the API, as I thought that update would modify the context. Sorry for the confusion.

I do agree that it's nice to have a pure interface. I will try to produce a sample PR showing how this can be implemented without changing the semantics, along with some benchmarks, and then we can take it from there.

Feel free to propose a PR for that, and then we will have more material to make the best choice 👍.

Unfortunately, I decided (before your issue) to freeze the projects needed by ocaml-git in order to do a proper release. So this issue will not be integrated into digestif.0.7.

I have made an implementation, and will create a PR shortly with the changes for further discussion.

I've also benchmarked the execution speed of calling Digestif_native.SHA256.Bigstring.update to understand the extra overhead of stripping [@noalloc], copying ctx onto the C stack and releasing the runtime lock:

┌────────────────┬──────────────────┬──────────────────┬───────────────────┐
│ Test           │     Data size: 0 │    Data size: 64 │   Data size: 1024 │
├────────────────┼──────────────────┼──────────────────┼───────────────────┤
│ reference      │   8.47ns (x1.00) │ 344.07ns (x1.00) │ 5378.25ns (x1.00) │
│ release        │  79.79ns (x9.42) │ 414.12ns (x1.20) │ 5462.04ns (x1.02) │
│ alloc          │   8.67ns (x1.02) │ 356.71ns (x1.04) │ 5389.23ns (x1.00) │
│ no_release     │  26.50ns (x3.13) │ 354.64ns (x1.03) │ 5417.55ns (x1.01) │
│ reference(-O3) │   8.66ns (x1.02) │ 284.69ns (x0.83) │ 4442.58ns (x0.83) │
└────────────────┴──────────────────┴──────────────────┴───────────────────┘

The tests were done using Core_bench. The function tested is
Digestif_native.SHA256.Bigstring.update (Bytes.copy ctx) ba 0 n,
where ba is a bigarray and n is the size (starting at offset 0).
No allocations are done in the tight loop (ctx is not duplicated).

  • reference: standard implementation, used as the baseline.
  • release: full implementation that releases the runtime lock before doing the CPU-intensive work.
  • alloc: only removes the [@noalloc] clause from the external call.
  • no_release: same as release, but without actually releasing the runtime lock.
  • reference(-O3): standard implementation, with the native stubs compiled with -O3.
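For reference, a rough sketch of a Core_bench harness along these lines; update, ctx and ba are assumed to be supplied by the variant under test (e.g. Digestif_native.SHA256.Bigstring) and are not defined here:

```ocaml
open Core
open Core_bench

(* Build one timed test per data size, mirroring the expression quoted
   above: copy ctx, then feed [n] bytes of [ba] starting at offset 0. *)
let make_tests ~update ~ctx ~ba =
  List.map [ 0; 64; 1024 ] ~f:(fun n ->
      Bench.Test.create ~name:(sprintf "update, %d bytes" n) (fun () ->
          update (Bytes.copy ctx) ba 0 n))

(* Run the benchmark suite for one implementation variant. *)
let run ~update ~ctx ~ba =
  Command.run (Bench.make_command (make_tests ~update ~ctx ~ba))
```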

From the table we can see that releasing and reacquiring the runtime lock adds a constant overhead of about 50ns per call (release vs. no_release: 79.79ns - 26.50ns ≈ 53ns). Removing [@noalloc] has close to no effect. Copying the ctx to the C stack costs around 17ns.

If the update functions are generally called on large data buffers, the extra overhead is negligible; but if data sizes around 64 bytes are used, there will be an impact of about 20% (which would almost be cancelled out if we were to compile the native C stubs with -O3).

Alternatively, we could add an update_mt function to the API to let the user choose between the two implementations.

On my side, from the materials, benchmarks and pull request, all seems to be very good. I just want to clarify some points.

digestif has a constraint on its API (a shared interface). If we provide an update_mt function in the C implementation, we need to provide something similar in the OCaml implementation (which does not make any sense there).
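A hypothetical sketch of that shared-signature constraint (the names below are illustrative, not digestif's actual signature): both backends must satisfy the same interface, which is why a C-only update_mt would not fit.

```ocaml
module type HASH = sig
  type ctx
  type bigstring =
    (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

  val init : unit -> ctx
  (* small buffers: no lock-release overhead in the C backend *)
  val update_bytes : ctx -> Bytes.t -> ctx
  (* large buffers: the C backend may release the runtime lock here;
     the pure-OCaml backend simply computes as usual *)
  val update_bigstring : ctx -> bigstring -> ctx
end
```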

I'm aware of the constant overhead when we release the lock, but we need to keep in mind that we already provide two ways to feed ctx: the Bytes.t way (which does not have this overhead) and the Bigstring.t way. As you said, update_ba should be called on large data buffers, and, as you said again, the overhead is negligible in that context. On the other side, update_string/update_bytes should be called on small data buffers.

Given all of this, I prefer to merge your PR (and have update_ba release the lock in any case) instead of providing both update_ba and update_mt.

Then, about -O3: I'm not a big fan, since this level of optimization can arbitrarily change the semantics of C code (-O2 should be fine).

And finally, it's good to keep the semantics of ctx. I would like to know what people think about your PR first, but it seems reasonable 👍!

Thanks for your comments. The -O3 numbers were provided out of curiosity :-).
There is much more to be gained by using advanced instruction sets (SSE / AVX2 / AVX-512).

@andersfugmann: Ah, thanks for clarifying! This sounds like a good change, thanks for pointing it out and working on this!

All is green (people, Travis and so on), so thanks for your work; this is ready for the next release of digestif.