rdicosmo / parmap

Parmap is a minimalistic library that lets OCaml programs exploit multicore architectures with minimal modifications.

Home Page: http://rdicosmo.github.io/parmap/


parmap throws Out of memory exception on 32-bit architectures

josch opened this issue · comments

commented

Hi,

I'm using parmap to parallelize parts of the creation of Debian dependency graphs. Recently I noticed that given the same input, my program would quit with:

Fatal error: exception Out of memory

on all 32-bit architectures. Here is an example build log of the problem on i386:

https://buildd.debian.org/status/fetch.php?pkg=botch&arch=i386&ver=0.18-1&stamp=1473844641

The problem does not occur in the exact same situation on a 64-bit architecture like amd64. I can also confirm that the problem is not a lack of physical memory, because I get the same error inside a 32-bit i386 chroot on an amd64 system where my program using parmap runs without problems.

Furthermore, the problem goes away if I increase the parallelism of parmap. In my specific setup the problem occurs when I split the job into two but disappears when I split the job into four.

I rebuilt my software with debugging enabled and ran it with OCAMLRUNPARAM=b and was able to confirm that the exception gets thrown by this line in my code:

Parmap.parmap ~ncores:num_cores worker (Parmap.L todo)

I did not yet rebuild parmap with debugging enabled to follow the error deeper into the parmap code.

Since this problem is 32-bit-specific, I guess this is related to OCaml string or buffer size constraints?

Hi Josh,
indeed, we do some black magic in Parmap to allocate the space for
the result vector/list: since the result size cannot, in principle,
be known in advance, we allocate a predefined fixed maximum space,
and if the result of the parmap is too big, you can get an out-of-memory error.

On 64-bit architectures this space is so huge that it should be ok for all
foreseeable computations.

On 32-bit architectures, though, there is not enough virtual memory address space,
and we are forced to allocate a much smaller predefined space.

Apparently, you are the first Parmap user to hit this limit on 32-bit
architectures: congratulations ;-)

Roberto


Roberto Di Cosmo
Professeur (on leave at/détaché à INRIA Rocquencourt)
IRIF, Université Paris Diderot, Case 7014, 5 Rue Thomas Mann, F-75205 Paris Cedex 13, France
E-mail: roberto@dicosmo.org
Web: http://www.dicosmo.org
Twitter: http://twitter.com/rdicosmo
Office locations: Paris Diderot, Bureau 3020 (3rd floor), Bâtiment Sophie Germain, 8 place Aurélie Nemours (Métro: Bibliothèque F. Mitterrand, ligne 14/RER C), Tel: +33 1 57 27 92 20; INRIA, Bureau C123, Bâtiment C, 2 Rue Simone Iff (Métro: ligne 6 Dugommier, ligne 14/RER A Gare de Lyon), Tel: +33 1 80 49 44 42
GPG fingerprint: 2931 20CE 3A5A 5390 98EC 8BFC FCCA C3BE 39CB 12D3

Hmm, looking at the current source code, I see that my previous comment is imprecise: it was appropriate for an old implementation of Parmap.

You can see the relevant comments in the source code starting from here:
https://github.com/rdicosmo/parmap/blob/master/parmap.ml#L92

In recent versions of Parmap, we introduced another approach to avoid issues on *bsd systems: we precompute the size of the result data structure, by marshaling it to a string first.

Hence, Parmap ends up using more memory than you might expect (the result data structure, the marshaled string, and an mmap of the same size), so if you handle huge datasets on 32-bit architectures, you may exhaust the memory space earlier than expected.
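As a rough illustration of the size-precomputation step described above (a sketch of the idea, not Parmap's actual code), the size of a value's marshaled representation can be measured like this:

```ocaml
(* Sketch, not Parmap's actual code: marshal the result to a string
   first, and use the string's length to size the shared-memory
   mapping. Note the transient cost this implies: the result, its
   marshaled string, and the mapping all coexist in memory. *)
let marshaled_size (v : 'a) : int =
  String.length (Marshal.to_string v [])

let () =
  Printf.printf "marshaled size of a 1000-element int list: %d bytes\n"
    (marshaled_size (List.init 1000 (fun i -> i)))
```

This is what makes the peak memory usage roughly three times the result size, which matters on address-space-constrained 32-bit systems.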

commented

Aha, thanks for your explanation!

Then indeed this problem sounds like a WONTFIX, because you are hitting the limits of OCaml on 32-bit architectures.

I guess the best thing for me to do would be to recursively split my input list in two and pass the sub-lists to parmap with the requested parallelism, one after another, until no job gives out-of-memory errors anymore.

On the other hand, maybe, because of this OCaml limitation, it would make sense for parmap to take care of this itself in the same fashion when it runs into an out-of-memory error?

I'll post my out-of-parmap solution here once I have it, but feel free to close this bug report, and thanks for your input!

commented

I'm now doing this:

    let rec splitjob maxsplits todo =
      try
        Parmap.parmap ~ncores:num_cores worker (Parmap.L todo)
      with
      | Failure "input_value_from_block: bad object" -> begin
          if maxsplits <= 0 then fatal "exceeded maximum split depth";
          warning "parmap raised out of memory exception with list length %d, splitting job..." (List.length todo);
          let l1, l2 = List.split_nth ((List.length todo)/2) todo in
          let res1 = splitjob (maxsplits-1) l1 in
          let res2 = splitjob (maxsplits-1) l2 in
          List.append res1 res2
        end
    in
    let maxsplits = 4 in
    splitjob maxsplits todo

Maybe what parmap could do would be to throw a more specific exception than just Failure, which could mean anything (I have seen that exception before). It would be wrong to simply catch an Out_of_memory exception here, because that exception seems to be raised inside the processes created by parmap. Hence, if I caught Out_of_memory directly, I would end up with a tree of processes instead of only as many processes as given by num_cores.

So maybe parmap could throw a more specific exception when it runs into an Out_of_memory exception itself, so that I can distinguish that sort of failure from others and only split lists when it is really an out-of-memory problem?
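A user-side sketch of the distinction asked for here (hypothetical names; Parmap itself provides no such exception, and matching the exact Failure message is fragile, since, as it turns out later in this thread, the message changed across OCaml versions):

```ocaml
(* Hypothetical user-side wrapper, NOT part of Parmap's API: translate
   the marshalling failure observed above into a dedicated exception,
   so callers can distinguish a too-big result from other failures.
   Fragile: the matched Failure message depends on the OCaml version. *)
exception Parmap_result_too_big

let parmap_checked ~ncores worker todo =
  try Parmap.parmap ~ncores worker (Parmap.L todo)
  with Failure "input_value_from_block: bad object" ->
    raise Parmap_result_too_big
```

A caller could then catch Parmap_result_too_big and split the input list, while letting every other exception propagate unchanged.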

commented

It seems not to be possible for parmap to forward a more detailed explanation of the exception because the Failure "input_value_from_block: bad object" comes from the marshalling module...

Indeed, this is something which is really not easy to fix cleanly: my feeling is
that for heavy jobs one wants to move to 64-bit architectures anyway, but your
use case is a clear motivation for rethinking this position.

Let's see how your experiments fare :-)

Roberto


Doesn't this mean that the result is limited by the maximum size of a string in OCaml, which isn't that large on 32-bit platforms (16 MB, IIRC)?
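For reference, the 16 MB figure can be checked directly. On a 32-bit OCaml runtime, Sys.max_string_length is 16_777_211 (just under 16 MiB); on 64-bit it is astronomically larger, which matches the behaviour reported in this issue:

```ocaml
(* Print the maximum string length of the current runtime. On 32-bit
   OCaml this is 16_777_211 (just under 16 MiB), which bounds the size
   of any single marshaled result. *)
let () = Printf.printf "Sys.max_string_length = %d\n" Sys.max_string_length
```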

OK, so if the results of the map function are too big, one gets this error:

Raised by primitive operation at file "parmap.ml", line 99, characters 11-34
Called from file "parmap.ml", line 209, characters 13-46
Called from file "src/codec.ml", line 145, characters 22-435
Called from file "src/utls.ml", line 36, characters 12-20
Called from file "src/utls.ml", line 36, characters 12-20
Called from file "src/utls.ml", line 48, characters 12-19
Called from file "src/codec.ml", line 166, characters 9-16

One workaround would be to use the parany library (conflict-of-interest declaration: I am the author of parany). Parany would work even if the list were infinite.
This issue is a won't-fix, so I am closing it.

commented

@UnixJunkie can parany be used as a drop-in replacement for parmap? Does it lack features that parmap has? Is it superior to parmap in other ways than the issue discussed here?

@UnixJunkie can parany be used as a drop-in replacement for parmap?

No. The interface is different, cf. https://github.com/UnixJunkie/parany/blob/master/src/parany.mli

Does it lack features that parmap has?

Core pinning, which might improve performance but introduces difficulties when several parallel programs are running on the same computer.

Is it superior to parmap in other ways than the issue discussed here?

Parany is meant to do stream processing in parallel. This is a different use case than parmap's.
Also, with some elbow grease, you could do the same using parmap as a backend.

commented

Can we look at this issue again? With an update from OCaml 4.05.0 to 4.08.1, the marshalling module no longer throws Failure "input_value_from_block: bad object", and catching Out_of_memory is wrong for the reasons explained above. This in turn leads to a segmentation fault I reported here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=946980.

Could this be better handled in parmap?

@josch my advice: try using parany and see if the bug goes away.

Here is how to do a parmap with parany:

let parmap ~ncores ?(csize = 1) f l =
  if ncores <= 1 then BatList.map f l
  else
    let input = ref l in
    let demux () = match !input with
      | [] -> raise Parany.End_of_input
      | x :: xs -> (input := xs; x) in
    let output = ref [] in
    let mux x =
      output := x :: !output in
    (* for safety *)
    Parany.set_copy_on_work ();
    Parany.set_copy_on_mux ();
    (* parallel work *)
    Parany.run ~verbose:false ~csize ~nprocs:ncores ~demux ~work:f ~mux;
    !output

If that works, you can get rid of the ugly error catching and retrial code.

commented

This works fine on amd64 but the same command on i386 yields: Fatal error: exception Invalid_argument("Netmcore_mempool.create_mempool: bad size")

commented

Funnily, the same exception is thrown even when I reintroduce the "ugly catching and retry code" and hand lists as short as 11 elements to parany... But of course only on i386.

can you show the full stack trace?

export OCAMLRUNPARAM=b

then rerun your crashing program

Here is a workaround that might work.
Notice the new set_shm_size line.
It seems that your computer cannot even allocate one GB of memory?!

let parmap ~ncores ?(csize = 1) f l =
  if ncores <= 1 then BatList.map f l
  else
    let input = ref l in
    let demux () = match !input with
      | [] -> raise Parany.End_of_input
      | x :: xs -> (input := xs; x) in
    let output = ref [] in
    let mux x =
      output := x :: !output in
    (* for safety *)
    Parany.set_copy_on_work ();
    Parany.set_copy_on_mux ();
    Parany.set_shm_size (1024 * 1024);
    (* parallel work *)
    Parany.run ~verbose:false ~csize ~nprocs:ncores ~demux ~work:f ~mux;
    !output

commented

Two changes were needed. Adding Parany.set_shm_size (1024 * 1024); was necessary, and on top of that /dev/ had to be mounted, or otherwise I'd get Fatal error: exception Unix.Unix_error(Unix.ENOSYS, "shm_open", "/mempool_a20f4a40"). Maybe parany could check by itself whether /dev/shm exists and throw a descriptive error message otherwise?
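A minimal sketch of the check suggested here (hypothetical; as noted in the replies, the shm_open call actually originates in ocamlnet, not in parany itself):

```ocaml
(* Hypothetical pre-flight check, not part of parany or ocamlnet:
   fail early with a descriptive message if /dev/shm is missing,
   instead of letting shm_open fail later with ENOSYS. *)
let check_shm_available () =
  if not (Sys.file_exists "/dev/shm") then
    failwith "shared-memory pools require /dev/shm to be mounted on this system"
```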

No, my computer can allocate one GB of memory just fine. This is not about my computer, which is a 64-bit amd64 machine, but about other people's computers, which might be 32-bit only. This includes architectures like armel, armhf, i386, mipsel, powerpc and x32. Specifically, I ran into this problem because my package botch makes use of parmap and failed to build from source after OCaml in Debian was upgraded from 4.05.0 to 4.08.1. See the build results here: https://buildd.debian.org/status/package.php?p=botch and the segmentation faults at the bottom of the build logs.

The good news is that I don't seem to need the "ugly catching and retry code" anymore. The bad news is that parany as well as the cpu module are not in Debian yet. I packaged both of them already, but it will take a few months until they pass the Debian NEW queue. Thanks for your help!

Parany doesn't do anything with /dev/shm explicitly.
That looks like a problem with ocamlnet, which is trying to open a file in there.

You can describe the problem you had and the special system you are using there (I guess not having /dev is quite exotic):
https://gitlab.com/gerdstolpmann/lib-ocamlnet3/issues

commented

It seems I celebrated too early. List.map as well as Parmap.parmap yielded bit-for-bit reproducible output across multiple invocations. With Parany.run I get slightly different output every time. I have to investigate the differences...

The parmap I sent you doesn't preserve the input order (because preserving it would be to the detriment of parallelization efficiency).

commented

Indeed. I adjusted the wrapper like this:

    let inarray = Array.of_list l in
    let outarray = Array.make (List.length l) (0,[]) in
    let counter = ref 0 in
    let demux () =
      if !counter = List.length l then
        raise Parany.End_of_input
      else
        let res = (!counter, inarray.(!counter)) in
        incr counter;
        res
    in
    let mux (i,x) = outarray.(i) <- x in
    let fwrapper (i,x) = (i, f x) in
    (* for safety *)
    Parany.set_copy_on_work ();
    Parany.set_copy_on_mux ();
    Parany.set_shm_size (1 * 1024 * 1024);
    (* parallel work *)
    Parany.run ~verbose:false ~csize ~nprocs:ncores ~demux ~work:fwrapper ~mux;
    Array.to_list outarray

That works fine on my amd64 machine but on i386 I get: Fatal error: exception Netsys_mem.Out_of_space

Without a full stack trace, it is almost impossible for me to help you.
Maybe try reducing the shm size even further; that would be my best blind guess at the moment.

commented

I solved it by splitting the list into smaller chunks in case Netsys_mem.Out_of_space is thrown. So this works now:

    let rec splitjob maxsplits todo =
      try
        let parmap ~ncores ?(csize = 10) f l =
          if ncores <= 1 then List.map f l
          else
            let inarray = Array.of_list l in
            let outarray = Array.make (List.length l) (0,[]) in
            let counter = ref 0 in
            let demux () =
              if !counter = List.length l then
                raise Parany.End_of_input
              else
                let res = (!counter, inarray.(!counter)) in
                incr counter;
                res
            in
            let mux (i,x) = outarray.(i) <- x in
            let fwrapper (i,x) = (i, f x) in
            (* for safety *)
            Parany.set_copy_on_work ();
            Parany.set_copy_on_mux ();
            Parany.set_shm_size (128 * 1024 * 1024);
            (* parallel work *)
            Parany.run ~verbose:false ~csize ~nprocs:ncores ~demux ~work:fwrapper ~mux;
            Array.to_list outarray
        in parmap ~ncores:num_cores worker todo
      with Netsys_mem.Out_of_space -> begin
        if maxsplits <= 0 then fatal "exceeded maximum split depth";
        let l1, l2 = List.split_nth ((List.length todo)/2) todo in
        let res1 = splitjob (maxsplits-1) l1 in
        let res2 = splitjob (maxsplits-1) l2 in
        List.append res1 res2
      end
    in
    let maxsplits = 10 in
    splitjob maxsplits todo

A csize of 10 and an shm size of 128 MB seem to be alright.