JuliaLang / julia

The Julia Programming Language

Home Page: https://julialang.org/


mmap failure with address space quotas

JonathanAnderson opened this issue · comments

I'm having a problem where, when I build from 3c7136e and run julia, I get the error "could not allocate pools". If I run as a different user, Julia runs successfully.

I think there might be something specific to my user on this box, but I am happy to help identify what is happening here.

I think this is related to #8699

also, from the julia-users group: https://groups.google.com/forum/#!topic/julia-users/FSIC1E6aaXk

This error is from a failing mmap, where we try to allocate 8GB of virtual address space. There might be a quota on virtual memory for some users.

Oops, misread the issue, sorry.

The -v: address space (kb) 8000000 limit (from the julia-users thread) seems to indicate that an 8GB allocation is guaranteed to cause trouble.

@carnaval Could we decrease this to, say, 4GB, to make this issue less likely?

An 8GB array is also too large for MSVC to compile, FWIW.

Can you try reducing the number in julia/src/gc.c (line 88 in e1d6e56):

#define REGION_PG_COUNT 16*8*4096 // 8G because virtual memory is cheap

by a factor of 2 or 4, and see if it helps? I could also make a test branch with that change to have the buildbot make test binaries, if that would be easier.
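For context, here is the arithmetic behind the 8GB figure as a standalone sketch (it assumes the 16kB GC page size, GC_PAGE_SZ, that gc.c used at the time; the program itself is illustrative, not Julia code):

/* Illustrative arithmetic only: REGION_PG_COUNT pages of GC_PAGE_SZ (16kB)
 * each add up to the 8GB region being discussed. */
#include <stdio.h>

#define GC_PAGE_SZ      (16 * 1024)     /* 16kB GC page, as in gc.c of that era */
#define REGION_PG_COUNT (16 * 8 * 4096) /* 524288 pages */

int main(void)
{
    unsigned long long region_bytes =
        (unsigned long long)REGION_PG_COUNT * GC_PAGE_SZ;
    printf("region size: %llu bytes = %llu GB\n",
           region_bytes, region_bytes >> 30); /* prints 8 GB */
    return 0;
}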

We can certainly lower this. I set it that high under the reasoning that address space is essentially free on x64. As I understand it, operations are either O(f(number of memory mappings)) or O(f(size of resident portion)), so it should not hurt performance.
I didn't think of arbitrary quotas, but it's probably better to ask: does anyone know of any other drawback to allocating "unreasonable" amounts of virtual memory on a 64-bit arch?

@carnaval Yes, indeed... lots of performance issues if you have very large amounts of memory mapped... which is why people use huge page support...

Keep in mind I'm still talking about uncommitted memory. The advantage of huge pages is reducing TLB contention as far as I know, and uncommitted memory sure won't end up in the TLB.

Generally, as far as my understanding of the kernel VM system goes, "dense" data structures (such as the page table, for which the TLB acts as a cache) are only filled with committed memory. The mapping itself stays in a "sparse" structure (like a list of mappings), so you only pay costs relative to the number of mappings. I may be wrong though, so I'll be happy to be corrected.

I'm talking about memory that has actually been touched, i.e. committed.
The issue is that if you have an (opt-in, at least) limit in the language, instead of just relying on things like ulimit, you can (at least in my experience) better control things and keep them from getting to the point where the OS goes belly-up. Say you have 60,000 processes running which you know only need, say, 128M (unless they somehow get out of control due to some bug): having the limit protects you.
You may also have different classes of processes that need more memory (say, loading a huge XML document), so it's important to be able to allow those to dynamically have a higher limit (based on user roles).

That's not what my question was about though. We are already careful to decommit useless pages.

The limit is another issue; enforcing it strictly would probably require parsing /proc/self/smaps from time to time anyway, to be sure some C library is not sneaking around making mappings.
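A hypothetical sketch of that smaps-polling idea (the helper name and the accounting choice are my own illustration): summing each mapping's "Size:" field gives the total mapped address space, including mappings made behind the runtime's back by C libraries.

#include <stdio.h>

static long long total_mapped_kb(void)
{
    FILE *f = fopen("/proc/self/smaps", "r");
    if (!f)
        return -1;
    char line[256];
    long long total = 0, kb;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "Size: %lld kB", &kb) == 1)
            total += kb; /* one "Size:" line per mapping */
    }
    fclose(f);
    return total;
}

int main(void)
{
    printf("mapped address space: %lld kB\n", total_mapped_kb());
    return 0;
}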

Yes, but does the current system ever try to proactively cut down on caches, etc., so that it can free up some memory?

It doesn't really have to be done strictly to be useful, without fancy approaches like parsing /proc/...
Also, for people embedding Julia, couldn't things be compiled so that at least malloc/calloc/realloc end up using a Julia version that does keep track?
Having some facility to try to increase stability is better than none, even if it can't handle external memory pressures.

I'm not arguing that we should not do those things. But those are features. I was just trying to check whether someone knew that some kernel would be slow with large mappings: that would be a regression, not a missing feature.


I'm running into a could not allocate pools issue on a new build on a new machine (0.3 works fine).
(Not sure whether this warrants a new issue or not, let me know.)

It builds fine, but it crashes when running the tests; the culprit is addprocs:

   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+6683 (2015-08-12 17:53 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 103f7a3* (0 days old master)
|__/                   |  x86_64-linux-gnu

julia> addprocs(3; exeflags=`--check-bounds=yes --depwarn=error`)
could not allocate pools

However, addprocs(2; exeflags=`--check-bounds=yes --depwarn=error`) works. Also starting more than three REPLs at once produces the error.

As far as I can tell there are no relevant ulimits:

 $ ulimit -a
-t: cpu time (seconds)         unlimited
-f: file size (blocks)         unlimited
-d: data seg size (kbytes)     unlimited
-s: stack size (kbytes)        8192
-c: core file size (blocks)    0
-m: resident set size (kbytes) unlimited
-u: processes                  63889
-n: file descriptors           1024
-l: locked-in-memory size (kb) 64
-v: address space (kb)         unlimited
-x: file locks                 unlimited
-i: pending signals            63889
-q: bytes in POSIX msg queues  819200
-e: max nice                   0
-r: max rt priority            0
-N 15:                         unlimited

On my normal machine the -l option is unlimited, but limiting it there to 64 does not reproduce this behavior.


The same problem arises using the Julia nightlies julia-0.4.0-24a92a9f5d-linux64.tar.gz.

Any ideas on how I could resolve this? Should I contact the admin of that machine to change some settings?

Yes, you can remove the 16* here, in julia/src/gc.c (line 164 in f40da0f):

#define REGION_PG_COUNT 16*8*4096 // 8G because virtual memory is cheap

and recompile.
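Concretely, the suggested one-line change would look like this (a sketch; the adjusted comment is illustrative):

/* Before: 16*8*4096 pages of 16kB each, i.e. 8GB of reserved address space. */
#define REGION_PG_COUNT 8*4096 // 512M per region, safe under common ulimits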

Maybe I should make that the default, but it feels so silly to me for admins to restrict address space; I don't really get it.


Yes, that works, thanks! Just to clarify: my understanding from this thread is that it is the -v: address space (kb) limit which causes this. However, that is unlimited on my machine. So which limit is the culprit?

Maybe I should make that the default, but it feels so silly to me for admins to restrict address space; I don't really get it.

@carnaval I know it is a fairly common practice in HPC clusters using Sun Grid Engine. Specifically, virtual memory (h_vmem) is limited to the same size as the physical memory requested for the job (mem_free) in order to avoid oversubscription of memory:

https://arc.liv.ac.uk/pipermail/gridengine-users/2005-April/004643.html (see discussion here)
https://jhpce.jhu.edu/2015/07/28/jhpce-cluster-change-to-set-user-memory-limit-tuesday-august-4th-from-600-pm-700-pm/ (see very last item here)

Therefore, with the settings the way they are presently, all julia jobs are forced to request at least ~8-9GB of RAM. The default for my SGE system is 5GB (so julia crashes by default), and it costs extra money to run with more RAM, while blocking other users' access to RAM that you might not otherwise need. I suspect that many julia users are in HPC environments where this is the case. For us, virtual memory is exactly as costly as real memory.

Hope this information is helpful; I know very little about GC or julia internals, but love the work you all continue to do here.

This issue is preventing v0.4.0 from being used where 0.3.x worked fine.
I have a RHEL 5.9 machine, x64 with 16GB, and many packages fail to compile with the 'could not allocate pools' error. I do not appear to have address space limits set, similar to mauro3, yet this issue is happening for me.

Recompiling from source is not an option for everyone: those who do not have full control of their operating environment may not have all the necessary dev tools. So if this memory allocation is not required, can the edit please be made to reduce the allocation, so that the fix ends up in the nightlies?

I am also getting this error. I managed to compile julia-0.4 by not starting Xorg, so I was happy, but then I get the error when julia tries to precompile a module.

My limits look right, though:

~  ᐅ ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         0
-m: resident set size (kbytes)      unlimited
-u: processes                       31567
-n: file descriptors                1024
-l: locked-in-memory size (kbytes)  unlimited
-v: address space (kbytes)          unlimited
-x: file locks                      unlimited
-i: pending signals                 31567
-q: bytes in POSIX msg queues       819200
-e: max nice                        30
-r: max rt priority                 99
-N 15:                              unlimited

I compiled from source, so I can modify the code and make it work, but this may be a no-go for normal users.

[EDIT: I have 8GB of RAM]

As @carnaval said:

Yes, you can remove the 16* here, in julia/src/gc.c (line 164 in f40da0f):

#define REGION_PG_COUNT 16*8*4096 // 8G because virtual memory is cheap

Maybe I should make that the default, but it feels so silly to me for admins to restrict address space; I don't really get it.

Your commit is the only silly thing here. Why do you think malloc() has a size argument at all? Why should one ever malloc something less than 128G?

Having such a show-stopper unfixed for almost two years is ridiculous. Just fix that, please.

@rudimeier, Julia is open source. That means you can fix bugs yourself! Rejoice and be thankful.

It seems the good practice of disabling overcommit globally has not been mentioned yet.

$ cat /proc/sys/vm/overcommit_memory
0

https://www.kernel.org/doc/Documentation/vm/overcommit-accounting

Virtual memory is not always "cheap", contrary to the comment in 7c8acce. On systems where good admins have set overcommit_memory=0, julia would really steal 8GB for nothing.
Many distros have overcommit_memory=0 by default, especially the enterprise ones.

Moreover:
http://www.etalabs.net/overcommit.html
"Overcommit is harmful because it encourages, and provides a wrong but plausible argument for, writing bad software."

BTW, Julia would even malloc 8GB on a machine with less physical memory available. Then there is no way to prevent a crash if the user really starts using it. The only way to prevent this is overcommit_memory=0. But then julia does not even start... silly.
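To make the two behaviors concrete, here is a minimal C sketch (my illustration, assuming 64-bit Linux overcommit semantics): under strict accounting (overcommit_memory=2) an oversized writable mapping fails up front with ENOMEM, while under heuristic overcommit it succeeds and trouble only arrives once the pages are actually touched.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t sz = 8ull << 30; /* 8GB, possibly more than RAM + swap */
    void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap"); /* the failure mode under overcommit_memory=2 */
        return 1;
    }
    printf("reserved %zu bytes; with overcommit enabled the danger comes "
           "only when these pages are written\n", sz);
    munmap(p, sz);
    return 0;
}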

@timholy I am not a julia user. I'm the admin of users who want to use it. I will not change our memory settings and limits just to let them run broken software.

Yes, you can remove the 16* here, in julia/src/gc.c (line 164 in f40da0f):

#define REGION_PG_COUNT 16*8*4096 // 8G because virtual memory is cheap

and recompile.

@rudimeier Please use a more constructive tone. Julia developers didn't ask you to do anything. Nobody here forces you to use "broken" software.

If you ask kindly and with solid technical arguments, we'll be happy to make this work for your users, but complaining with rude words is more likely to be counter-productive.

@nalimilan Sorry if I sound rude. Others have already posted many technical arguments, and so did I in my second post. Nevertheless, the whole bug report sounds like this will never be fixed anyway.

Julia simply locks 8GB of physical memory when running on a default Linux kernel (if it runs at all). Moreover it will not work (at least not well) on clusters or enterprise systems, where you almost certainly have to deal with ulimits.

The default Linux kernel has no issue with this (per your link above):

0 - Heuristic overcommit handling. Obvious overcommits of
address space are refused. Used for a typical system. It
ensures a seriously wild allocation fails while allowing
overcommit to reduce swap usage. root is allowed to
allocate slightly more memory in this mode. This is the
default.

and it may be slightly more efficient due to fewer syscalls (quoting carnaval above, "As I understand it, operations are either O(f(number of memory mappings)) or O(f(size of resident portion))").

Julia simply locks 8GB of physical memory

This is never actually true, even with overcommit turned off: virtual memory != physical memory.

Moreover it will not work (at least not well) on clusters

I would have expected that a cluster that uses this flag also ensures that each program is guaranteed to be able to allocate its full allotment. Setting that value currently involves changing a constant and compiling it into gc.c. Other than that, however, it shouldn't have any impact on the other users of the cluster (the scheduler should not have permitted them to allocate this memory either) or on Julia (which subdivides this pool on demand to meet the needs of the program).

julia developers would likely accept a PR to fix this, but the usage of ulimit in this way seems to me to just create additional non-critical work without providing any actual benefit.

(yes, I looked at the etalabs link above. However, I respectfully disagree with their premises and analogies. graceful malloc failures are good for dealing with bad requests like (int)-1, but not a solution for avoiding OOM. Even if that app is careful about handling malloc failure to revert the database state, does it plan on also handling the ENOMEM failure of read from said OOM condition? Further, why is this database not designed to be robust in the presence of unexpected termination? My impression is that this article is written from the perspective of "software written in C", without real consideration of all of the alternative approaches to language design that exist.)

@rudimeier, what specific issue are your users encountering when they try to run Julia?

On Thursday 12 November 2015, Jameson Nash wrote:

the default linux kernel defaults to having no issue with this (per
your link above) [...]

Oops, my memory was wrong about the default and its meaning. Actually overcommit_memory = 2 is the case I mean.

Julia simply locks 8GB physical memory

this is never actually true, even with overcommit turned off. virtual
memory != physical memory.

In case overcommit_memory = 2, each Julia process consumes 8G of /proc/meminfo's "CommitLimit". On a machine with 16G memory and no swap you couldn't start two Julia processes. (If I recall correctly, the kernel computes CommitLimit as swap plus overcommit_ratio percent of RAM, with overcommit_ratio defaulting to 50, so the exact headroom depends on the admin's settings.)

Moreover it will not work (at least not well) on clusters

I would have expected that a cluster that uses this flag also ensures
that each program is guaranteed to be able to allocate it's full
allotment.

overcommit_memory applies to the whole system, i.e. all processes; ulimit is per process. Who knows, on other non-Linux systems you might even have per-user limits, other defaults, or no overcommit support at all.

julia developers would likely accept a PR to fix this, but the usage
of ulimit in this way seems to me to just create additional
non-critical work, without providing any actual benefit.

The thing is that on clusters or other multiuser systems the admin has to prevent OOM situations, just like we need to have disk quotas. No user should affect any other user's job. On Linux it's not possible to set a "resident memory limit". That's why we limit virtual memory. It's not "silly", it's just what we have to do.

On systems where a scheduler manages the number of parallel running jobs, you would use ulimit to restrict each job so that the sum of all jobs can't exceed the total available memory. On interactively used multiuser machines you would use overcommit_memory = 2 so that all users together at least can't crash the whole machine, but hopefully only each other's processes.

(yes, I looked at the etalabs link above. However, I respectfully
disagree with their premises and analogies. graceful malloc failures
are good for dealing with bad requests like (int)-1, but not a
solution for avoiding OOM. Even if that app is careful about handling
malloc failure to revert the database state, does it plan on also
handling the ENOMEM failure of read from said OOM condition?
Further, why is this database not designed to be robust in the
presence of unexpected termination? My impression is that this
article is written from the perspective of "software written in C",
without real consideration of all of the alternative approaches to
language design that exist.)

Don't take the analogies too seriously. Just be aware that this overcommit design only works on systems where admins don't care about stability.

For me, 8GB looks just randomly chosen. A hardcoded 512MB malloc would IMO be bad enough, but it shouldn't be too much overhead on any recent real-life system.

We're all for adding options to make this work for you, but to do that we need to be able to reproduce the problem and then figure out what to do about it. Can you please post some details about how to make this happen since it doesn't happen with a vanilla Linux configuration? It also doesn't happen on OS X or Windows in the default configuration.

we need to be able to reproduce the problem and then figure out what to do about it. Can you please post some details about how to make this happen since it doesn't happen with a vanilla Linux configuration? It also doesn't happen on OS X or Windows in the default configuration.

I cannot speak to others' errors here (in particular, @mauro3 said above that they still experienced this issue even though their ulimit -v was set to unlimited). But to mimic the problem that users such as myself experience on a computing cluster, you can make a bash sub-shell (so you don't have to mess with the settings of your machine) that sets a limit of 5GB and open julia in that subshell:

( ulimit -v 5000000; julia )

For me, on my vanilla install of Linux Mint, this command causes the could not allocate pools error to arise, even though the command julia runs just fine on its own.

Hope that is helpful, sorry if it is a trivial/non-useful example. As a very happy user, thanks again for this great project.

That's very helpful – thank you. @carnaval, what are the options here? Should we introduce an environment variable that controls how big that pool is? Can we do without it altogether?

The intent of the code is to reserve address space (uncommitted memory). It is entirely possible that the way we do it on Linux relies on overcommit being enabled, and that would be a bug. If there is no way to do it without overcommit, I'd be quite sad, given how easy it is on Windows.

That does not solve arbitrary human limits, though. If ulimit -v is a common thing to do, then we need to deal with it. We can make those things variable-length and add a CLI limit option.

@carnaval On Linux, the overcommit is not counted against the process if the memory is mapped without PROT_WRITE (https://www.kernel.org/doc/Documentation/vm/overcommit-accounting). The commit charge is later adjusted when mprotect is called to make that range writable (or mprotect returns ENOMEM if that would cause an overcommit). The commit charge cannot be adjusted back down without munmapping the region.
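A minimal sketch of the reserve-then-commit pattern this describes (my illustration, not Julia's actual gc.c code): reserve with PROT_NONE so no commit charge accrues, then mprotect ranges to PROT_READ|PROT_WRITE as they are actually needed.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t region_sz = 1ull << 30; /* reserve 1GB of address space */

    /* PROT_NONE + MAP_NORESERVE: address space only, no commit charge */
    void *region = mmap(NULL, region_sz, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (region == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Commit only what is needed; this is where the charge is taken and
       where ENOMEM would surface under strict overcommit accounting. */
    size_t commit_sz = 16 * 4096;
    if (mprotect(region, commit_sz, PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect");
        return 1;
    }
    memset(region, 0, commit_sz); /* safe: this range is now committed */

    munmap(region, region_sz);
    return 0;
}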

Unfortunately, this goes beyond clusters.
I don't want to add noise to the thread, but as I commented above (https://github.com/JuliaLang/julia/issues/10390#issuecomment-149851612), I cannot run julia v0.4 on an 8GB RAM Arch Linux machine if KDE is running, for example.

That seems to be a typical scenario, and I found no way to fix it other than modifying the code and re-compiling. I feel many people will find this annoying, and it may stop them from trying julia if installing a package causes a sudden program exit.

#13968 should fix most of the concerns here but I still feel bad having useless pages count against the committed size.

@vtjnash If I read that correctly, the best we can do is at least start with PROT_NONE pages and mprotect them on the first allocation? It's weird that there is no proper way to decommit memory, even though infinite overcommit obviously "solves" the problem. I would have thought that at least some combination of madvise & mprotect would do the job.
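For reference, the madvise & mprotect combination being wished for would look roughly like this (hypothetical helper, my illustration; per the overcommit-accounting document linked above, on Linux this does not actually give the commit charge back, only munmapping or remapping the region does):

#include <sys/mman.h>

/* Hypothetical decommit: MADV_DONTNEED frees the resident pages, and
 * PROT_NONE blocks further access, but the commit charge stays. */
static void decommit(void *addr, size_t len)
{
    madvise(addr, len, MADV_DONTNEED);
    mprotect(addr, len, PROT_NONE);
}

int main(void)
{
    size_t len = 16 * 4096;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    decommit(p, len);
    munmap(p, len);
    return 0;
}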

I confirm that this change (setting REGION_PG_COUNT to 8*4096) corrects the build problem on Stampede.

I suggest making this the default, and backporting it to 0.4, as this solves a show-stopping problem on some systems. (Other solutions would work as well, e.g. checking whether the memory can be allocated and backing off if it can't, or querying the ulimit to find out how much to allocate, etc.) However, I think it's important that Julia works "out of the box".

Touching pages isn't very fast on Linux anyway; I find that you can rarely achieve more than a few GByte per second. (I'm talking, e.g., about calling memset the first time right after malloc; calling memset the second time is much faster.) Given this, the overhead of an mmap call every 0.5 GByte doesn't seem too important.
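The first-touch effect is easy to measure; a rough sketch (my example, numbers will vary by machine):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    size_t sz = 1ull << 30; /* 1 GByte */
    char *buf = malloc(sz);
    if (!buf)
        return 1;

    double t0 = now_sec();
    memset(buf, 1, sz); /* first touch: the kernel faults in every page */
    double t1 = now_sec();
    memset(buf, 2, sz); /* second touch: pages are already resident */
    double t2 = now_sec();

    printf("first touch:  %.2f GByte/s\n", 1.0 / (t1 - t0));
    printf("second touch: %.2f GByte/s\n", 1.0 / (t2 - t1));
    printf("%d\n", buf[0]); /* use the buffer so the stores aren't optimized out */
    free(buf);
    return 0;
}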

Regarding setting strict virtual memory limits: this does make sense on HPC systems for various reasons. Most importantly, they exist and they won't go away; this is part of the policy one has to accept when using large HPC systems. I'd prefer for Julia to "just work" under such circumstances.

rebase and review #13968

I am also suffering from an aspect of this problem: not from ulimit, but from trying to run on a system with 512M of memory and no swap. Setting overcommit_memory to 1 tends to help, but not always. This is on ARM; Julia starts out with a VM size of 620M. Though this is not the same problem as the one reported in this issue, I noted that similar ones are being closed and marked duplicate, so here I am.

I have access to three clusters at different universities, and I had to set REGION_PG_COUNT to 8*4096 to get Julia working on two of them. I'd therefore also really be grateful if this issue were resolved!

@carnaval @yuyichao What would the performance consequences be of just lowering that default value by a factor of 2? 4? 8? Whatever brings us into the realm of "below the ulimits people seem to be hitting".

I'll have a look at the performance impact of related GC changes this weekend.

As for me, this is happening on ARM for unclear reasons. The GC memory space is currently not expandable, I take it. If it were, I think minimal low-end hardware could afford a heap size of at least 64M without an issue, while expecting a size approaching 1G is ridiculous. Somewhere in between is the target.

Additionally, I request that this be configurable via jl_init or similar when Julia is embedded as a library, rather than only controllable by running the julia executable.
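For context, the embedding entry points the request refers to look like this (jl_init, jl_eval_string, and jl_atexit_hook are real julia.h API; the GC-region-size option being asked for does not exist at the time of this thread, and the jl_init signature has varied across versions, taking a julia home directory argument, e.g. jl_init(NULL), in 0.4/0.5):

#include <julia.h>

int main(void)
{
    jl_init(); /* the 8GB address-space reservation happens in here */
    jl_eval_string("println(\"hello from embedded Julia\")");
    jl_atexit_hook(0); /* orderly shutdown */
    return 0;
}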

The ARM issue is completely different. This is only an issue for those who cannot control the virtual address space limit; the amount of physical memory is irrelevant here.

Well, since the error message is the same, it at least seems related. Are you saying this happens not because of the size of the allocation but because of its location? Are these not related? The previous discussion made it sound like people were having problems because the system prevented oversubscription, which seems to indicate a problem with size. If that's the case, then that kind of makes sense too; it is true that OS support for the full 64-bit virtual address space is lacking.

Attempting to compile on XSEDE's Comet raised this error. Removing the 16* from gc.c allowed compilation to continue.

Comet's front end has a severely restricted memory limit (ulimit); you can only allocate 2 GByte. The solution is to request a compute node interactively and build there:

/share/apps/compute/interactive/qsubi.bash -p debug --nodes=1 --ntasks-per-node=24 -t 00:30:00 --export=ALL

Hi all, is this going to be backported to 0.4.x at some point? I'm stuck with this problem on a cluster. Thanks!

#16385 was a pretty large change; I'm not sure whether it can be easily backported. Are you building from source or using binaries? If the former, just change the number in the code and recompile. If the latter, I guess we could trigger an unofficial build with a smaller value.

I was using binaries; building is a nightmare on that system as well. I run into disk-space quota exceeded on the login node all the time, and I can't get the build to work on a compute node either. If you can trigger an unofficial 0.4.5 build, that would save my week. Thanks.

It might take a while to build, but check back at https://build.julialang.org/builders/package_tarball64/builds/435, and when it's done it should be available at https://julianightlies.s3.amazonaws.com/bin/linux/x64/0.4/julia-0.4.6-c7cd8171df-linux64.tar.gz (assuming you want 64-bit Linux; dropping by a factor of 8 will get you below your ulimit).

Awesome! Thanks.


@tkelman Thanks so much, it works out of the box like a charm! So much for "broken software". Outstanding support as usual. 👍 👍 👍


In case someone else stumbles over this: I was under the impression that this issue was resolved, but it still surfaced for me with the 0.5-rc4 binaries and a source build of rc4; see #18477.

The error now looks a bit different for me: it either just hangs at the tests when running make testall, or, when doing addprocs with a suitably high number, I get Master process (id 1) could not connect within 60.0 seconds. The fix is as before, but the constant now lives in src/gc-pages.c.

Reopened to be fixed in 0.5.x.

As mentioned in the related issue, this is really #17987. It no longer fails because we ask for a huge fixed size up front, and what remains is better handled by allowing users with special memory constraints to specify them directly.

Sorry to bother you with this, but I am still looking for a solution to this problem. I am working on a cluster where I have to request the maximum amount of virtual and physical memory that I will be using, and I have to request very large amounts in order for my job to run at all. This puts me in a significantly longer queue, because I basically need an entire compute node all to myself. This is with julia v0.5-rc3.

My job has the following memory requirements when run on a single compute node on that same cluster.

Any advice on how to deal with this would be greatly appreciated.