saopicc / DDFacet

Hi,

I've been using DDFacet as part of the ddf-pipeline and have looked into getting it running on several HPC systems. I'm finding the /dev/shm requirements to be particularly limiting on shared HPC systems where requesting 100% of the RAM to be accessible by /dev/shm is not an option (as the standard across linux systems is 50%). Thus, in such circumstances, one needs to request double the amount of memory required which is suboptimal for several reasons (e.g. it occupies resources that could otherwise be used by others and limits one to only being able to use large-memory nodes, as is the case for me anyway). I'm wondering whether it would be feasible to implement an option to use fast local storage (e.g. SSD or nvme drives, which for instance is abundant on the hpc cluster here and shouldn't be significantly slower) as opposed to /dev/shm. I'd be very interested to hear the developers' thoughts on this!
Thanks, Kelly.

Hi My feeling is it would be substantially slower if you want things to go through a disk-based cache (zarr on disk arrays perhaps...) with SSD drives (most of the operations on visibility sets and large image buffers go through shm primarily to avoid the overheads of IPC - this is the primary reason for using shared memory). It is conceivable to put a proxy in front of the shared memory layer to just pass loaded cached objects directly as pickled objects, with the caveat that there will be huge performance and memory implications (each process will inherently have its own memory set, so the memory requirements will scale as NCPU which will probably bound you to a handful of cores) - particularly with Lofar/ska-era map sizes and degridding operations. The other alternative is to perhaps make this go through a ramdisk ( which your cluster admin will need to mount for you), but I haven't really explored the limitations of such a solution, so it may be bound by similar OS limitations as shm. All that said- It is not really needed (or advisable) to request 100%, but can't your sysadmin remount the shared memory to say 80-90% as a middle ground here - the factory imposed 50% defaults on Ubuntu is a rather superficial either way? I'm not sure which is more feasible though. @cyriltasse may have more ideas.

…

On Thu, 24 Mar 2022, 03:36 kgourdji, ***@***.***> wrote: Hi, I've been using DDFacet as part of the ddf-pipeline and have looked into getting it running on several HPC systems. I'm finding the /dev/shm requirements to be particularly limiting on shared HPC systems where requesting 100% of the RAM to be accessible by /dev/shm is not an option (as the standard across linux systems is 50%). Thus, in such circumstances, one needs to request double the amount of memory required which is suboptimal for several reasons (e.g. it occupies resources that could otherwise be used by others and limits one to only being able to use large-memory nodes, as is the case for me anyway). I'm wondering whether it would be feasible to implement an option to use fast local storage (e.g. SSD or nvme drives, which for instance is abundant on the hpc cluster here and shouldn't be significantly slower) as opposed to /dev/shm. I'd be very interested to hear the developers' thoughts on this! Thanks, Kelly. — Reply to this email directly, view it on GitHub <#35>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB4RE6RGQXC6DNHOVUMA5Y3VBPBKTANCNFSM5RPVOL2A> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

I should just point out that any disk-based cache system will still require your sysadmin to change the ulimits on number of open files -- the 65k standard limit isn't enough considering how many image planes and visibility chunks need to be stored - I know the HI folks regularly hit this limit with other imaging and cubestacking software, so default setting is not really unique to DDF.

…

On Thu, 24 Mar 2022, 07:02 Benna Hugo, ***@***.***> wrote: Hi My feeling is it would be substantially slower if you want things to go through a disk-based cache (zarr on disk arrays perhaps...) with SSD drives (most of the operations on visibility sets and large image buffers go through shm primarily to avoid the overheads of IPC - this is the primary reason for using shared memory). It is conceivable to put a proxy in front of the shared memory layer to just pass loaded cached objects directly as pickled objects, with the caveat that there will be huge performance and memory implications (each process will inherently have its own memory set, so the memory requirements will scale as NCPU which will probably bound you to a handful of cores) - particularly with Lofar/ska-era map sizes and degridding operations. The other alternative is to perhaps make this go through a ramdisk ( which your cluster admin will need to mount for you), but I haven't really explored the limitations of such a solution, so it may be bound by similar OS limitations as shm. All that said- It is not really needed (or advisable) to request 100%, but can't your sysadmin remount the shared memory to say 80-90% as a middle ground here - the factory imposed 50% defaults on Ubuntu is a rather superficial either way? I'm not sure which is more feasible though. @cyriltasse may have more ideas. On Thu, 24 Mar 2022, 03:36 kgourdji, ***@***.***> wrote: > Hi, > > I've been using DDFacet as part of the ddf-pipeline and have looked into > getting it running on several HPC systems. I'm finding the /dev/shm > requirements to be particularly limiting on shared HPC systems where > requesting 100% of the RAM to be accessible by /dev/shm is not an option > (as the standard across linux systems is 50%). Thus, in such circumstances, > one needs to request double the amount of memory required which is > suboptimal for several reasons (e.g. it occupies resources that could > otherwise be used by others and limits one to only being able to use > large-memory nodes, as is the case for me anyway). I'm wondering whether it > would be feasible to implement an option to use fast local storage (e.g. > SSD or nvme drives, which for instance is abundant on the hpc cluster here > and shouldn't be significantly slower) as opposed to /dev/shm. I'd be very > interested to hear the developers' thoughts on this! > Thanks, Kelly. > > — > Reply to this email directly, view it on GitHub > <#35>, or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AB4RE6RGQXC6DNHOVUMA5Y3VBPBKTANCNFSM5RPVOL2A> > . > You are receiving this because you are subscribed to this thread.Message > ID: ***@***.***> >

Thinking about this a bit more zarr would not be a good solution for a on-disk cache for exactly the reason I highlighted in paragraph 2 - each process would have to read all the visibilities and image buffers and have its own set of memory in the process.

One could possibly replace some of the /dev/shm with mmaps, but there are caviats looking online:

it can only be used on proper file systems - not (e.g. s3) bucket based store solutions
there are again OS-imposed limations on the sizes that can be allocated, e.g. https://stackoverflow.com/questions/2798330/maximum-memory-which-malloc-can-allocate/57687432#57687432
You are bound to a similar 65k named pages limit which can be set in the kernel:
https://www.systutorials.com/maximum-number-of-mmaped-ranges-and-how-to-set-it-on-linux/

So the solution trades one set of linux kernel configuration for another here, so not sure if it is really worth the dev time, but possibly a nice to have when there is bandwidth to implement.

Hi @bennahugo

sysadmin/systems analyst/whatever for one of @kgourdji's clusters here. I'm ex-astro but on the C/MPI/simulation side, not observations. perhaps you can clarify some things for me.

AFAICT DDF is using files in /dev/shm/ to communicate and share data between threads/processes.
do you mmap the files, or simply write/read? its not clear from DDFacet/Other/README-APP.md what your 'attach' and i/o mechanism is.

confusingly, you also talk about a 'ramdisk' above, but /dev/shm is just a ramdisk. it is the same as any other ramdisk/tmpfs in linux.

we (and many HPC clusters) have large local SSD/NVMe's in all compute nodes. Linux aggressively caches all i/o in ram (VFS, page cache etc.) and does not flush it to disk frequently (only when ram runs low, or about every 30s), so putting files onto a pure ramdisk like /dev/shm is not actually always a win. in fact it's often just a 2x waste of ram. ie. linux caches the data in VFS, and a copy also lives in a "file" on /dev/shm too.
consequently, we often find that using a local SSD is just as fast as /dev/shm, with the added bonuses that it doesn't waste 2x ram, and can be much larger in size than any /dev/shm (which, being a ramdisk, is physically limited to the size of ram).
have you experimented with a backing store other than /dev/shm?

I guess if you are constantly writing many GB/s to /dev/shm, then an NVMe/SSD would not be able to keep up, and then using /dev/shm would be fair enough.
do you have an approx i/o rate the code is running at? how about the size of each i/o?

cheers,
robin

@plaguedbypenguins the ramdisk was just a suggested alternative -- it would fulfil the same purpose as /dev/shm (but without the requirement of changing default kernel behaviour if that is the issue)

The sharedarray code is not actually implemented in DDF - we use the shared-array package. Looking at this the blocks are created and mmapped to answer your question https://gitlab.com/tenzing/shared-array/-/blob/master/src/shared_array_create.c

To answer the second question - that is indeed the case - the irregular sampled visibilities (100s of GiB worth) and imaging buffers (typically perhaps few 10s of GiB dep on field of view), are modified in the major cycle of the imaging algorithm (the deconvolved model is subtracted from residuals to start the next round of deconvolution).

Each process, unfortunately, therefore needs a writeable set of memory pages of the same irregular and regular sampled fourier data, which is where the issue comes in -- you want to avoid duplicating those arrays across processes for a many-core machine.

As for IO the raw data (without flags, weights, etc.) coming off the telescope for MeerKAT is given here https://skaafrica.atlassian.net/wiki/spaces/ESDKB/pages/277315585/MeerKAT+specifications#Data-rates, typically after calibration this can be reduced by around a factor of perhaps 3 for channel averaging. The major-cycle inversion steps (forward [regular to irregular sampled] and backward [irregular to regular] fourier transform from this data typically accounts for the vast majority of the runtime since multiple such cycles are done (perhaps up to say 70%-80%). Typically it will take a day or so to deconvolve a large calibrated averaged database (few 100 GiB) on a reasonably new (<4 years) multi-cpu node with a few of these cycles.

As it stands with the shm shared-data implementation in DDF for MeerKAT one cannot go under ~300GiB (with say 90-95% remounted /dev/shm) very easily without taking substantial hits in runtime from finer dataset chunking. If one were to replicate memory across processes and do reductions afterwards I imagine one would not be able to utilize more than say a dosen cores before running out of memory.

As for IO/s operations to and from disk is a bit hard to quantify here -- it is really bound by the current parallel-read limitations of the casacore table system, so it is substantially lower than the theoretical capability of the disks. (edit: they are also done in bursts in these major cycles) More detailed profiling would be needed for the particular use case to quantify - it is heavily dependent on the telescope, correlator mode and field of view - I'm not sure what @kgourdji is working on from the description. I presume it is LOFAR data?

Shared memory is just memory really, just handled slightly differently by the kernel (and has to be used with some care, cos it is not deallocated when the process dies for example). But even in HPC a user could use all the non-shared RAM availalable and forbid others to use it, unless there's a limit. And if there's a limit, then it's the same problem, right?

I'm not an expert, but my feeling is that sysadmins are just afraid of their own computers, and they don't want to do anything that is non standard... :)

About mapping memory onto a disk file it's already a possibility as explained here

The shared memory is identified by name, which can use the file:// prefix to indicate that the data backend will be a file, or shm:// to indicate that the data backend shall be a POSIX shared memory object. For backward compatibility shm:// is assumed when no prefix is given. Most operating systems implement strong file caching so using a file as a data backend won’t usually affect performance.

We've tried that in the past right @o-smirnov ? but we've never really used it. It could be useful in the future if we wanted to work on like big images that don't fit into RAM

@bennahugo thanks for that explanation. I understood none of the observational terms :)

I think I understand that you have python multiprocess firing up threads/tasks that then have mmap'd files into their address space to avoid data replication, and that's how you pass data between the tasks. cool.

I see minimal actual true "shared memory" segments (via ipcs etc.) on the nodes, so I think you are mistaken that you are using large "shared memory" in the Unix way. what you are actually using is mmap'd files into a tmpfs. that tmpfs just happens to be called /dev/shm, and is mounted at the path usually used by real shared memory.

but anyway, as your colleage has ninja'd me - the first thing I see in https://gitlab.com/tenzing/shared-array is ->

SharedArray.create(name, shape, dtype=float)

This function creates an array in shared memory and returns a numpy array that uses the shared memory as data backend.

The shared memory is identified by name, which can use the file:// prefix to indicate that the data backend will be a file, or shm:// to indicate that the data backend shall be a POSIX shared memory object. For backward compatibility shm:// is assumed when no prefix is given. Most operating systems implement strong file caching so using a file as a data backend won't usually affect performance.

(emphasis is mine)
which was what I was trying to convey above, but this says it much more concisely.

have you tried simply using file:// instead of shm:// ?

mmap's work just fine between processes, so all of shared_array and thus DDF would continue to work.
hopefully the only thing that would change is that the mmap'd files would be backed by files on a SSD instead of files using up ram on the /dev/shm/ tmpfs/ramdisk.

if i/o the DDF threads do to the mmap's were sufficiently read-biased, or writes (averaged over 30s) were less than say 1 GB/s then there should be no difference in DDF speed.

also, I'm used to thinking of 100's of cores and MPI codes, not quirky single node things like this, so I'm not up on all the tricks you folks play.

cheers,
robin

Sure we could perhaps swap out the backend to use a pure mmap to another file system. It should not be too much of a change. The limits on number of open files and mmap limits etc. still apply though, so there are some kernel configuration that has to be done.

Thanks it is a useful discussion to have -- one would have to profile to really quantify the differences

I will migrate the issue to the development repo. Perhaps it can be added for the release after next.

@cyriltasse jobs in all HPC are almost always in cgroups (cpuset, memcg, etc.) and cannot use more cores or ram than they have asked for. there are limits everywhere, for good reason. there are 1000's of jobs running on clusters, often dozens on each node. they cannot be allowed interfere with each other, or be able to crash nodes.

HPC is not simple. please do not assume that we do things for no reason.

if it turns out that DDF with file:// to local SSD is significantly slower than shm:// then we can think about the possibility of raising the /dev/shm size from 50%. as we limit the use of ram (== /dev/shm) via memcg anyway, then it may have few implications.

FYI we aren't hitting any other issues with DDF at the moment - open fd's, mmap etc. - not a problem so far.
affinity in DDF is broken 'cos it doesn't know about cpusets, but maybe you already have a ticket open for that.

hmm, @bennahugo "release after next" doesn't sound like soon.

if it helps, we'll let you know what we find hacking the code... doesn't seem that hard to change "shm://" to "file://" to me, but I guess the devil is in the details.

Ah, now that brings back memories. I recall trying to switch (at least parts of) DDF to the file:// backend for SharedArray way back in the past (attractive for all the reasons stated above, plus it could be integrated into our caching system neatly), and not achieving the desired results, whether due to lack of competence, or just insufficient persistence. This thread may be a good motivator to rekindle those experiments. Now if only we could persuade @cyriltasse to clean up his shm use in SSD2 so that everything goes through single SharedDict and doesn't allocate randomly named segments scattergun, that would make it a lot easier to experiment with a centralized solution!

Now if only we could persuade @cyriltasse to clean up his shm use in SSD2 so that everything goes through single SharedDict and doesn't allocate randomly named segments scattergun, that would make it a lot easier to experiment with a centralized solution!

Yeah we just discussed that with @bennahugo yesterday - I'll try to do that ASAP... Now trying to merge the non square images/cell into master...

oh, wait. you are already using file:// for the shared_dict arrays ->

DDFacet/DDFacet/Array/shared_dict.py

Line 25 in 4cbece5

return "file://" + path

maybe that's where some of the confusion comes from.

it's shm:// here though ->

DDFacet/DDFacet/Other/Multiprocessing.py

Line 118 in 4cbece5

return "shm://" + getShmName(name, **kw)

so (I'm guessing) most of the big stuff is mmap'd file:// i/o and some IPC's via shm:// ?

oh, wait. you are already using file:// for the shared_dict arrays

Ah yes indeed, I'd forgotten about that little hack. See comments directly above for the motivation.

So the issue was, as I last remember it before abandoning investigation, using file:///dev/shm/foo for shared_dict was a lot faster than using file:///some/local/fs/foo. I had naively expected Linux disk caching to mitigate any differences, but it didn't... If you have any tips on how to solve it, I'd be happy to give it another crack.

it's shm:// here though ->

That's the old and (soon to be hopefully) deprecated way.

Thanks all for the insightful discussion. Indeed, I'm processing an 8hr LOFAR HBA dataset.

Customize shared memory filesystem root