UCBerkeleySETI / blimpy

Breakthrough Listen I/O Methods for Python

Home Page: https://blimpy.readthedocs.io

FIL & H5 readers need to automate data load requirements

texadactyl opened this issue

When a .fil file or an .h5 file is read-accessed by blimpy, there are two possible outcomes:

  • Data array size <= 1 GB: both the header and the data are loaded.
  • Data array size > 1 GB: the header is loaded but the data is not, and a warning is displayed: "blimpy.io.base_reader WARNING Selection size of n.00 GB, exceeding our size limit 1.00 GB. Instance created, header loaded, but data not loaded, please try another (t,v) selection."

So, even if the data array size requirement is only 1.1 GB on a 32 GB system, the data is not loaded and the warning appears. The new blimpy calcload tool does produce the max_load value needed, but that is a manual band-aid.
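
For context, this is roughly what the manual workaround looks like today. A minimal sketch, assuming blimpy.calcload exposes a calc_max_load() helper and that Waterfall accepts a max_load argument in GB (exact names may differ):

```python
# Manual band-aid in use today (sketch): compute the load, then pass it in.
from blimpy import Waterfall
from blimpy.calcload import calc_max_load  # assumed helper; name may differ

h5_path = "example.h5"                     # hypothetical input file

gb_needed = calc_max_load(h5_path)         # size of the data array in GB
wf = Waterfall(h5_path, max_load=gb_needed)  # header *and* data are loaded
print(wf.data.shape)
```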

Suggested enhancement:

  • If the data size required exceeds (available RAM minus 1 GB), raise an Exception (with clear, concise message text) and a traceback.
  • If the data size required is less than 50% of available RAM, proceed without a warning.
  • If the data size required is more than 50% of available RAM but less than available RAM, proceed with a warning (a rough sketch of this policy follows below).

So, on a system with 12 GB of RAM available,

  • data size required >= 11 GB ---> Exception with traceback.
  • data size required = 4 GB ---> proceed without warning.
  • data size required = 7 GB ---> proceed with warning.

Is 50% a reasonable threshold? It was a WAG on my part (not even scientific).
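
A minimal sketch of the proposed policy (not blimpy's actual implementation), using psutil.virtual_memory().available for "available RAM"; the 50% threshold and the 1 GB margin are the figures proposed above:

```python
# Sketch of the proposed check: raise, warn, or proceed silently.
import warnings
import psutil

GB = 1024 ** 3

def check_data_load(gb_needed: float) -> None:
    """Decide whether a data load of `gb_needed` GB is safe to attempt."""
    avail_gb = psutil.virtual_memory().available / GB
    if gb_needed > avail_gb - 1.0:
        raise MemoryError(
            f"Data load of {gb_needed:.2f} GB exceeds available RAM "
            f"({avail_gb:.2f} GB) minus a 1 GB safety margin."
        )
    if gb_needed > 0.5 * avail_gb:
        warnings.warn(
            f"Data load of {gb_needed:.2f} GB uses more than half of the "
            f"{avail_gb:.2f} GB of RAM currently available."
        )
    # Below 50% of available RAM: proceed without a warning.
```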

I don't know whether blimpy reads into a caller-allocated array or allocates an array and reads into it. It sounds like it allocates an array and reads into it. In that case, I would vote for limiting the size of the array to something like as many spectra as will fit in 4 GB, or one complete spectrum, whichever is greater. To get around this limitation, if desired and well thought through, the caller could pre-allocate a larger array of the desired size and then use blimpy to read into the caller-allocated array (obviously this capability would have to be added if it doesn't already exist). I think no matter what limits you put in place, users will still want to "just read everything into memory". If we want to put a hard limit on how much memory a user may use, we could enforce ulimit limits, but that seems a bit draconian.
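
Purely to illustrate the caller-allocated-buffer idea, here is a hypothetical sketch; blimpy does not currently offer a read_into() method, and the name, signature, and dimensions are all assumptions:

```python
# Hypothetical caller-allocated read (read_into() does not exist in blimpy).
import numpy as np

n_spectra, n_ifs, n_chans = 16, 1, 1048576        # example dimensions
buf = np.empty((n_spectra, n_ifs, n_chans), dtype=np.float32)

# Hypothetical call: blimpy would fill `buf` in place instead of allocating:
# wf.read_into(buf, t_start=0, t_stop=n_spectra)
```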

@david-macmahon Thanks.

Blimpy already allows the caller to specify max_load as a number of GB. I would like some automation to replace the manual step of running calcload and then plugging that number into the max_load parameter in your source code for EACH file the code will operate on. My calcload utility is better than what I found a month ago (nothing; just guess), but we could do better.

We also need to guard against an unreasonable automated calculation or max_load specification that would otherwise ask for too much RAM, resulting in a "killed" process or some bizarre Python exception.

The 50% number might be reasonable, but that is 50% of total RAM rather than free RAM. Placing the threshold there might not achieve what we intend, which is protecting against allocating too much RAM.

I would then be inclined to say it's the users' job to ensure they are doing the correct thing. That's how I would write my own software. I'd raise a warning saying "FYI, we're allocating a lot of memory, things might crash", and move on with the code.

@wfarah Thanks.

We have to be careful on multi-user systems. While it is the application programmer's job to properly specify requirements, it is also incumbent upon us to make sure that we do not inadvertently destabilize a system.

We do not know how long the used memory is reserved. There could be a long-running job (e.g. turbo_seti), so we cannot count on any of the memory between free and total becoming available in the near future. This is conservative, but it is safe.

I am not sure about "50%" but we could start with that figure since this is not an error message; execution continues.

@david-macmahon

Unfortunately, blimpy reads either all of the data or none of it (header only); this behaviour shows up throughout the code. Fortunately, this does not bother turbo_seti FindDoppler search() operations, as they rely only on the h5 header from blimpy and read the h5 data on their own in a subsetting manner (find_doppler/data_handler.py).
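
For reference, a minimal sketch of subsetting reads with h5py, in the spirit of what find_doppler/data_handler.py does (reading coarse-channel chunks); the dataset name "data" and the channel bounds are assumptions:

```python
# Read only a slice of the HDF5 data, rather than the whole array.
import h5py

with h5py.File("example.h5", "r") as f:
    dset = f["data"]              # shape roughly (n_time, n_ifs, n_chans)
    lo, hi = 0, 1048576           # fine channels of one coarse channel
    block = dset[:, :, lo:hi]     # only this slice is read from disk
    print(block.shape)
```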

On the other hand, the find_event and plot_event routines get into trouble with large data arrays because matplotlib insists on having all of the data in memory for plotting (no caching permitted).

The only limit I would like to impose is stopping someone from loading more than N-1 GB of data when N GB of RAM are free.

When blimpy is used as a library in other systems like turbo_seti, I think the behavior should be either to fail and raise an exception, or to succeed. The warning isn't user-actionable, and its text is inaccurate when the calling library is already handling the low-memory case somehow. That is also how standard Python file tooling works: it either succeeds or raises an exception; it doesn't print warnings when it is close to failing. When blimpy is called from the command line, this sort of warning makes sense.

There are people using blimpy on their laptops, as well as on HPC servers, and we have amateur to guru users. So I think @texadactyl's suggestion of 50% is good, perhaps with a little flexibility (e.g. allow +/- 5% if a full time integration can fit in, as @david-macmahon is suggesting).

However ...

For filterbank files, I would like to use numpy.memmap so the data appears as a numpy array but isn't loaded into memory until needed. The issue with this is supporting 2-bit and 4-bit data, which is still common in the pulsar community. (This discussion is making me lean more toward memmap over low-bitwidth support!)
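
A minimal sketch of that memmap idea, assuming 32-bit float samples and a known header length; this is not blimpy's current behaviour, and it would not work for 2-bit or 4-bit data without extra unpacking:

```python
# Lazy filterbank access via numpy.memmap: data stays on disk until sliced.
import numpy as np

header_bytes = 388                 # assumed header size for this file
n_ifs, n_chans = 1, 1048576        # assumed dimensions

mm = np.memmap("example.fil", dtype=np.float32, mode="r", offset=header_bytes)
data = mm.reshape(-1, n_ifs, n_chans)     # (n_time, n_ifs, n_chans) view
spectrum = np.array(data[0, 0, :])        # only this spectrum is read from disk
```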

An alternative I'd like to explore is xarray, which integrates with dask for out-of-core operations.
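
In the same spirit, a minimal sketch of wrapping a memmapped filterbank array with dask and xarray; the chunk sizes, dimension names, and file layout are assumptions:

```python
# Out-of-core reductions with dask + xarray over a memmapped filterbank file.
import numpy as np
import dask.array as da
import xarray as xr

raw = np.memmap("example.fil", dtype=np.float32, mode="r", offset=388)
data = raw.reshape(-1, 1, 1048576)                    # (time, feed, frequency)

lazy = da.from_array(data, chunks=(64, 1, 65536))     # chunked, not loaded
cube = xr.DataArray(lazy, dims=("time", "feed", "frequency"))

mean_spectrum = cube.mean(dim="time").compute()       # evaluated chunk by chunk
```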

I'd earmark both approaches as 'experimental' and for a future major release. For now, I support the approach @texadactyl is suggesting!

@lacker
It is common in projects to have several large files to process in succession. People are not always aware of how heavily they are taxing data centre compute nodes. The purpose of giving an informative warning (not the current message!) to someone who is approaching the edge is so that they can consider pausing or replanning their batch runs. It might also be useful information when consulting with @mattlebofsky on how best to reschedule or possibly switch to a different data centre compute node. Note that our data centre home directories and large data areas are available on multiple nodes.

How about just making the warning message something that will still be accurate when blimpy is called as a library? For example, instead of "data not loaded, try another (t, v) pair", it could say "warning: allocating {x} memory while loading file {fname}, only {y} memory still available."

@telegraphic Thanks for the review. Hopefully, hyperSETI will do a better job of caching (especially during plotting, somehow). Do I need to invest in some NVIDIA hardware at home? (-:

This proposal will not fix blimpy's inability to cache. It will simply (1) automate the calculation of the max_load value on the fly and (2) warn the caller or operator (as the case may be) that RAM is currently getting tight. Call it a band-aid if you like.

Regarding caching, application-level caching in blimpy would, IMO, require a rewrite of the entire io subfolder and probably more elsewhere in blimpy. SETI BL budget questions would arise:

  • How long until blimpy is retired?
  • What projects would be impacted by a significant rewrite?
  • Are the requirements for a rewrite simply our collective view that it makes sense?
  • How much further investment in blimpy is justified?

Turbo_seti is already doing its own HDF5 read caching in the find_doppler/data_handler.py module (splitting by coarse channel). Unfortunately, find_event/plot_event_pipeline.py (On/Off or complex cadence file handling) suffers from matplotlib requiring the entire data array to be present in contiguous RAM. Pity, that.

Regarding multiple polarisation support (absent in both blimpy and turbo_seti), more SETI BL budget questions arise: cost versus benefit.

@siemion @stevecroft too. Need a software planning exercise? I would recommend it. Already under way? Pardon my lack of awareness.

@lacker That is precisely what I had in mind. The warning would actually be a report of the max_load requested PLUS the current values from psutil.virtual_memory() PLUS some analysis. You read my mind!
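
A minimal sketch of that kind of report, combining the requested max_load with psutil.virtual_memory() figures; the wording and helper name are illustrative only:

```python
# Build an informative memory report for the warning/exception text.
import psutil

GB = 1024 ** 3

def memory_report(max_load_gb: float, fname: str) -> str:
    vm = psutil.virtual_memory()
    return (
        f"Loading {fname}: requesting {max_load_gb:.2f} GB "
        f"(total {vm.total / GB:.2f} GB, available {vm.available / GB:.2f} GB, "
        f"{vm.percent:.0f}% in use)."
    )

print(memory_report(7.0, "example.h5"))
```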

Mr. @lacker has taken the bait.

Answering some of @texadactyl's questions:

  • How long until blimpy is retired? No plans to retire it. But I think this year we should be shifting from development to maintenance and improving documentation. Plus, demand-driven development, e.g. if there is a new downstream project that motivates a change upstream in blimpy.
  • What projects would be impacted by a significant rewrite? Main downstream project is turboseti, but blimpy is used for a variety of things. (I'm of the same opinion that turboseti shouldn't be retired, but we should be shifting to maintenance.)
  • Are the requirements for a rewrite simply our collective view that it makes sense? Yes and no -- I guess this should be driven by science requirements.
  • How much further investment in blimpy is justified? It will remain a useful tool, so I think we should continue to support it over the course of Breakthrough Listen (next 5 years).
  • Need a software planning exercise? I would recommend it. Haha we certainly do! I wonder whether there will be a chance for an in-person meeting this year...