What's the proper way to read out MeerKAT visibilities?
astroJyWang opened this issue
@ludwigschwardt
If I write

`vis = data.vis`

it's quick, but if I write

`vis = data.vis[:, :, :128]`

it takes a very long time.
The slowness seems to be caused either by my limited network speed (10 MB/s) or by an inefficient readout pattern (e.g., perhaps contiguous reads would be much faster?).
So what is the proper way to read out a segment of the visibility data? (In my case I'm interested in the auto-correlation data.)
There is a big difference between `data.vis` and `data.vis[:]`. The first is a lazy representation of your data, not the data itself. The second fetches the actual data. That's why the latter is slow :-)
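The lazy-versus-eager distinction can be illustrated with a minimal stand-in class (this is an illustration of the idea only, not katdal's actual implementation):

```python
# Minimal stand-in for a lazy array: creating or aliasing it is free;
# only indexing triggers the (expensive, network-bound) data fetch.
class LazyVis:
    """Pretends to hold visibilities; 'downloads' only when indexed."""

    def __init__(self, shape):
        self.shape = shape
        self.fetches = 0  # counts how many times data was actually fetched

    def __getitem__(self, index):
        # In katdal, this is where chunks are pulled over the network.
        self.fetches += 1
        return f"fetched {index!r} from array of shape {self.shape}"

vis = LazyVis((1000, 4096, 2016))

alias = vis              # instant: no data is transferred
assert vis.fetches == 0
data = vis[:, :, :128]   # this line is what triggers the slow fetch
assert vis.fetches == 1
```

So `vis = data.vis` only binds a name to the lazy object, while any indexing (`[:]`, `[:, :, :128]`, ...) starts the actual download.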
The main issue with downloads is the chunking scheme. The data is split into smallish chunks (about 10 MB is typical): first in time (every dump has its own set of chunks) and then in frequency (into contiguous parts of the band). The baseline axis is not split.
This means it is very inefficient to fetch the data for a single antenna (or just the autocorrelations, for example): every chunk you download contains all the baselines. This layout is optimised for imaging, but maybe not for what you are doing.
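A back-of-the-envelope calculation (with illustrative, not official, array dimensions) shows why autos-only reads are expensive under this chunking:

```python
# Illustrative numbers only; real MeerKAT dimensions and chunk sizes vary.
n_ants = 64
n_baselines = n_ants * (n_ants + 1) // 2            # 2080, incl. autos
n_chans = 4096
bytes_per_vis = 8                                    # complex64

# Chunks split time and frequency but never baselines, so fetching just
# the 64 autocorrelations for one dump still downloads the full baseline
# axis of every chunk covering the requested channels.
wanted = n_ants * n_chans * bytes_per_vis            # autos only
downloaded = n_baselines * n_chans * bytes_per_vis   # whole baseline axis

print(f"wanted: {wanted / 1e6:.1f} MB")
print(f"downloaded: {downloaded / 1e6:.1f} MB")
print(f"overhead factor: {downloaded / wanted:.1f}x")
```

With these assumed numbers, wanting ~2 MB of autos forces a ~68 MB download per dump, an overhead of roughly 32x, which is consistent with the "you always get all the baselines" point above.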
One tip is not to loop over the antennas when downloading data; avoiding that can speed things up by a factor of 64. In your case it seems you are already doing this, based on the `[..., :128]` part. Also, do as much selection up front as possible (especially in time) to speed things up: throw out slews and the parts of the band you don't need.
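Assuming katdal's documented `select` keywords (`corrprods`, `scans`, `channels`), up-front selection might look like this sketch (it will not run without katdal and a real dataset URL, so treat it as illustrative):

```python
import katdal

data = katdal.open('1234567890_sdp_l0.full.rdb')  # hypothetical dataset
# Select before touching .vis: only autocorrelations, only tracking
# scans, only the first 128 channels.
data.select(corrprods='auto', scans='track', channels=slice(0, 128))
vis = data.vis[:]   # one bulk fetch of just the selected view
```

Note that selecting `corrprods='auto'` cannot avoid downloading the full-baseline chunks (the chunking makes that unavoidable), but the time and channel selections do reduce the number of chunks fetched.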
For high-speed processing you really need to be running your reductions in the CHPC though.
So, if we move the native data to IDIA and do the analysis there, can the chunk problem be avoided?
Also, why did MeerKAT choose RDB instead of HDF5? Thanks.
Possibly, if you have fast access to the distributed filesystem where the data will be stored.
We had to move away from a single file because it would be too large for the datasets we expect, hence the chunks. We then chose RDB for metadata because it is closer to our online representation (a Redis database).
Slightly tangentially: how might one write out a .ms file** from a `katdal.VisibilityDataV4` object that has had a select operation applied to it? i.e.:
```python
d = katdal.open('1599998888_sdp_l0.full.rdb')
d.select(timerange=('2019-03-05 22:00:00', '2019-03-05 23:00:00'),
         freqrange=(1400e6, 1420e6), scans='track')
d.toms('msname.ms')  # !??
```
** Why would I want to do this? ILUFU pipelines currently need .ms file inputs, and I imagine other things might too.
*** I know that mvftoms.py exists, and I can use it. However, this route is not ideal, since it does not offer the full range of select operations and, I imagine, is similarly unoptimised.
That is a long-standing dream of mine :-) The short-term solution is to hack select into your own copy of mvftoms.py.
There is some thought required on how to handle calibration and averaging (pre or post selection), but I think this is a worthy pursuit. Maybe make a ticket :-)
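In the meantime, a hacked mvftoms-style flow would apply the selection first and then iterate over the reduced dataset. A rough sketch, assuming katdal's `scans()` iterator (not runnable without katdal and the actual file; the MS-writing step is left as a placeholder):

```python
import katdal

d = katdal.open('1599998888_sdp_l0.full.rdb')
d.select(timerange=('2019-03-05 22:00:00', '2019-03-05 23:00:00'),
         freqrange=(1400e6, 1420e6), scans='track')

# Anything iterating over the dataset now sees only the selected view,
# so an MS writer placed inside this loop would write just the selection.
for scan_index, state, target in d.scans():
    vis = d.vis[:]   # (time, chan, corrprod) for this scan only
    # ... write vis/flags/weights for this scan into the MS here ...
```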