`dask-ms convert` from MS to Parquet is very slow
JSKenyon opened this issue
- dask-ms version: master
- Python version: 3.8.10
- Operating System: Ubuntu 20.04
Description
`dask-ms convert -o test.parquet -f parquet test.ms` can be incredibly slow, even on small datasets. The reason is that some subtables may have thousands of rows and all subtables are grouped by `__row__`. This can lead to thousands of datasets and very poor performance (hours or more for a dataset of a few GB).
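To make the cost concrete, here is a minimal sketch of what `__row__` grouping means at read time, assuming the `test.ms` from the command above; it is illustrative, not `dask-ms convert`'s internals:

```python
from daskms import xds_from_table

# Grouping a subtable by the special __row__ column yields one
# dataset per row, so a POINTING table with thousands of rows
# produces thousands of tiny datasets, each written out separately.
datasets = xds_from_table("test.ms::POINTING", group_cols=["__row__"])
print(len(datasets))  # equals the number of rows in POINTING
```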
The specific subtables which I have noticed cause a problem (though this isn't guaranteed to be a complete list) are `HISTORY` and `POINTING`. I do not believe that either of these needs to be grouped by `__row__`, as their rows should not have variable shapes.
Verifying the slow behaviour is simple. I invoked `simms` with the following command:

```
simms -dec 30d0m0s -ra 0h0m0s -T vla -dt 2 -st 2 -nc 64 -f0 1712MHz -df 208kHz -pl RR RL LR LL -n test.ms
```
and then ran the conversion command above.
I implemented a rudimentary patch to differentiate between uniform and non-uniform subtables (just using their names), and this seemed to fix the problem. I was hoping to have a discussion on whether or not we can safely assume that subtables like those mentioned above are exempt from ordering by `__row__`; a sketch of the name-based approach follows.
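Here is a minimal sketch of that workaround, assuming a helper consulted when opening each subtable; the helper name and the set of uniform subtables are hypothetical, not dask-ms' actual code:

```python
# Hypothetical helper: uniform subtables are read as a single
# dataset, while everything else keeps the conservative per-row
# grouping.
UNIFORM_SUBTABLES = {"HISTORY", "POINTING"}

def subtable_group_cols(name):
    # Retain __row__ grouping only for subtables whose rows may
    # have differing shapes.
    return [] if name in UNIFORM_SUBTABLES else ["__row__"]
```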
I am going to posit a theory: only subtables which include a data description field in this memo need to be grouped by `__row__`. This essentially covers all subtables which self-describe their shapes, i.e. which may have variable shapes per row. If I am correct, the "non-uniform" subtables would be:
- `SPECTRAL_WINDOW`
- `POLARIZATION`
- `FEED`
- `SOURCE`
These tables shouldn't have a huge number of rows, so grouping them by `__row__` shouldn't be a problem.
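Under that theory, reads could look like the following sketch (subtable paths assume the `test.ms` from above; this is the proposal, not `dask-ms convert`'s current behaviour):

```python
from daskms import xds_from_table

# A uniform subtable such as POINTING would be read as a single
# dataset rather than one dataset per row.
(pointing,) = xds_from_table("test.ms::POINTING")

# A self-describing subtable such as SPECTRAL_WINDOW would keep
# the conservative per-row grouping, since its row shapes may differ.
spw_datasets = xds_from_table("test.ms::SPECTRAL_WINDOW",
                              group_cols=["__row__"])
```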