ratt-ru / dask-ms

Implementation of a dask/xarray dataset backed by a CASA MS

Home Page:https://dask-ms.readthedocs.io

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

`dask-ms convert` from MS to Parquet is very slow

JSKenyon opened this issue · comments

  • dask-ms version: master
  • Python version: 3.8.10
  • Operating System: Ubuntu 20.04

Description

dask-ms convert -o test.parquet -f parquet test.ms can be incredibly slow, even on small datasets. The reason for this is that some of the subtables may have thousands of rows and all subtables are grouped by __row__. This can lead to thousands of datasets and very poor performance (hours+ for a few GB dataset).

The specific subtables which I have noticed cause a problem (but this isn't guaranteed to be a complete list) are HISTORY and POINTING. I do not believe that either of these need to be grouped by __row__, as I believe they should not have variable shapes.

Verifying the slow behaviour is simple. I invoked simms with the following command:

simms -dec 30d0m0s -ra 0h0m0s -T vla -dt 2 -st 2 -nc 64 -f0 1712MHz -df 208kHz -pl RR RL LR LL -n test.ms

and then ran the conversion command above.

I implemented a rudimentary patch to differentiate between uniform and non-uniform subtables (just using their names), and this seemed to fix the problem. I hoped to have a discussion on whether or not we can safely assume subtables like those mentioned above are exempt from ordering by __row__.

I am going to posit a theory - only subtables which include a data description field in this memo need to be grouped by __row__. This essentially includes all subtables which self-describe their shapes i.e. may have variable shapes per row. If I am correct, the "nonuniform" subtables would be:

  • SPECTRAL_WINDOW
  • POLARIZATION
  • FEED
  • SOURCE

These tables shouldn't have a huge number of rows i.e. grouping by __row__ shouldn't be a problem.

This should be fixed by #191. There is likely still room for improvement but at least parquet conversion happens in a reasonable amount of time now.