ratt-ru / dask-ms

Implementation of a dask/xarray dataset backed by a CASA MS

Home Page: https://dask-ms.readthedocs.io

By default xds_to_table will create columns with standard (instead of tiled) storage managers

landmanbester opened this issue · comments

  • dask-ms version: master branch
  • Python version: 3.9
  • Operating System: Ubuntu 20.04

Description

When xds_to_table writes a column that does not yet exist in a measurement set, the new column ends up with a standard storage manager associated with it. This can result in suboptimal performance when writing columns with channel and/or correlation axes.
dask-ms should detect this case and use a tiled storage manager instead.
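
To see which storage manager a column ended up with, one option is to look at the table's data manager info. The following is a minimal sketch, not dask-ms code: the helper names and the hand-written example_dminfo dictionary are hypothetical, and the dictionary layout (groups keyed '*1', '*2', ... with NAME, TYPE and COLUMNS entries) is assumed to match what python-casacore's table.getdminfo() returns for a real MS.

```python
# casacore data manager TYPE names that denote tiled storage.
TILED_TYPES = {
    "TiledShapeStMan",
    "TiledColumnStMan",
    "TiledCellStMan",
    "TiledDataStMan",
}


def column_storage_managers(dminfo):
    """Map each column name to the TYPE of the data manager holding it."""
    managers = {}
    for group in dminfo.values():
        for column in group.get("COLUMNS", []):
            managers[column] = group["TYPE"]
    return managers


def is_tiled(dminfo, column):
    """Return True if `column` is stored by a tiled storage manager."""
    return column_storage_managers(dminfo).get(column) in TILED_TYPES


# Hand-written illustration only; a real dictionary would come from the MS.
example_dminfo = {
    "*1": {
        "NAME": "StandardStMan",
        "TYPE": "StandardStMan",
        "SEQNR": 0,
        "COLUMNS": ["TIME", "TESTCOL"],
    },
    "*2": {
        "NAME": "TiledData",
        "TYPE": "TiledShapeStMan",
        "SEQNR": 1,
        "COLUMNS": ["DATA"],
    },
}

for name in ("DATA", "TESTCOL"):
    print(name, "tiled" if is_tiled(example_dminfo, name) else "not tiled")
```

In this made-up example TESTCOL shares the standard storage manager with TIME, which is the situation the issue describes for newly created columns.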

Closed via #196. Will leave it in master for you to test out @landmanbester. Reopen if you have trouble.

I just tried to test this locally with the following code snippet:

import numpy as np
import dask
import dask.array as da
from daskms import xds_from_ms, xds_to_table

ms_name = "/home/landman/testing/pfb/MS/point_gauss_nb.MS_p0/"
xds = xds_from_ms(ms_name,
                  chunks={'row':10000})

test_data = da.zeros(xds[0].DATA.shape,
                     chunks=xds[0].DATA.chunks,
                     dtype=np.complex64)

xdso = []
for ds in xds:
    dso = ds.assign(**{'TESTCOL':(('row', 'chan', 'corr'), test_data)})
    xdso.append(dso)

writes = xds_to_table(xdso, ms_name, columns='TESTCOL')

dask.compute(writes)

and it falls over with

Traceback (most recent call last):
  File "/home/landman/software/scratch/test_xds_to_table.py", line 19, in <module>
    writes = xds_to_table(xdso, ms_name, columns='TESTCOL')
  File "/home/landman/software/dask-ms/daskms/dask_ms.py", line 88, in xds_to_table
    out_ds = write_datasets(table_name, xds, columns,
  File "/home/landman/software/dask-ms/daskms/writes.py", line 718, in write_datasets
    tp = _updated_table(table, datasets, columns, descriptor)
  File "/home/landman/software/dask-ms/daskms/writes.py", line 324, in _updated_table
    _dminfo = {} if _dminfo['*1']['NAME'] in odminfo else _dminfo
KeyError: '*1'

Not sure if I am doing something wrong?

Ah ok - I know what I did wrong. Reopening for now.

Weirdly, I cannot reproduce @landmanbester. If you change TESTCOL to DIFFTESTCOL, do you get the same error?

Same error. As discussed, I also tried on an actual (as opposed to simulated) MeerKAT MS and didn't get the same issue. The MS I was using originally was produced by makems, so maybe something is going wrong there.

The issue is the suffix on your MS. dask-ms has to work out where a column lives when building descriptors, and a name like name.MS_p0 confuses the descriptor builder. If you are happy with the explanation (and the fix, i.e. just don't change the suffix), go ahead and close again.
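
The workaround can be sketched as a rename that drops whatever follows the .MS extension. This is an illustration under assumptions: plain_ms_name is a hypothetical helper, not part of dask-ms, and it assumes that a directory name ending in plain .MS (or .ms) is what the descriptor builder expects. The demo runs against a throwaway temporary directory rather than a real measurement set.

```python
import re
import tempfile
from pathlib import Path


def plain_ms_name(ms_path):
    """Return the path with any '_xyz' tail after '.MS'/'.ms' removed."""
    path = Path(ms_path)
    # "point_gauss_nb.MS_p0" -> "point_gauss_nb.MS"; other names are unchanged.
    new_name = re.sub(r"(\.ms)_\w+$", r"\1", path.name, flags=re.IGNORECASE)
    return path.with_name(new_name)


# Demonstrate the rename on an empty stand-in directory.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "point_gauss_nb.MS_p0"
    old.mkdir()
    new = plain_ms_name(old)
    old.rename(new)
    renamed_ok = new.is_dir() and not old.exists()

print(plain_ms_name("/data/point_gauss_nb.MS_p0/").name)
```

After renaming, the reproduction script above would need ms_name updated to the new path.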

Ah ok, thanks @JSKenyon. The suffix inherited from makems never made sense to me. Not sure if it has some kind of significance, but I'm happy to just change it.