Silently fails when providing incorrect schema for WEIGHT column

landmanbester opened this issue 2 years ago · comments

dask-ms version: 0.2.14
Python version: 3.8
Operating System: Ubuntu 20.04

Description

I just noticed that xds_from_ms silently fails to add the WEIGHT column to the dataset if the 'dims' entry in the provided schema is not a tuple. Strangely, it also overwrites the dimension names of known columns like FLAG and DATA, regardless of whether 'dims' is a tuple or not. Everything works as expected when no schema is provided.

What I Did

Running

from daskms import xds_from_ms
schema = {}
schema['WEIGHT'] = {'dims': ('corr')}  # note the mistake here: ('corr') is just the string 'corr'; 'dims' should be the tuple ('corr',)
xds = xds_from_ms('path/to/data.ms', columns=('DATA','WEIGHT','FLAG'), chunks={'row':-1, 'chan':8}, table_schema=schema)

will produce

In [38]: xds
Out[38]:
[<xarray.Dataset>
 Dimensions:  (row: 758160, FLAG-1: 32, FLAG-2: 4, DATA-1: 32, DATA-2: 4)
 Coordinates:
     ROWID    (row) int32 dask.array<chunksize=(758160,), meta=np.ndarray>
 Dimensions without coordinates: row, FLAG-1, FLAG-2, DATA-1, DATA-2
 Data variables:
     FLAG     (row, FLAG-1, FLAG-2) bool dask.array<chunksize=(758160, 32, 4), meta=np.ndarray>
     DATA     (row, DATA-1, DATA-2) complex64 dask.array<chunksize=(758160, 32, 4), meta=np.ndarray>
 Attributes:
     __daskms_partition_schema__:  (('FIELD_ID', 'int32'), ('DATA_DESC_ID', 'i...
     FIELD_ID:                     0
     DATA_DESC_ID:                 0]

Note that WEIGHT is missing and that DATA and FLAG have generic dimension names. When 'dims' in the schema is a proper tuple, i.e.

schema['WEIGHT'] = {'dims': ('corr',)}
xds = xds_from_ms('path/to/data.ms', columns=('DATA','WEIGHT','FLAG'), chunks={'row':-1, 'chan':8}, table_schema=schema)

we get

In [41]: xds
Out[41]:
[<xarray.Dataset>
 Dimensions:  (row: 758160, FLAG-1: 32, FLAG-2: 4, corr: 4, DATA-1: 32, DATA-2: 4)
 Coordinates:
     ROWID    (row) int32 dask.array<chunksize=(758160,), meta=np.ndarray>
 Dimensions without coordinates: row, FLAG-1, FLAG-2, corr, DATA-1, DATA-2
 Data variables:
     FLAG     (row, FLAG-1, FLAG-2) bool dask.array<chunksize=(758160, 32, 4), meta=np.ndarray>
     WEIGHT   (row, corr) float32 dask.array<chunksize=(758160, 4), meta=np.ndarray>
     DATA     (row, DATA-1, DATA-2) complex64 dask.array<chunksize=(758160, 32, 4), meta=np.ndarray>
 Attributes:
     __daskms_partition_schema__:  (('FIELD_ID', 'int32'), ('DATA_DESC_ID', 'i...
     FIELD_ID:                     0
     DATA_DESC_ID:                 0]

Now WEIGHT is there but DATA and FLAG still have the wrong dimension names. If no schema is given, we get

In [43]: xds
Out[43]:
[<xarray.Dataset>
 Dimensions:  (row: 758160, chan: 32, corr: 4)
 Coordinates:
     ROWID    (row) int32 dask.array<chunksize=(758160,), meta=np.ndarray>
 Dimensions without coordinates: row, chan, corr
 Data variables:
     FLAG     (row, chan, corr) bool dask.array<chunksize=(758160, 8, 4), meta=np.ndarray>
     WEIGHT   (row, corr) float32 dask.array<chunksize=(758160, 4), meta=np.ndarray>
     DATA     (row, chan, corr) complex64 dask.array<chunksize=(758160, 8, 4), meta=np.ndarray>
 Attributes:
     __daskms_partition_schema__:  (('FIELD_ID', 'int32'), ('DATA_DESC_ID', 'i...
     FIELD_ID:                     0
     DATA_DESC_ID:                 0]

which has everything as expected. This is fairly low priority but I thought I would report it anyway. dask-ms should either throw an error if 'dims' is not a tuple or just convert it to a tuple. Either way, the dimension names of known columns should not be altered.
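To illustrate the "convert or raise" suggestion, here is a minimal sketch of the kind of check meant. The helper name `normalise_dims` is hypothetical and is not part of dask-ms; where such a check would belong inside dask-ms is also an assumption.

```python
def normalise_dims(column, dims):
    """Return dims as a tuple of strings, or raise a clear error.

    Hypothetical helper for illustration only; not a dask-ms function.
    """
    # A bare string such as ('corr') is treated as a single dimension
    # name instead of being iterated character by character.
    if isinstance(dims, str):
        return (dims,)

    try:
        dims = tuple(dims)
    except TypeError:
        raise TypeError(
            f"'dims' for column {column!r} must be a tuple of strings, "
            f"got {type(dims).__name__}"
        ) from None

    if not all(isinstance(d, str) for d in dims):
        raise TypeError(
            f"'dims' for column {column!r} must contain only strings"
        )

    return dims


# The schema from the report, with the bare-string mistake.
schema = {'WEIGHT': {'dims': ('corr')}}
fixed = {col: {**spec, 'dims': normalise_dims(col, spec['dims'])}
         for col, spec in schema.items()}
print(fixed)
```

With a check like this, the mistaken schema would either be repaired or rejected with an explicit message, instead of the column silently disappearing.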

Simon Perkins · Answer 1 · Mon Oct 31 2022 17:42:30 GMT+0800 (China Standard Time)

Thanks for the very thorough reproducer @landmanbester.

Without digging into the code in too much detail, I'd speculate that ("corr") evaluates to the plain string "corr", which is then treated as an Iterable, so the dims end up being evaluated as ("c", "o", "r", "r"). I'd need to dig more to understand why FLAG and DATA aren't getting assigned the default dimension names in this case.
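The Python behaviour behind that speculation can be checked in isolation. This snippet only shows how a parenthesised string behaves in plain Python; it does not show what dask-ms does internally.

```python
# Parentheses alone do not make a tuple; the trailing comma does.
not_a_tuple = ('corr')
one_tuple = ('corr',)

print(type(not_a_tuple).__name__)  # str
print(type(one_tuple).__name__)    # tuple

# A string is iterable, so converting it to a tuple yields its characters.
dims_from_str = tuple(not_a_tuple)
print(dims_from_str)  # ('c', 'o', 'r', 'r')
```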

Probably related:

#241