Silently fails when providing incorrect schema for WEIGHT column
landmanbester opened this issue · comments
- dask-ms version: 0.2.14
- Python version: 3.8
- Operating System: Ubuntu20.04
Description
I just noticed that xds_from_ms silently fails to add the WEIGHT column to the dataset if the 'dims' entry in the provided schema is not a tuple. Strangely, passing a schema also overwrites the dimension names of known columns like FLAG and DATA, regardless of whether 'dims' is a tuple or not. Everything works as expected when no schema is provided.
What I Did
Running
from daskms import xds_from_ms
schema = {}
schema['WEIGHT'] = {'dims': ('corr')} # note the mistake here: ('corr') is just the string 'corr', 'dims' should be a tuple
xds = xds_from_ms('path/to/data.ms', columns=('DATA','WEIGHT','FLAG'), chunks={'row':-1, 'chan':8}, table_schema=schema)
will produce
In [38]: xds
Out[38]:
[<xarray.Dataset>
Dimensions: (row: 758160, FLAG-1: 32, FLAG-2: 4, DATA-1: 32, DATA-2: 4)
Coordinates:
ROWID (row) int32 dask.array<chunksize=(758160,), meta=np.ndarray>
Dimensions without coordinates: row, FLAG-1, FLAG-2, DATA-1, DATA-2
Data variables:
FLAG (row, FLAG-1, FLAG-2) bool dask.array<chunksize=(758160, 32, 4), meta=np.ndarray>
DATA (row, DATA-1, DATA-2) complex64 dask.array<chunksize=(758160, 32, 4), meta=np.ndarray>
Attributes:
__daskms_partition_schema__: (('FIELD_ID', 'int32'), ('DATA_DESC_ID', 'i...
FIELD_ID: 0
DATA_DESC_ID: 0]
Note that WEIGHT is missing and that DATA and FLAG have generic dimension names. When the schema gives 'dims' as a tuple, i.e.
schema['WEIGHT'] = {'dims': ('corr',)}
xds = xds_from_ms('path/to/data.ms', columns=('DATA','WEIGHT','FLAG'), chunks={'row':-1, 'chan':8}, table_schema=schema)
we get
In [41]: xds
Out[41]:
[<xarray.Dataset>
Dimensions: (row: 758160, FLAG-1: 32, FLAG-2: 4, corr: 4, DATA-1: 32, DATA-2: 4)
Coordinates:
ROWID (row) int32 dask.array<chunksize=(758160,), meta=np.ndarray>
Dimensions without coordinates: row, FLAG-1, FLAG-2, corr, DATA-1, DATA-2
Data variables:
FLAG (row, FLAG-1, FLAG-2) bool dask.array<chunksize=(758160, 32, 4), meta=np.ndarray>
WEIGHT (row, corr) float32 dask.array<chunksize=(758160, 4), meta=np.ndarray>
DATA (row, DATA-1, DATA-2) complex64 dask.array<chunksize=(758160, 32, 4), meta=np.ndarray>
Attributes:
__daskms_partition_schema__: (('FIELD_ID', 'int32'), ('DATA_DESC_ID', 'i...
FIELD_ID: 0
DATA_DESC_ID: 0]
Now WEIGHT is there but DATA and FLAG still have the wrong dimension names. If no schema is given, we get
In [43]: xds
Out[43]:
[<xarray.Dataset>
Dimensions: (row: 758160, chan: 32, corr: 4)
Coordinates:
ROWID (row) int32 dask.array<chunksize=(758160,), meta=np.ndarray>
Dimensions without coordinates: row, chan, corr
Data variables:
FLAG (row, chan, corr) bool dask.array<chunksize=(758160, 8, 4), meta=np.ndarray>
WEIGHT (row, corr) float32 dask.array<chunksize=(758160, 4), meta=np.ndarray>
DATA (row, chan, corr) complex64 dask.array<chunksize=(758160, 8, 4), meta=np.ndarray>
Attributes:
__daskms_partition_schema__: (('FIELD_ID', 'int32'), ('DATA_DESC_ID', 'i...
FIELD_ID: 0
DATA_DESC_ID: 0]
which has everything as expected. This is fairly low priority but I thought I would report it anyway. dask-ms should either throw an error if 'dims' is not a tuple or just convert it to a tuple. Either way, the dimension names of known columns should not be altered.
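To make the two options concrete, here is a minimal sketch of what such a check could look like. `normalize_dims` and its `strict` flag are hypothetical names made up for illustration; this is not dask-ms code and not necessarily how the fix should be implemented.

```python
def normalize_dims(column, dims, strict=False):
    """Hypothetical helper (not part of dask-ms): return 'dims' as a tuple.

    A bare string such as 'corr' is either rejected (strict=True)
    or wrapped into a one-element tuple (strict=False).
    """
    if isinstance(dims, str):
        if strict:
            raise TypeError(
                f"'dims' for column {column} must be a tuple, "
                f"got the string {dims!r}; did you mean ({dims!r},)?"
            )
        # Wrap the string instead of iterating over its characters
        return (dims,)
    # Lists and other iterables of names are converted to a tuple
    return tuple(dims)


schema = {'WEIGHT': {'dims': ('corr')}}  # same mistake as in the report
fixed = {col: {**spec, 'dims': normalize_dims(col, spec['dims'])}
         for col, spec in schema.items()}
```

With `strict=False` the faulty schema above is converted so that `fixed['WEIGHT']['dims']` is a one-element tuple, while `strict=True` raises a `TypeError` pointing at the offending column.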
Thanks for the very thorough reproducer @landmanbester.
Without digging into the code in too much detail, I'd speculate that ("corr") evaluates to "corr", which is then treated as an Iterable, so the dims end up being evaluated as ("c", "o", "r", "r"). I'd need to dig more to understand why FLAG and DATA aren't getting assigned the default dimension names in this case.
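The Python side of that speculation is easy to check on its own, independent of dask-ms. The variable names below are only for illustration, and the snippet says nothing about what dask-ms does internally with the value.

```python
# Parentheses alone do not make a tuple; the trailing comma does.
dims_wrong = ('corr')    # just the string 'corr'
dims_right = ('corr',)   # a one-element tuple

print(type(dims_wrong).__name__)   # str
print(type(dims_right).__name__)   # tuple

# Iterating over the string yields its characters, which is how
# ('c', 'o', 'r', 'r') could arise if the value is treated as an iterable.
print(tuple(dims_wrong))           # ('c', 'o', 'r', 'r')
print(tuple(dims_right))           # ('corr',)
```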
Probably related: