Non-standard MS columns have an auto-generated schema which is not chunked according to the logic used for standard dimensions
landmanbester opened this issue · comments
- dask-ms version: 0.2.11
- Python version: 3.8
- Operating System: Ubuntu 20.04
Description
I am trying to convert an MS to zarr, chunked by row and channel, and it's falling over with `ValueError: Codec does not support buffers of > 2147483647 bytes`
despite the chunks only containing 25000 rows and 128 channels (around 25 MB by my count). Somewhat weirdly, the error seems mostly harmless, because it does produce a dataset that I can subsequently read. I have not checked whether all the subtables are what they should be, though.
What I Did
Here is the full output from convert
$ dask-ms convert ms1_primary.ms -g "FIELD_ID,DATA_DESC_ID,SCAN_NUMBER" -o ms1_primary.zarr --chunks="{row:25000,chan:128}" --format zarr --force
2022-08-04 11:05:59,954 - dask-ms - WARNING - Ignoring 'FLAG_CATEGORY': Unable to infer shape of column 'FLAG_CATEGORY' due to:
'Table DataManager error: Invalid operation: TSM: no array in row 0 of column FLAG_CATEGORY in /home/bester/projects/ESO137/msdir/ms1_primary.ms/table.f18'
(the same warning repeats for rows 98332, 196664, 294996, 393328, 491660, 589992 and 688324)
2022-08-04 11:06:02,008 - dask-ms - INFO - Input: 'measurementset' file:///home/bester/projects/ESO137/msdir/ms1_primary.ms
2022-08-04 11:06:02,008 - dask-ms - INFO - Output: 'zarr' file:///home/bester/projects/ESO137/msdir/ms1_primary.zarr
2022-08-04 11:06:09,797 - dask-ms - WARNING - Ignoring SOURCE
2022-08-04 11:06:09,802 - dask-ms - WARNING - Ignoring 'DIRECTION': Unable to infer shape of column 'DIRECTION' due to:
'TableProxy::getCell: no such row'
2022-08-04 11:06:09,803 - dask-ms - WARNING - Ignoring 'TARGET': Unable to infer shape of column 'TARGET' due to:
'TableProxy::getCell: no such row'
> /home/bester/software/dask-ms/daskms/apps/convert.py(354)execute()
-> dask.compute(writes)
(Pdb) c
Traceback (most recent call last):
File "/home/bester/.venv/dms/bin/dask-ms", line 33, in <module>
sys.exit(load_entry_point('dask-ms', 'console_scripts', 'dask-ms')())
File "/home/bester/software/dask-ms/daskms/apps/entrypoint.py", line 9, in main
return EntryPoint(sys.argv[1:]).execute()
File "/home/bester/software/dask-ms/daskms/apps/entrypoint.py", line 32, in execute
cmd.execute()
File "/home/bester/software/dask-ms/daskms/apps/convert.py", line 354, in execute
dask.compute(writes)
File "/home/bester/.venv/dms/lib/python3.8/site-packages/dask/base.py", line 598, in compute
results = schedule(dsk, keys, **kwargs)
File "/home/bester/.venv/dms/lib/python3.8/site-packages/dask/threaded.py", line 89, in get
results = get_async(
File "/home/bester/.venv/dms/lib/python3.8/site-packages/dask/local.py", line 511, in get_async
raise_exception(exc, tb)
File "/home/bester/.venv/dms/lib/python3.8/site-packages/dask/local.py", line 319, in reraise
raise exc
File "/home/bester/.venv/dms/lib/python3.8/site-packages/dask/local.py", line 224, in execute_task
result = _execute_task(task, data)
File "/home/bester/.venv/dms/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/home/bester/.venv/dms/lib/python3.8/site-packages/dask/optimization.py", line 990, in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
File "/home/bester/.venv/dms/lib/python3.8/site-packages/dask/core.py", line 149, in get
result = _execute_task(task, cache)
File "/home/bester/.venv/dms/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/home/bester/software/dask-ms/daskms/experimental/zarr/__init__.py", line 187, in zarr_setter
zarray[selection] = data
File "/home/bester/.venv/dms/lib/python3.8/site-packages/zarr/core.py", line 1353, in __setitem__
self.set_basic_selection(pure_selection, value, fields=fields)
File "/home/bester/.venv/dms/lib/python3.8/site-packages/zarr/core.py", line 1448, in set_basic_selection
return self._set_basic_selection_nd(selection, value, fields=fields)
File "/home/bester/.venv/dms/lib/python3.8/site-packages/zarr/core.py", line 1748, in _set_basic_selection_nd
self._set_selection(indexer, value, fields=fields)
File "/home/bester/.venv/dms/lib/python3.8/site-packages/zarr/core.py", line 1800, in _set_selection
self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
File "/home/bester/.venv/dms/lib/python3.8/site-packages/zarr/core.py", line 2062, in _chunk_setitem
self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
File "/home/bester/.venv/dms/lib/python3.8/site-packages/zarr/core.py", line 2073, in _chunk_setitem_nosync
self.chunk_store[ckey] = self._encode_chunk(cdata)
File "/home/bester/.venv/dms/lib/python3.8/site-packages/zarr/core.py", line 2194, in _encode_chunk
cdata = self._compressor.encode(chunk)
File "numcodecs/blosc.pyx", line 557, in numcodecs.blosc.Blosc.encode
File "/home/bester/.venv/dms/lib/python3.8/site-packages/numcodecs/compat.py", line 155, in ensure_contiguous_ndarray
ensure_contiguous_ndarray_like(
File "/home/bester/.venv/dms/lib/python3.8/site-packages/numcodecs/compat.py", line 121, in ensure_contiguous_ndarray_like
raise ValueError(msg)
ValueError: Codec does not support buffers of > 2147483647 bytes
But I can still read the main table
Python 3.8.13 (default, Apr 19 2022, 00:53:22)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: from daskms import xds_from_storage_ms
In [2]: xds = xds_from_storage_ms('ms1_primary.zarr/')
In [3]: d = xds[0].DATA.values
In [4]: d.shape
Out[4]: (98332, 4096, 4)
so I am not sure what is happening here.
Thanks for reporting. Can you run the command again within pdb as follows:
$ python -m pdb $(which dask-ms) convert ms1_primary.ms -g "FIELD_ID,DATA_DESC_ID,SCAN_NUMBER" -o ms1_primary.zarr --chunks="{row:25000,chan:128}" --format zarr --force
and report on the dimensions of `zarray`, `selection` and `data` in the following part of the stack trace?
File "/home/bester/software/dask-ms/daskms/experimental/zarr/__init__.py", line 187, in zarr_setter
zarray[selection] = data
> But I can still read the main table

This is probably because it gets created upfront. I'll bet it's filled with zeros. What does `d.chunks` report?
Is it possible that it is actually a subtable that is causing the problem? I don't recall how those are chunked (or, indeed, if they are left unchunked).
The chunks are as expected, and DATA seems populated:
In [1]: from daskms import xds_from_storage_ms
In [2]: xds = xds_from_storage_ms('ms1_primary.zarr/')
In [3]: xds[0].DATA.chunks
Out[3]:
((25000, 25000, 25000, 23332),
 (128, 128, ..., 128),  # 32 chunks of 128, elided for brevity
 (4,))
In [4]: d = xds[0].DATA.values
In [5]: d[0]
Out[5]:
array([[ 1.4974735e+03+0.0000000e+00j, -1.8867255e+03+7.0398737e+02j,
-1.8867255e+03-7.0398737e+02j, 2.5700393e+03+0.0000000e+00j],
[ 4.8791138e+01+0.0000000e+00j, 1.0086360e-02+5.2753524e-03j,
1.0086360e-02-5.2753524e-03j, 4.0523239e+01+0.0000000e+00j],
[ 4.8889027e+01+0.0000000e+00j, 8.3513446e-02+5.6876391e-03j,
8.3513446e-02-5.6876391e-03j, 4.0596863e+01+0.0000000e+00j],
...,
[ 3.5337872e+01+0.0000000e+00j, 3.0996327e-04+5.0446814e-01j,
3.0996327e-04-5.0446814e-01j, 1.4307446e+01+0.0000000e+00j],
[ 3.5533756e+01+0.0000000e+00j, 2.2583008e-02+5.0612271e-01j,
2.2583008e-02-5.0612271e-01j, 1.4375396e+01+0.0000000e+00j],
[ 3.6027515e+01+0.0000000e+00j, -8.5114129e-03+5.1566869e-01j,
-8.5114129e-03-5.1566869e-01j, 1.4549672e+01+0.0000000e+00j]],
dtype=complex64)
I also suspect it may be one of the subtables, because it happens right at the end of a run. I am still waiting for it to fall over again so I can report the information you asked for, @sjperkins (unfortunately oates is acting up again and things are taking forever).
> Is it possible that it is actually a subtable that is causing the problem? I don't recall how those are chunked (or, indeed, if they are left unchunked).
That'd be a really big subtable if a column has ~2GiB of data.
Also, just spitballing some figures (complex64 == 8 bytes):
98332 x 4096 x 4 x 8 ~= 12GiB
25000 x 128 x 4 x 8 ~= 97MiB (which should be fine)
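The figures above can be sanity-checked directly; a minimal sketch, with the 2147483647-byte (2**31 - 1) limit taken from the ValueError in the traceback:

```python
# Check the spitballed sizes against the blosc codec's maximum buffer size.
CODEC_LIMIT = 2**31 - 1  # 2147483647 bytes, from the ValueError above
ITEMSIZE = 8             # complex64

full_array = 98332 * 4096 * 4 * ITEMSIZE  # entire DATA array for one group
one_chunk = 25000 * 128 * 4 * ITEMSIZE    # one requested (row, chan) chunk

print(f"full array: {full_array / 2**30:.1f} GiB")          # ~12.0 GiB
print(f"one chunk:  {one_chunk / 2**20:.1f} MiB")           # ~97.7 MiB
print("chunk fits codec limit:", one_chunk <= CODEC_LIMIT)  # True
```

So a correctly chunked DATA column should never come near the limit.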
> I also suspect it may be one of the subtables because it happens right at the end of a run.
Hmmmm that is interesting...
> I don't recall how those are chunked (or, indeed, if they are left unchunked).
They are left unchunked. One way of finding out if there are large subtables would be to do something like a:
$ du -hs ms1_primary.ms/
It does not seem that way
$ du -h ms1_primary.ms/
32K ms1_primary.ms/SOURCE
32K ms1_primary.ms/ANTENNA
20K ms1_primary.ms/FLAG_CMD
20K ms1_primary.ms/PROCESSOR
44K ms1_primary.ms/FEED
160K ms1_primary.ms/SPECTRAL_WINDOW
20K ms1_primary.ms/DATA_DESCRIPTION
28K ms1_primary.ms/OBSERVATION
24K ms1_primary.ms/POLARIZATION
20K ms1_primary.ms/STATE
96K ms1_primary.ms/POINTING
28K ms1_primary.ms/FIELD
28K ms1_primary.ms/HISTORY
387G ms1_primary.ms/
> They are left unchunked.

Actually, this isn't quite true. A default chunking of 10,000 rows is applied.
> It does not seem that way
OK, it must be one of the DATA columns then. If you still have your initial attempt at writing the zarr dataset lying around, can you do a:
from pprint import pprint
from daskms import xds_from_ms

datasets = xds_from_ms(...)
pprint(list(dict(ds.chunks) for ds in datasets))
Ah, I have some non-standard columns in there
In [1]: from daskms import xds_from_storage_ms
In [2]: xds = xds_from_storage_ms('ms1_primary.zarr/')
In [3]: from pprint import pprint
In [4]: pprint(list(dict(ds.chunks) for ds in xds))
[{'RESIDUAL-1': (4096,),
  'RESIDUAL-2': (4,),
  'chan': (128, 128, ..., 128),  # 32 chunks of 128, elided for brevity
  'corr': (4,),
  'row': (25000, 25000, 25000, 23332),
  'uvw': (3,)},
 ...]  # the remaining seven datasets have identical chunking
In [5]: r = xds[0].RESIDUAL.values
In [6]: r[0]
Out[6]:
array([[0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j],
...,
[0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j]], dtype=complex64)
Looks like the RESIDUAL column is written with frequency chunks of 4096. This works at the outset because it can be compressed?
Ah, could this be a schema thing? A column not in the default schema will not know about the `chan` axis.

Edit: Posted before I saw the above. I think this is definitely the root cause.
Ah yes: 25000 x 4096 x 4 x 8 ~= 3GiB, which is over the 2147483647-byte codec limit.
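A quick check (a sketch; all figures from this thread) confirms that a single chunk of the mis-chunked RESIDUAL column overflows the buffer size the blosc codec will encode:

```python
# One RESIDUAL chunk: 25000 rows x 4096 channels x 4 correlations of complex64
# (8 bytes per value), since the auto-generated schema did not chunk the
# channel axis.
chunk_bytes = 25000 * 4096 * 4 * 8
print(chunk_bytes)               # 3276800000 (~3.05 GiB)
print(chunk_bytes > 2147483647)  # True: exceeds the codec's max buffer
```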
> Ah, could this be a schema thing? A column not in the default schema will not know about the `chan` axis.
@JSKenyon and I just discussed this in a meeting. The problem here is that there are non-standard columns in the MS.
As it stands, `xds_from_*` takes a schema argument allowing one to configure this properly. A couple of solutions might be possible here:
1. A `--schema` argument for `dask-ms convert`.
2. Writing a `__daskms__attributes__` column keyword into the MS column, containing the dimension schema.
My 2 cents:
1. requires the user to know which non-standard columns exist in the MS upfront. You could maybe bail out with a nice informative error message that tells the user which non-standard columns to specify a `--schema` for, but that is a bit clunky.
2. will only solve the problem if the column was actually written by dask-ms, so you would still run into the issue. Although you could probably resort to 1) if non-standard columns are detected.

A possible alternative (albeit not a very clean one) would be to check whether unknown dimensions match existing dimensions in the MS and then chunk them the same (e.g. in the above case, RESIDUAL-1 matches chan along axis 1 and could be chunked the same). I suspect this will work 99% of the time. Maybe print a warning if this is the case, throw an error if any non-standard columns don't match any existing dimensions, and resort to 1).
> 1. requires the user to know which non-standard columns exist in the MS upfront. You could maybe bail out with a nice informative error message that tells the user which non-standard columns to specify a --schema for but that is a bit clunky.
This may be the best option as it can be detected quickly. That said, this could get very unwieldy if an MS has lots of non-standard columns.
> 2. will only solve the problem if the column was actually written by dask-ms so you would still run into the issue. Although you could probably resort to 1) if non-standard columns are detected.

It is true that this wouldn't fix the problem if the column was written by software not using dask-ms. However, it might be a decent 90% solution, with the remaining 10% solved by option 1.
> A possible alternative (albeit not a very clean one) would be to check if unknown dimensions match existing dimensions in the MS and then chunk them the same (eg. in the above case RESIDUAL-1 matches 'chan' along axis 1 and could be chunked the same). I suspect this will work 99% of the time. Maybe print a warning if this is the case, throw an error if any non-standard columns don't match any existing dimensions and resort to 1).
This is possible, but it can be slightly brittle. For the MAIN table this is plausible, as we could do as you suggest, with a dim priority in the event that there are dims of the same size. My suggested priority would be `[chan, corr, uvw]` for the MAIN table, i.e. a `(nrow, 3)` column will prioritise `(row, chan)` over `(row, uvw)` if the known channel dimension is 3. Option 1 would then only be needed in cases where the user is doing something unusual, i.e. adding a new column with `uvw` as a dim.
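That size-matching-with-priority idea could be sketched roughly as follows (a hypothetical helper, not part of the dask-ms API; the dim sizes are this MS's values and the fallback name mimics the auto-generated per-column dims seen above):

```python
# Hypothetical sketch of the proposed heuristic: given the trailing shape of an
# unknown MAIN-table column, guess dimension names by matching sizes against
# known dims, trying them in a fixed priority order.
KNOWN_DIMS = {"chan": 4096, "corr": 4, "uvw": 3}  # example values for this MS
PRIORITY = ["chan", "corr", "uvw"]                # chan beats uvw on a size tie

def guess_dims(shape):
    """Map each non-row axis size to a known dim name, or a generic fallback."""
    names = ["row"]
    for axis, size in enumerate(shape, start=1):
        for dim in PRIORITY:
            if KNOWN_DIMS[dim] == size:
                names.append(dim)
                break
        else:
            # No match: fall back to an auto-generated per-column dim name
            names.append(f"COLUMN-{axis}")
    return tuple(names)

print(guess_dims((4096, 4)))  # ('row', 'chan', 'corr') - e.g. RESIDUAL
print(guess_dims((3,)))       # ('row', 'uvw') here, since chan is 4096, not 3
```

With such a mapping in hand, the converter could then reuse the existing chan/corr chunking for the matched axes, warning the user that the dims were inferred.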
Unfortunately, none of the above succeed in completely hiding this from the user, although option 2 will come close for our software, e.g. QuartiCal and pfb-clean. I think that adding non-standard columns is relatively unusual in the legacy stack (outside of CubiCal etc.).
Finally, all of the above is only true for the main table. Subtables are probably even tougher to deal with, as they each have different dims. On top of that, we need to remember that `xds_to_table` can technically write arbitrary new tables, though this is less of a problem if option 2 is in place.
How about applying chunking heuristics to DATA-like columns only?
Can you think of any other column "schemas" that QuartiCal/pfb-clean/CubiCal use?
> How about applying chunking heuristics to DATA-like columns only?
That will suffice for my purposes.
> Can you think of any other column "schemas" that Quartical/pfb-clean/Cubical use?
Only the WEIGHT column, but I believe that will be deprecated eventually and I am strongly opposed to using it anyway.
@Athanaseus just got hit by this again. Converting an MS with non-standard columns leaves the resulting dataset in a state that is hard to deal with, since the non-standard columns will not have the expected chan and corr dimensions.
> Can you think of any other column "schemas" that Quartical/pfb-clean/Cubical use?
This MS also has a BITFLAG column
I should block off some time to look at this tomorrow. One possible workaround is to use the `--exclude` flag, if the column is unnecessary.
Thanks @sjperkins. This is the column he wants to image. Actually, he used the `--exclude` flag to drop the dozen or so other non-standard columns already (I guess an inevitable side effect of experimentation). I can wrangle the dataset into shape manually for now.
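Such manual wrangling might look roughly like this; a sketch assuming xarray datasets like those returned by `xds_from_storage_ms`, with an illustrative shape and the auto-generated `RESIDUAL-1`/`RESIDUAL-2` dim names seen earlier in this thread:

```python
import numpy as np
import xarray as xr

# Stand-in for one dataset read back from the converted zarr store: the
# non-standard RESIDUAL column came through with auto-generated dim names.
ds = xr.Dataset(
    {"RESIDUAL": (("row", "RESIDUAL-1", "RESIDUAL-2"),
                  np.zeros((10, 16, 4), dtype=np.complex64))}
)

# Rename the auto-generated dims to the expected MS dims.
fixed = ds.rename({"RESIDUAL-1": "chan", "RESIDUAL-2": "corr"})
print(fixed.RESIDUAL.dims)  # ('row', 'chan', 'corr')
```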
> Thanks @sjperkins. This is the column he wants to image. Actually he used the `--exclude` flag to drop the dozen or so other non-standard columns already (I guess an inevitable side effect of experimentation). I can wrangle the dataset into shape manually for now
Are these columns shaped like DATA/FLAG?