Unidata / netcdf4-python

netcdf4-python: python/numpy interface to the netCDF C library

Home Page: http://unidata.github.io/netcdf4-python

Dataset class should support encoding parameter to override global attribute

thehesiod opened this issue · comments

In fact, having a global encoding property is a bad idea in the first place, because then you can't support multiple Datasets with different encodings. The real fix would be to deprecate netCDF4.encoding.

as an example, the MADIS meso files are encoded in what appears to be cp1252.

if someone from the team can give feedback on how they'd like this achieved I can work on this.

I suggest adding a set_encoding method to override the global value. There are already too many kwargs in Dataset.__init__. The new method should have a kwarg to optionally override the global value of unicode_error also.

not possible, because we need the encoding during init, for example when it calls _get_dims. Will look into unicode_error

ok, added support for encoding_errors; looking into why the unit tests are failing

Does netCDF really support arbitrary encodings for strings but not have any way of indicating them in the data model? That seems like a disaster waiting to happen...

from what I understand, yes! I ran into this with a MADIS mesonet dataset that had the string "Annœullin" in it (it contains the byte \x9c). From what I saw there was nothing specifying the encoding :( After some grepping around it seemed like this was most likely cp1252, from someone generating the files on a Windows box.
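
For reference, a short plain-Python snippet showing why that byte points at cp1252 (the place name is the one from the file; everything else is just illustration):

```python
# Byte 0x9c is the "œ" ligature in cp1252, but it is not valid UTF-8,
# so decoding with the wrong codec blows up.
raw = b"Ann\x9cullin"

print(raw.decode("cp1252"))          # -> 'Annœullin'

try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("utf-8 decode fails:", err)
```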

Some info: http://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html#bp_Strings-and-Variables-of-type-char

basically netCDF 3.x was not designed with a true "string" type; one was only added in netCDF 4.x. For a "self-describing" format this was a huge oversight.

Even with the netcdf-4 NC_STRING type there is no concept of an encoding in the data model. It's just stored as a string of bytes in the file, and the client has to know how to decode it into a string. I can't find anything in the CF metadata standard that relates to string encoding, so there's no standard way for the client to figure out how to decode it. The python interface always returns a string, and the encoding is currently defined by a global module variable. I gather the purpose of pull request #655 is to at least allow the user to change the encoding on a per-Dataset basis, so multiple Datasets can be accessed at once with different encodings specified.

When it comes to names of variables, dimensions, attributes, groups, and types, netcdf-c always uses UTF-8 encoding.

Just noticed this at http://www.unidata.ucar.edu/software/netcdf/docs/file_format_specifications.html

Note on char data: Although the characters used in netCDF names must be encoded as UTF-8, character data may use other encodings. The variable attribute “_Encoding” is reserved for this purpose in future implementations

and here http://www.unidata.ucar.edu/software/netcdf/docs/netcdf_utilities_guide.html

The netCDF char type contains uninterpreted characters, one character per byte. Typically these contain 7-bit ASCII characters, but the character encoding is application specific. For this reason, applications writing data using the enhanced data model are encouraged to use the netCDF-4 string data type in preference to the char data type. Applications writing string data using the char data type are encouraged to add the special variable attribute "_Encoding" with a value that the netCDF libraries recognize. Currently those valid values are "UTF-8" or "ASCII", case insensitive.

which suggests that for NC_STRING variables (and attributes?) we should look for an attribute _Encoding.
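
As a rough sketch (not the library's current behavior) of what honoring that convention could look like on the writing side; the file and variable names are made up, and only the _Encoding attribute name comes from the NUG text quoted above:

```python
import numpy as np
import netCDF4

name = "Annœullin".encode("utf-8")              # 10 bytes of UTF-8 char data

with netCDF4.Dataset("chars_example.nc", "w") as ds:   # hypothetical file
    ds.createDimension("nchars", len(name))
    v = ds.createVariable("station_name", "S1", ("nchars",))
    v[:] = np.frombuffer(name, dtype="S1")      # one character per element
    v.setncattr("_Encoding", "UTF-8")           # record the encoding for readers
```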

hmm, we still need to support files which don't specify _Encoding. But it sounds like support for _Encoding needs to be added to netcdf4-python? In my particular case there was no _Encoding attribute.

I doubt that it is really used much. Perhaps we should check for it though. @WardF or @DennisHeimbigner - if you get a chance, could you read through this thread and comment?

btw, another thing: my PR applies the encoding to all the places default_encoding was used before. Based on what you found, it sounds like some things, like attribute names, should always be UTF-8, with some restrictions (which I'm guessing aren't checked for in the Cython code).

You are correct: all netcdf names are assumed to be utf8, except that the character '/' is always disallowed. When names occur inside, say, a CDL file, then certain other characters must be back-slash escaped.

The default encoding for strings (as opposed to characters) is utf8, and ncdump, for example, will assume that in the absence of any _Encoding attribute. The _Encoding attribute is currently ignored in the netcdf-c code (reminder to self: add an issue about at least recognizing it).

Character data is the real problem. Technically, it also defaults to utf8, which means the ascii subset of utf8. In practice, characters can have any 8-bit bit pattern.

For character data, ASCII with errors='surrogateescape' can be a good option for decoding into Python strings, when there is some possibility that somebody has stuffed arbitrary bytes in there. At least then, you can safely encode back into bytes and decode with the right encoding.
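
A small round-trip illustration of that approach (plain Python, no netCDF involved):

```python
raw = b"Ann\x9cullin"                                  # unknown 8-bit char data

s = raw.decode("ascii", errors="surrogateescape")      # never raises, \x9c preserved
# ... later, once the real encoding is known, the original bytes can be
# recovered losslessly and decoded properly:
recovered = s.encode("ascii", errors="surrogateescape")
assert recovered == raw
print(recovered.decode("cp1252"))                      # -> 'Annœullin'
```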

One more point: character-typed and string-typed attributes must always be UTF8, because there is no way to specify _Encoding for an attribute.

The _Encoding attribute was under discussion recently on the CF mailing list. @rsignell-usgs and I were just this morning discussing this topic and decided to create an issue on the Unidata/netcdf-c repo to update the NUG wording around the _Encoding attribute.

hah, as a side-note, I just found that some MADIS mesonet files are NOT in cp1252 as they fail to decode with that encoding, so it seems the files are a mixture of encodings without specifying what the encodings are :(

update: even worse, some of the data is garbage, as it doesn't seem to be in any reasonable encoding... another idea, then, is to add encoding validation when setting string data.

So to summarize...

  1. we should look for an _Encoding attribute for NC_STRING variable data and use it; otherwise either use a default value or the value specified by a new set_encoding Dataset method (see the sketch after the questions below).
  2. for decoding character data into python strings we should use ASCII with errors='surrogateescape' (i.e. in the chartostring utility function). In stringtochar we should encode using ascii.
  3. For names of variables, dimensions, attributes and groups, always use UTF-8.
  4. For attributes (either string or character) Dennis suggests we should always use UTF-8.

Does this sound reasonable?

For (1) do we really need a set_encoding method, or should we just rely on an _Encoding attribute and use UTF-8 if it's not there?

For (4), could we look for a Dataset or Variable _Encoding attribute and assume it applies to NC_STRING and NC_CHAR attributes?
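
To make point 1 concrete, a minimal sketch of the reading-side lookup being proposed; the helper name, file name, and the utf-8 fallback are illustrative, only the _Encoding attribute comes from the NUG:

```python
import netCDF4

def encoding_for(var, default="utf-8"):
    # Prefer the variable's _Encoding attribute, otherwise fall back to a
    # default (or to a per-Dataset override, if one is added).
    return getattr(var, "_Encoding", default)

with netCDF4.Dataset("example.nc") as ds:        # hypothetical file
    for name, var in ds.variables.items():
        if var.dtype == str:                     # NC_STRING variable
            print(name, "->", encoding_for(var))
```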

Feedback on the parts affecting me:

  1. I'd prefer an encoding __init__ parameter, since the current code relies on the encoding during initialization. It also makes more sense to me: I don't think you want to be able to change the encoding of a Dataset after it has been opened; logically it should be fixed once the file is open.
  2. not sure what this means. I know of CDF "classic" files that aren't utf-8, and doing this will make consumers of those files more difficult (they'll have to decode each string twice)... instead, simply pass an encoding parameter, which could perhaps be named fallback_encoding (used when _Encoding is not specified).
  3 + 4. After doing this I think it will greatly simplify where encoding is used.

this issue is addressed by pull request #665, which adds detection of the _Encoding attribute and an encoding kwarg to chartostring.
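
For anyone landing here later, a short usage sketch of that new kwarg (the char array is built by hand just to keep the example self-contained):

```python
import numpy as np
from netCDF4 import chartostring

raw = "Annœullin".encode("cp1252")               # 9 bytes of cp1252 char data
chars = np.frombuffer(raw, dtype="S1")           # as read from an NC_CHAR variable
print(chartostring(chars, encoding="cp1252"))    # -> 'Annœullin'
```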

pull request #665 merged, closing for now.