planetarypy / pvl

Python implementation of PVL (Parameter Value Language)

Continue to parse "mildly" invalid labels (e.g. when a UTF-8 character is found in meta-data)

msbentley opened this issue

Is your feature request related to a problem? Please describe.
As discussed in MillionConcepts/pdr#12 there are some PDS3 products which are technically invalid, but for which a straightforward workaround exists. In this case, a UTF-8 character in a label causes PVL to reject it, but removing or otherwise flagging this character and continuing to read the file would be preferable (or else providing a strict mode which would fail on this, and a permissive mode which would continue).

Describe the solution you'd like
A way to configure whether pvl strictly fails to read a product due to minor standards infractions would be great. I note that astropy.io.votable has a pedantic keyword in its parsing routines which can be configured to strictly enforce the standard or allow for some "fuzziness". The same package also provides an exception class, votable.exceptions.VOTableSpecWarning, which warns when the spec is not met but the data can still be read; that would be another option.

Additional context
See MillionConcepts/pdr#12 for further details and the link to the failing product.

Yes, I would consider this PVL-text that pvl should parse, but the question is how it should be treated.

Here's my thinking: by design, PVL and its family of "dialects" accept "most of the ISO 8859-1 'latin-1' character set with some exclusions" (except for ODL, which is ASCII-only, but nobody cares about ODL). It seems like it should be easy to relax that for the Omni grammar, which is our already-existing "fuzzy", omnivorous PVL-text parser, and just allow any character. That will solve the pvl.load() part of this and allow the label to be parsed. However, the second part, writing this now-parsed text back out, will be a problem. There is no "legal" way to write out that degree symbol in any "output" PVL dialect (their character-set limitations don't allow it), so a user of pvl.dump() will get an exception if they try it, and will then be forced to go through their dict-like, decide what to do about the UTF-8 character(s), and use whatever methods they like to remove or translate them. Of course, users of pvl need not use pvl.dump() at all, and can do whatever they like with the dict-like returned from pvl.load(), so maybe this wasn't even on your mind.
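
For what it's worth, here is a minimal sketch of that cleanup step, assuming the label has already been loaded; the degree-sign substitution and output filename are just illustrative, and it assumes pvl.PVLModule will accept a list of key/value pairs:

    import pvl

    label = pvl.load("MAG_20141125_DOY329_D001_V1.TAB")

    # Translate characters the output dialect can't represent (e.g. a degree
    # sign) in top-level string values; a real cleanup might need to recurse
    # into nested groups and objects.
    cleaned = pvl.PVLModule(
        [
            (key, value.replace("\N{DEGREE SIGN}", "deg"))
            if isinstance(value, str)
            else (key, value)
            for key, value in label.items()
        ]
    )

    pvl.dump(cleaned, "cleaned.lbl")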

And now a digression on software philosophy, regarding a "strict" parsing concept: the pre-1.0 pvl had a flag like this, but as I developed the 1.0 architecture, I felt that was the wrong paradigm. The 1.0 pvl architecture has what I generically call "dialects" of PVL. In the codebase, a given "dialect" is served by a set of grammar, parser, decoder, and encoder objects that work together. But just like LEGO bricks, you can put them together in different combinations. So in addition to the PVL, ODL, PDS3, and ISIS dialects, there is also an "Omni" dialect that is meant to be a catch-all. The default "load" functionality uses that "Omni" dialect, so the intent of the code is already to allow "fuzziness" on loading PVL-text (the act of properly defining the "fuzz" on the "fuzziness" is ongoing with issues like this). If you want to be "strict" about a particular dialect, you can explicitly enable that during pvl.load() via the grammar, parser, and decoder arguments. On dump, the default functionality is to write out the PDS3 dialect (but that, too, can be changed). This is how the pvl_validate program diagnoses the PVL-text you give it and can tell you what is wrong.
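
For anyone following along, a rough sketch of what that "strict" combination looks like in practice; the class names below are my best guess at the 1.x bricks (pvl.grammar.PVLGrammar, pvl.parser.PVLParser, pvl.decoder.PVLDecoder), the label path is a placeholder, and the docs should be checked for exact spellings:

    import pvl
    from pvl.grammar import PVLGrammar
    from pvl.parser import PVLParser
    from pvl.decoder import PVLDecoder

    # Default behavior: the permissive "Omni" dialect.
    fuzzy = pvl.load("label.lbl")

    # Strict behavior: hand pvl.load() the bricks for the PVL dialect, so any
    # text that violates that dialect raises instead of being shrugged off.
    strict = pvl.load(
        "label.lbl",
        grammar=PVLGrammar(),
        parser=PVLParser(),
        decoder=PVLDecoder(),
    )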

One could make the argument that we should still put in some sort of "easy" argument for strict parsing, rather than forcing a user to set a proper grammar, parser, and decoder, but there hasn't been a lot of call for it.

I can start noodling on making the Omni-parser not care about the character set.

Great, sounds good @rbeyer - I have to say I hadn't considered the pvl writing option at all, no (I personally have little need for that). As I continue to plough through the PSA PDS3 holdings I'll raise tickets as I come across any other similar things!

Ugh, I hate character encodings. They make me question the fundamental nature of reality: "what do all these bytes really mean, man?"

Anyway, turns out that the file did NOT have UTF-8 characters, it had latin-1 (ISO/IEC 8859-1) characters that were not even being properly pulled out of the source file before they could be parsed. I spent a lot of time discovering that, but the fix itself was trivial, so that's good. Turns out file -I is really useful.

:-D Actually, I think I further confused matters in the original issue by giving one filename in the example, and an FTP link to a completely different file! One has 8859 and one has UTF-8 characters - I was looking at the latter, but gave the link to the former...

So probably you were looking at this one:

$ file -i BIO_20141125_DOY329_D001_V1.TAB
BIO_20141125_DOY329_D001_V1.TAB: text/plain; charset=iso-8859-1

Great that this is solved! I originally meant this one:

ftp://psa.sciops.esa.int/pub/mirror/VENUS-EXPRESS/MAG/VEX-V-Y-MAG-3-EXT4-V1.0/DATA/ORB201411_D001/MAG_20141125_DOY329_D001_V1.TAB

$ file -i MAG_20141125_DOY329_D001_V1.TAB 
MAG_20141125_DOY329_D001_V1.TAB: text/plain; charset=utf-8

Sorry for the confusion, but perhaps we can kill two birds with one stone?

That file is also encoded as latin-1 for me. I downloaded it via the PSA download service in a .zip file, and unzipped it: the file was charset=iso-8859-1. I wgetted it, and the file was charset=iso-8859-1. How are you getting a UTF-8 version?

Yes, you're right! I must have messed with it somehow in my first attempts to get it to load; I downloaded again from the FTP and indeed get charset=iso-8859-1 - sorry for the noise!

Not at all. The thing is that I think that pvl should be properly parsing UTF-8. So I'm actually interested in your UTF-8 version that is not parsing correctly, because it should, too.

Here is the unintentionally edited file:
MAG_20141125_DOY329_D001_V1.zip

First off, thank you. The PSA-provided latin-1 file was not parsing correctly, so that really was a bug.

And this business with UTF-8 is a slightly different wrinkle, but still interesting. Whatever process you performed to convert the latin-1-encoded file to UTF-8 converted the latin-1 degree symbol to the UTF-8 replacement character, not to the UTF-8 degree symbol (ironically, if it had, the file would have parsed without any trouble, and you never would have surfaced the latin-1 bug).

This time, the whole block of text was "considered", but then there's a place in the parsing where the PVLGrammar object declares that any character with a code point (what you get when you run ord() on a single character) greater than 255 isn't "allowed" (per the PVL spec), so the parser ends the parsing of that hunk of text and is left with something that starts with a quote mark but doesn't end with one, and this is what errors later on when the decoder is trying to extract structure.

The simple fix here is to allow the OmniGrammar object (which descends from the PVLGrammar object) to not care what character value it gets, so it would happily consume every character, including the UTF-8 "replacement character". The problem is then that these funky characters would end up in the strings in the returned dict-like, but at least you'd get a dict-like returned; it would be garbage-in-garbage-out, but that's really the user's fault. So my thinking is to go ahead and make that change. Let me know if you think we shouldn't make this change, and that this condition should still properly error.
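
To make the distinction concrete, here's a standalone toy check (not pvl's internal code, just an illustration of the strict spec stance versus the proposed anything-goes Omni stance):

    def pvl_spec_allows(char: str) -> bool:
        """Roughly the PVL-spec stance: no code points above 255 (the real
        spec excludes some lower characters too)."""
        return ord(char) <= 255

    def omni_allows(char: str) -> bool:
        """Proposed Omni stance: accept every character, including the UTF-8
        replacement character, and leave any cleanup to the user."""
        return True

    text = 'NOTE = "temperature in \N{DEGREE SIGN}C, plus a stray \ufffd"'
    print(all(pvl_spec_allows(c) for c in text))  # False: U+FFFD stops strict parsing
    print(all(omni_allows(c) for c in text))      # True: parsing just keeps going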

I am happy if we don't fail, and then the user has to deal with any implications later - at least then things like pdr will still be able to pull out the data. I would also be happy for a more specific exception (here be dragons) to be thrown that tools could decide to act on or ignore.

The problem is that you either parse it, or you throw an exception. Sure you could establish some mechanism that "ignores exceptions" but in truth this requires a lot of design thinking, because there are some exceptions that get thrown because the parsing cannot possibly continue (the PVL-text is fundamentally broken in some way), and then there are some that maybe you could stumble past and keep going.

It seems like the thing to do is to parse characters when we can, pass the characters along, and leave it up to the user to deal with the funky characters in the returned dict-like, if they care. Users can always choose the PDS3 parser if they really want funny characters to fail, but by default, we should probably take all comers.

Thanks for the dialog, Mark.

Would there be something wrong with parsing it and giving a warning?

No, but what would the warning be about? The changed behavior is: the user gives PVL-text to pvl, pvl recognizes the structure as PVL structure, puts that PVL-text into a dict-like, and gives it back to the user. The fact that some of those characters may be Unicode code points above 255 is no less weird than a single-line comment that starts with an octothorpe (#), lines that end in a dash intended to mean line continuation, or keys with empty values (none of those things are allowed in PVL or PDS3-compliant PVL, but we parse all of them into a dict-like). I guess the library isn't particularly chatty about any of that, and we could rig up a bunch of logging calls.

If the user is concerned about emojis or Tibetan characters in the text, they probably shouldn't be relying on the default, parse-everything-we-can nature of the pvl library to warn them about that. If they want to determine whether PVL-text with emojis in it is "valid" PVL or PDS3-compliant PVL, then we do have strict parsers that can tell them that, and a handy command-line utility.

I was just concerned that you approached it as a dichotomy (parse or throw), while I felt there's plenty of "in-betweens".

I have discovered a wrinkle:

Prior to this PR, if pvl encountered a file that it could not wholly decode as UTF-8, the assumption was that this file was some attached label nonsense (ISIS cubes, IMG files, etc.), and so the bytes were extracted one at a time until a byte could not be decoded to a UTF-8 character, and then all of the decoded characters were sent to the PVL-text parsing algorithms.

This typically meant fast runs of pvl.load() even on large PDS .IMG files, because it would fail fast, and just harvest the PVL-text that could be parsed.

However, with the PR as written, as it is piecing through the bytes, if it can't decode the byte as UTF-8, it tries to do so as latin-1. However, what I didn't realize earlier this week is that since latin-1 can "decode" any 8-bit byte you give to it, this means that pvl no longer "fails fast": whole data files are converted into text (sometimes MB of data) and passed to the pvl parsers, which only properly deal with the first part, and you lose time ingesting a bunch of bytes that you don't need.
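
A stripped-down sketch of the difference (this illustrates the idea, and is not the actual extraction code in the PR):

    def harvest_label_text(raw: bytes, latin1_fallback: bool = False) -> str:
        """Illustrative byte-at-a-time harvesting. Pre-PR behavior is
        latin1_fallback=False: stop at the first bytes that aren't valid UTF-8
        (fail fast on attached labels). The PR's behavior is latin1_fallback=True:
        undecodable bytes become latin-1 characters, which never fails, so whole
        data files get turned into text."""
        chars = []
        i = 0
        while i < len(raw):
            for width in (1, 2, 3, 4):            # UTF-8 code points are 1-4 bytes long
                try:
                    chars.append(raw[i:i + width].decode("utf-8"))
                    i += width
                    break
                except UnicodeDecodeError:
                    continue
            else:
                if latin1_fallback:
                    chars.append(raw[i:i + 1].decode("latin-1"))  # always succeeds
                    i += 1
                else:
                    break                          # assume we've hit the data block
        return "".join(chars)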

There are three solutions that I can see, other suggestions welcome:

  1. Introduce a character-encoding detection library like chardet, such that if a file fails wholesale UTF-8 decoding, you pass it to chardet, which should be able to quickly detect whether the file is a latin-1 text-only label file or a file which contains a bunch of "data" bytes (similar to what file -I snoops from your command line); see the sketch after this list. The advantage here is that this should allow parsing of large IMG or ISIS files to be as fast as they are now, but also allow parsing of latin-1 text files, as needed. The downside is the introduction of a required dependency. This isn't a deal-breaker, but it is always cause for reflection for me.
  2. Parse-by-character or parse-by-chunk. Right now, the PVL-text is harvested from the source file, and then provided to the pvl parsing algorithms as a Python string. This makes the interface very, very clean and simple. I think the code could be altered to feed characters one at a time to the tokenizer, which could then pass tokens on to the parsers, such that the parsers would detect "end-of-PVL" and stop parsing when tokens coming out of the tokenizer broke the rules. The advantage here is that this should allow parsing of IMG or ISIS files to be as fast as they are now, it would allow parsing of latin-1 text, and wouldn't require any external dependencies. The downside is that I'm not sure how much work it would be to do this.
  3. If it ain't UTF-8 (of which ASCII is a proper subset), we don't parse it. This means that if someone tried to parse a latin-1 file, pvl would just fail (and that seems bad, since apparently the PSA is producing files encoded to latin-1). In Mark's case, if he had gotten that latin-1 file and converted it to UTF-8 as he did, we would now happily parse that "replacement character", as long as he didn't care that there was a strange question-mark UTF symbol there instead of a degree symbol. The advantage here is that this is almost trivial from a software perspective (I just roll back one of the two commits in the PR). The downside is that users would have to do a manual conversion step from latin-1-encoded text files to UTF-8 before providing them to pvl.load().
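
The sketch promised in item 1, showing how chardet might be wired in; chardet.detect() is the real API, but the surrounding decision logic is only an assumption about how it could be used:

    import chardet

    def harvest_text(raw: bytes):
        """Option 1 sketch: if the whole buffer isn't UTF-8, ask chardet
        whether it looks like a latin-1 text-only label before giving up."""
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            guess = chardet.detect(raw)
            # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
            if guess["encoding"] and guess["encoding"].upper() == "ISO-8859-1":
                return raw.decode("latin-1")
            return None  # probably an attached label; harvest the text another way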

In any case the PSA should probably not be encoding text in latin-1. The PDS3 spec "requires" just ASCII characters, but clearly it is already too late if there's a degree symbol in an archived label.

with the PR as written, as it is piecing through the bytes, if it can't decode the byte as UTF-8, it tries to do so as latin-1. However, what I didn't realize earlier this week is that since latin-1 can "decode" any 8-bit byte you give to it...

could it be worth checking to see whether the offending byte is in the 0020-007E; 00A0-00BE range? or do you think there will be enough failure cases for this check that it will be a waste of time? (I suppose there could be a binary array file in which all byte sequences were valid UTF-8, for that matter, although I think it would be a very contrived example.)

@m-stclair, thanks, that's an interesting approach, but it cuts out too much, I would think. However, it knocks on the door of a solution that my son suggested, which I like better than my above suggestions (and which I'll get to below). If we have an example of someone encoding a latin-1 degree symbol (which is in the range you indicated), why might we not assume that they would also encode a letter with a diaeresis (a.k.a. an umlaut, outside the range you indicated) when writing a data provider's name or institution? This isn't just about that range, but any arbitrary range that we could specify. The problem is that all latin-1-encoded characters, individually, could be entirely valid, and any single byte can be decoded to a latin-1 character (which is not true of UTF-8).

A fourth solution was suggested to me by my son:

Given that the labels we're interested in reading are mostly English-language, what is the likelihood that there will be more than x characters in a row from the upper 128 characters of the latin-1 character set (which are mostly special characters like the degree sign and accented letters)? Probably pretty low. So we set a limit counter such that x is 3 or 5 or something, and stop decoding bytes after we get that many characters of that kind in a row. The advantage here is that it still "fails fast" (it takes a little longer because it will read a few more bytes, but not all that many more), doesn't require an external library, and would enable parsing of latin-1-encoded text files.

This works because the first 128 "character points" of ASCII, UTF-8, and latin-1 are identical, such that any byte with a value less than 128 gets decoded to the same character in all three systems. So if we're decoding bytes one byte at a time, our first try is to UTF-decode that byte. If that works, then we're good. If it fails, that means our byte is 128 or greater, and we can latin-1-decode it. This will always work, but it is guaranteed to give a character from that upper set of accented characters, and we can just keep a counter of how many of those there are in a row. If we get a run of more than x in a row, we're probably decoding bytes that are not meant to be characters but are meant to be data, so we stop.
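
A sketch of that counter idea; the function name and the default limit are made up for illustration:

    def harvest_with_run_limit(raw: bytes, max_run: int = 3) -> str:
        """Decode byte-by-byte. Bytes below 128 mean the same thing in ASCII,
        UTF-8, and latin-1; bytes of 128 and above are latin-1-decoded, but a
        run of more than max_run of them in a row is assumed to be data."""
        chars = []
        run = 0
        for b in raw:
            if b < 128:
                chars.append(chr(b))  # identical in all three encodings
                run = 0
            else:
                run += 1
                if run > max_run:
                    del chars[-(run - 1):]  # drop the trailing run; it was probably data
                    break
                chars.append(chr(b))  # latin-1 maps the byte straight to this code point
        return "".join(chars)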

When I was describing this problem to my son, I explained that these were typically attached labels which specify how long the label is and when the data starts. He asked why I couldn't just use that specifier to decide when to stop decoding bytes. I decided that if I can't trust someone to follow the spec and not encode in anything other than ASCII, I certainly can't trust them to be truthful about the length of their label. Fortunately, I think we can use the above approach to examine the data itself and make good decisions about what to try to parse as PVL-text.

I love this solution.

Grrr, it was such a great solution on paper. However, there are files (like the one indicated in #95) where there is 26 MB of image data, but when you slice it into 8-bit bytes, every other byte can be decoded to a value less than 128, so it gets decoded as UTF-8, the counter of continuous upper latin-1 characters never gets above "1", and we end up reading the whole file.

I also experimented with chardet, even using its "incremental" detection facility, but for large IMG or ISIS files, it isn't getting enough "text" to get confident to make a "text encoding" decision until it reads the whole file, which doesn't save us any time.

I also experimented with counting continuous "runs" of the upper latin-1 characters, as suggested above, and with the total number of them in the decoded characters accumulated to date, and played with a limit on the total fraction of upper latin-1. While this works for some files, it does not "fail fast" in the example case above, where half the bytes can be UTF-8 bytes, and any decision you make about what that "fraction threshold" should be is easily violated.

This is surprisingly complicated.

It'll take me some time to investigate possibility number 2, and I already don't like it because it will mess with the architecture. And I'm leaning more and more towards possibility number 3: If it is a latin-1 encoded text file, pvl will not parse it, and it is up to the user to convert their latin-1-encoded file to UTF.

I suppose there's a fifth option where we could add an encoding= argument to pvl.load() so that a user could explicitly select how pvl should decode the bytes in the file (which is probably why there are encoding= arguments to open() and pathlib.Path.read_text(): their developers didn't want to try to guess automatically either!). This way, if a user knows they have one of these goofy latin-1-encoded separate PVL files, they can specify that (without having to mess with conversion themselves, but they do have to know). I think it is unlikely that there would be latin-1-encoded text at the top of an attached label, since presumably that would have gone through more rigorous PDS/PSA validation (but what do I know?), so we can go back to the existing fail-fast mechanism for IMG and ISIS files by removing the attempt to decode those bytes as latin-1 characters. Would that be acceptable? Or is that insufficient?

Wow, this is remarkably obnoxious, isn't it? I think it sounds like:

  • there is no correct solution to this problem
  • latin-1 is bad in labels and maybe just a bad character encoding
  • attempting to characterize the likely distribution of wrongly-encoded characters in PVL-texts is a dissertation idea for a CS PhD student whom you wish ill upon
  • attached labels are bad
  • possibility 2 is a development nightmare and will probably result in a worse pvl

I suppose I'm inclined to suggest possibility 3, perhaps along with adding a note about the possibility of a bad character encoding to the text of one or more exceptions? Possibility 5 seems like a minor headache for even less benefit -- it seems to me that most pvl end users who know what a character encoding is can convert their own texts, and if they don't know, an encoding kwarg won't help them. @cmillion, do you have thoughts on this?

Yes, maybe an encoding= kwarg is superfluous. If a user already knows they have PVL-text in a funny encoding, they can just decode themselves, and then pass the decoded Python string to pvl.loads() instead of us hanging an extra argument onto pvl.load().

The PR has now been amended to not try and parse latin-1-encoded characters.

Adding an explicit warning about character encoding might be a red herring when the text harvesting stops because it has hit a data byte that cannot be decoded to UTF-8. In that case, we don't want an error, but we don't yet know if it is a bad character that is "inside" the PVL-text or if it is a bad character because it is data.

After the text is harvested and passed to the parsers, a bad character "inside" the PVL-text will likely generate a LexerError, because the PVL-text just appears to suddenly stop, which likely breaks a PVL element. LexerErrors show the text before the error and provide a line and column coordinate, as well as the total count of characters into the PVL-text where the error was encountered, so they are hopefully already sufficiently helpful.

So, to recap against the original impetus for this Issue: A UTF-encoded file will now always be successfully harvested and parsed (this was failing prior to the PR because the OmniParser was mistakenly obeying the strict PVL specification about allowed characters, mostly ASCII), so the file that Mark had on his hard drive that was failing will now work (so at least that's fixed).

Text files encoded as latin-1 (ISO 8859-1) will not be correctly parsed if they have non-ASCII characters (like a degree sign) in them, but if they are latin-1-encoded and only contain ASCII characters (like they should), then pvl happily processes them.

If you do have a latin-1-encoded file with such characters that is causing pvl to error, you can convert it to a UTF-8 file (degree signs and emoji included) and then feed that file to pvl.load(). Alternatively, in Python you can read the file via open() or pathlib.Path.read_text() with encoding="latin-1" and then pass the resulting Python string to pvl.loads().
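
For example, using the file from this thread (a minimal sketch of that second route):

    from pathlib import Path
    import pvl

    # Decode the latin-1 label text ourselves, then hand pvl the string.
    text = Path("MAG_20141125_DOY329_D001_V1.TAB").read_text(encoding="latin-1")
    label = pvl.loads(text)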

I think there's a limit to how much poor data sanitation pvl should be expected to handle automagically, and that this is beyond it.