"Self-closing" SGML Tags

Question

aclindsa opened this issue 3 years ago · comments

I am attempting to process some OFX from a financial institution which is choosing to insert "self-closing" tags into the OFX like the following (when there are no investment transactions to report for a given time period):

<INVTRANLIST/>

The header for this file looks like:

OFXHEADER:100^M
DATA:OFXSGML^M
VERSION:102^M
SECURITY:NONE^M
ENCODING:USASCII^M
CHARSET:1252^M
COMPRESSION:NONE^M
OLDFILEUID:NONE^M
NEWFILEUID:NONE^M

When I attempt to parse this with ofxtools using

parser = OFXTree()
parser.parse(filename)
return parser.convert()

I get an exception backtrace on the parser.convert() portion like:

  File "/home/aclindsa/.local/lib/python3.9/site-packages/ofxtools/Parser.py", line 132, in convert
    instance = Aggregate.from_etree(self._root)
  File "/home/aclindsa/.local/lib/python3.9/site-packages/ofxtools/models/base.py", line 200, in from_etree
    instance = SubClass._convert(elem)
  File "/home/aclindsa/.local/lib/python3.9/site-packages/ofxtools/models/base.py", line 287, in _convert
    args, kwargs = functools.reduce(update_args, elem, initial)[:2]
  File "/home/aclindsa/.local/lib/python3.9/site-packages/ofxtools/models/base.py", line 271, in update_args
    value = Aggregate.from_etree(elem)
  File "/home/aclindsa/.local/lib/python3.9/site-packages/ofxtools/models/base.py", line 200, in from_etree
    instance = SubClass._convert(elem)
  File "/home/aclindsa/.local/lib/python3.9/site-packages/ofxtools/models/base.py", line 287, in _convert
    args, kwargs = functools.reduce(update_args, elem, initial)[:2]
  File "/home/aclindsa/.local/lib/python3.9/site-packages/ofxtools/models/base.py", line 271, in update_args
    value = Aggregate.from_etree(elem)
  File "/home/aclindsa/.local/lib/python3.9/site-packages/ofxtools/models/base.py", line 200, in from_etree
    instance = SubClass._convert(elem)
  File "/home/aclindsa/.local/lib/python3.9/site-packages/ofxtools/models/base.py", line 287, in _convert
    args, kwargs = functools.reduce(update_args, elem, initial)[:2]
  File "/home/aclindsa/.local/lib/python3.9/site-packages/ofxtools/models/base.py", line 271, in update_args
    value = Aggregate.from_etree(elem)
  File "/home/aclindsa/.local/lib/python3.9/site-packages/ofxtools/models/base.py", line 200, in from_etree
    instance = SubClass._convert(elem)
  File "/home/aclindsa/.local/lib/python3.9/site-packages/ofxtools/models/base.py", line 287, in _convert
    args, kwargs = functools.reduce(update_args, elem, initial)[:2]
  File "/home/aclindsa/.local/lib/python3.9/site-packages/ofxtools/models/base.py", line 251, in update_args
    raise OFXSpecError(f"{clsnm}.spec = {spec}; doesn't contain {attrname}")
ofxtools.models.base.OFXSpecError: INVSTMTRS.spec = ['dtasof', 'curdef', 'invacctfrom', 'invtranlist', 'invposlist', 'invbal', 'invoolist', 'mktginfo', 'inv401k', 'inv401kbal']; doesn't contain invtranlist/

I'm sure this is my financial institution's fault, but is there a way to work around it in ofxtools?

Chris Singley · Answer 1 · Thu Jul 01 2021 21:17:29 GMT+0800 (China Standard Time)

Try subclassing ofxtools.Parser.TreeBuilder and overriding its regex. Then you can pass an instance of your subclass as the optional parser arg when you call OFXTree.parse()... not my nomenclature, sorry; I was just following F. Lundh's ElementTree API from stdlib. Then your custom regex will be used instead of the default.

<(?P<tag>[A-Z0-9./_ ]+?)>

That's your problem right there. Why did I even allow slashes and spaces in the first place?? The only reason this hasn't blown up before is that XML-style "self-closing" tags aren't valid OFXv1, so we haven't encountered this before.

So yeah, that's probably a bug I should look into as well.

Aaron Lindsay · Answer 2 · Thu Jul 01 2021 21:34:40 GMT+0800 (China Standard Time)

Thanks for the pointers! I'll try that when I get a chance. I haven't looked at how the regex is being used so I apologize if this is way off base, but are you suggesting I modify it to be something like the following (such that the trailing '/' is allowed but not captured in the () group)?

<(?P<tag>[A-Z0-9._ ]+?)/?>

And I think from your answer that you are not interested in having an option in ofxtools to support this behavior. Is that correct?

Chris Singley · Answer 3 · Thu Jul 01 2021 21:55:17 GMT+0800 (China Standard Time)

<(?P<tag>[A-Z0-9._ ]+?)/?>

Yeah something along those lines should do the trick.
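
For readers following along, here is a minimal sketch of what the two patterns quoted above capture, using only the stdlib `re` module. It assumes the quoted fragments are the whole tag pattern (the library's full regex may contain more), and the names `CURRENT_TAG`, `PROPOSED_TAG`, and `captured_tag` are made up for this illustration.

```python
import re

# The two tag patterns quoted in this thread, taken as standalone fragments.
CURRENT_TAG = re.compile(r"<(?P<tag>[A-Z0-9./_ ]+?)>")
PROPOSED_TAG = re.compile(r"<(?P<tag>[A-Z0-9._ ]+?)/?>")


def captured_tag(pattern, markup):
    """Return the 'tag' group if the whole string matches, else None."""
    match = pattern.fullmatch(markup)
    return match.group("tag") if match else None


for markup in ("<INVTRANLIST>", "<INVTRANLIST/>", "</INVTRANLIST>"):
    print(markup, captured_tag(CURRENT_TAG, markup), captured_tag(PROPOSED_TAG, markup))
```

One thing the sketch makes visible: with the slash removed from the character class, the proposed fragment on its own no longer matches a closing tag such as `</INVTRANLIST>`, so any real override would need to account for closing tags some other way.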

And I think from your answer that you are not interested in having an option in ofxtools to support this behavior. Is that correct?

Basically correct, pending my verification that this is indeed invalid (have to trawl through the DTD I think). Does Quicken actually parse this?

If Quicken will take it, and we can find more than one FI who is formatting thus, I suppose there's an argument for putting it into the library's default parser. I don't think it'll damage anything... after all, at first glance the existing regex looks wrong about this stuff and nobody's ever noticed... although it's been many years since I last touched any of this code.

Mark me down as "prejudiced, but willing to be talked down". I would be interested in a regex that works for you in any case.

Chris Singley · Answer 4 · Sat Jul 03 2021 01:29:00 GMT+0800 (China Standard Time)

Looking into this a bit deeper... it's frustrating trying to define a legal character set for OFX tags. Neither the human-readable spec nor the DTD consider the problem in these terms. Essentially the DTD enumerates specific entities as the only legal tags; anything else is illegal.

Then there's this from the OFX spec (section 2.3.1 on SGML compliance):

Open Financial Exchange is not completely SGML-compliant because the specification allows unrecognized tags to be present. Clients and servers must skip over the unrecognized tags. That is, if a client or server does not recognize <XYZ>, it must ignore the tag and its enclosed data.

This would explain why your FI's bogus markup isn't causing Quicken's OFX parser to blow up... they're only using it for empty aggregates, and skipping unrecognized tags like INVTRANLIST/ unintentionally leads to the correct behavior (i.e. ignoring what's intended as an empty list).

ofxtools currently does not comply with this part of the spec. Implementing it would require some changes to ofxtools.models.base.Aggregate.from_etree() and the methods called thereby... in particular Aggregate._convert().

This is a map/reduce type workflow, iterating over each child node and looking up its tag in the class definition in order to perform the type conversion, then adding it to the args used to instantiate the parent.

The fix may be as simple as changing the part of update_args() that's blowing up on you... rather than having it throw an error, have it return the input accum unaltered. That might do the trick, remaining relatively efficient.

Another possibility would be to do it in two passes, first pass filtering for valid tags, second pass doing the reduce to *args and **kwargs just as it does currently.
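
The single-pass idea can be sketched in isolation as follows. This is a toy model, not the ofxtools implementation: `convert`, `SPEC`, and the `(tag, value)` child tuples are hypothetical stand-ins, and `UnknownTagWarning` is defined locally rather than imported from the library. The reducer returns the accumulator unchanged when a tag is not in the spec, instead of raising.

```python
import functools
import warnings


class UnknownTagWarning(UserWarning):
    """Local stand-in warning category for this sketch."""


# Hypothetical, shortened spec for the parent aggregate.
SPEC = ["dtasof", "curdef", "invtranlist", "invposlist"]


def convert(parent, children, spec):
    """Fold (tag, value) children into kwargs, skipping tags not in spec."""

    def update_args(accum, child):
        tag, value = child
        attrname = tag.lower()
        if attrname not in spec:
            # Skip the unrecognized tag: warn and hand back accum unaltered.
            warnings.warn(
                f"While parsing {parent}, encountered unknown tag {tag}; skipping.",
                category=UnknownTagWarning,
            )
            return accum
        return {**accum, attrname: value}

    return functools.reduce(update_args, children, {})


children = [("CURDEF", "USD"), ("INVTRANLIST/", None), ("INVPOSLIST", [])]
print(convert("INVSTMTRS", children, SPEC))
```

The two-pass variant would filter `children` against `spec` first and then run the same reduce without the membership check; the single pass avoids walking the children twice.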

Chris Singley · Answer 5 · Sat Jul 03 2021 20:53:55 GMT+0800 (China Standard Time)

The fix may be as simple as changing the part of update_args() that's blowing up on you... rather than having it throw an error, have it return the input accum unaltered.

That does indeed seem to do the trick without harming anything else. Committed as 8922923.

Do me a favor and test this on your bad data; see if it works for you.

I think the next step is to relax the TreeBuilder.regex to accept absolutely anything (decodable by the declared CHARSET) as a tag name. I think that should bring us into conformity with the OFX spec in this particular.

Aaron Lindsay · Answer 6 · Sun Jul 04 2021 09:32:02 GMT+0800 (China Standard Time)

I got a few minutes to play around with your latest changes tonight, and I think we made progress, but here is what I see now:

/mnt/data/documents/beancount/external/ofxtools/ofxtools/models/base.py:273: UnknownTagWarning: While parsing INVSTMTRS, encountered unknown tag INVTRANLIST/; skipping.
  warnings.warn(msg, category=UnknownTagWarning)
/mnt/data/documents/beancount/external/ofxtools/ofxtools/models/base.py:273: UnknownTagWarning: While parsing INVSTMTMSGSRSV1, encountered unknown tag SECLISTMSGSRSV1; skipping.

And when running code like the following:

parser = OFXTree()
parser.parse(filename)
o = parser.convert()
for statement in o.statements:
    for pos in statement.invposlist:
        pass

I get:

    for pos in statement.invposlist:
TypeError: 'NoneType' object is not iterable

Because of the above message about 'SECLISTMSGSRSV1' being under 'INVSTMTMSGSRSV1' (it isn't) and the missing statement.invposlist (it's present in the OFX), my guess is that the parser is still somehow getting tripped up by this tag. My guess without doing any actual debugging is that it is treating the INVTRANLIST/ as purely an 'open' tag, which throws the nesting of everything else off, since the expected 'closing' tag never occurs.

Here's an anonymized version of the offending OFX file (sorry for the .txt on the end, github didn't let me upload it with an .ofx extension...):
bad.ofx.txt

Chris Singley · Answer 7 · Sun Jul 04 2021 20:13:04 GMT+0800 (China Standard Time)

Yeah I think you're right, playing with your structure here. The parser's trying to add SECLISTMSGSRSV1 as a child of INVSTMTMSGSRSV1 rather than a sibling under OFX. INVTRANLIST/ gets pushed to the tag stack and never popped.

Of course, this means it's going to require more invasive surgery to repair your FI's brain damage. It won't be just a simple matter of swapping out the TreeBuilder.regex; their lexing semantics are wrong. In fact you'll need to leave the regex alone so that you capture the terminal slash, then later test for it & branch off.

Your situation is going to require modifications to TreeBuilder._feedmatch(), basically changing the branch structure from this:

        if tag.startswith("/"):
            if text:
                raise ParseError(f"Tail text '{text}' after <{tag}>")
            logger.debug(f"Popping tag '{tag[1:]}'")
            self.end(tag[1:])
        else:
            self._start(tag, text, closetag)

to something like this:

        if tag.startswith("/"):
            if text:
                raise ParseError(f"Tail text '{text}' after <{tag}>")
            logger.debug(f"Popping tag '{tag[1:]}'")
            self.end(tag[1:])
        elif tag.endswith("/"):
            pass
        else:
            self._start(tag, text, closetag)

So subclass ofxtools.Parser.TreeBuilder, override accordingly, and pass an instance of it into ofxtools.Parser.OFXTree.parse(). See if THAT solves your problem.
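
To see why the extra branch matters, here is a self-contained toy tag-stack parser. It is a sketch under stated assumptions, not the ofxtools TreeBuilder: `build_tree`, `child_names`, and `SAMPLE` are hypothetical, it handles only aggregates with explicit open and close tags, and it pops the stack naively without checking tag names.

```python
import re

TAG = re.compile(r"<(?P<tag>[A-Z0-9./_ ]+?)>")

# Hypothetical, heavily trimmed markup with one "self-closing" tag.
SAMPLE = (
    "<OFX><INVSTMTMSGSRSV1><INVTRANLIST/></INVSTMTMSGSRSV1>"
    "<SECLISTMSGSRSV1></SECLISTMSGSRSV1></OFX>"
)


def build_tree(markup, skip_self_closing):
    """Build a nested dict tree from open/close tags using a tag stack."""
    root = {"tag": "ROOT", "children": []}
    stack = [root]
    for match in TAG.finditer(markup):
        tag = match.group("tag")
        if tag.startswith("/"):
            stack.pop()  # closing tag: pop whatever is on top (naive)
        elif tag.endswith("/") and skip_self_closing:
            continue  # treat <TAG/> as an empty aggregate and ignore it
        else:
            node = {"tag": tag, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)  # opening tag: push until its close arrives
    return root


def child_names(node):
    return [child["tag"] for child in node["children"]]


for skip in (False, True):
    ofx = build_tree(SAMPLE, skip)["children"][0]
    print(skip, child_names(ofx), child_names(ofx["children"][0]))
```

Without the extra branch, `INVTRANLIST/` is pushed and never popped by a matching close, so `SECLISTMSGSRSV1` ends up under `INVSTMTMSGSRSV1`; with it, the two are siblings under `OFX`.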

Aaron Lindsay · Answer 8 · Mon Jul 05 2021 10:38:09 GMT+0800 (China Standard Time)

The solution you described (subclassing TreeBuilder and overriding _feedmatch()) appears to do the trick for me. Thanks for basically doing my work for me on this one!

(I am only hesitating to close this issue because I don't know whether you are happy with the resolution of the parsing relative to the spec that I inadvertently led you to find. I am satisfied with the current resolution.)

Chris Singley · Answer 9 · Mon Jul 05 2021 18:13:42 GMT+0800 (China Standard Time)

No problem. Your pleasure is our business.

I am indeed fine with the changes I pushed to ofxtools.models.base.Aggregate; even if they didn't address your issue, they improve conformance with the spec and don't break anything... indeed, these changes uncovered a few weaknesses of the unit tests that had previously gone undetected.

I'm still curious whether Quicken parses your FI's data - but only idly. Maybe they use sp to parse the OFX body, and self-closing tags are valid SGML? My curiosity does not extend to paying for the ISO spec or the SGML Handbook.

Aaron Lindsay · Answer 10 · Mon Jul 05 2021 20:37:01 GMT+0800 (China Standard Time)

I'm still curious whether Quicken parses your FI's data - but only idly.

Sorry, I saw your question earlier and then entirely forgot to respond to it. But unfortunately (fortunately?) I don't use Quicken or even have access to it.

R. Dennis Steed · Answer 11 · Mon Nov 22 2021 02:30:38 GMT+0800 (China Standard Time)

Not to get too deep into the weeds, but this issue appears to be about a financial institution using the XML shorthand for an element that has no content. (eg <tag />).

The OFX specification specifically disallows elements that have no data (section 1.3.8 in OFX specification 2.3):

1.3.8 Element
An OFX document contains one or more elements. An element is some data bounded by a leading start tag
and a trailing end tag. For example, an element named BAZ, containing data “bar,” looks like this:
<BAZ>bar</BAZ>
An OFX element must contain data (not just white space) and may not contain other elements. This is a
refinement to the XML definition of an element which is more generic. An XML element containing other
elements is defined in OFX as an aggregate. OFX specifically disallows empty elements and elements
with mixed content.

So, not raising an exception is best characterized as enhanced permissive handling of a deviation from the OFX specification, not "improved conformance".

I have no problem with this. I would go nuts if Chrome raised an exception every time I browsed a web page with nonconformant HTML.

However, as a user I think it would be good to add this to the "Deviations from the OFX specification" section of the documentation.

Aaron Lindsay · Answer 12 · Mon Nov 22 2021 03:15:17 GMT+0800 (China Standard Time)

@rdsteed: I believe @csingley's discussion of conformance related to a different aspect than the one you are discussing. His fix in 8922923 serves only to accept unrecognized tags. Notably, the 'conformance' fix he describes does not update ofxtools to accept data-less elements (like what you're describing), and is not sufficient to fix my problem described above.

Instead, I needed to use his suggestion from this comment in my local code to work around my FI's brand of non-conformance (data-less elements without closing tags).

Chris Singley · Answer 13 · Mon Nov 22 2021 05:53:03 GMT+0800 (China Standard Time)

@rdsteed The issue raised here has to do with the OFXv1 spec (i.e. VERSION:102^M); your quote from the OFXv2 spec isn't responsive to the circumstance.

An SGML tag <INVTRANLIST/> is correctly parsed (per the OFXv1 spec) as an INVTRANLIST/, which is an unknown aggregate, and so should be skipped silently; no error should be raised. The relevant reference from section 2.3.1 of the spec is quoted upthread. The commit I made in response to this issue fixes the OFXv1 parser behavior in this regard.

If you find ofxtools parser behavior deviating from the spec, by all means feel free to get down into the weeds - I'm always happy to see people cracking open the spec. Just, y'know, probably open up your issue about it where you're hands-on with the primary sources (usually OFX data in the wild).

@aclindsa - if you're not seething with resentment about my refusal to relax the parser's behavior here, and you've got working processes in place to handle your actual data, you mind closing this ticket?