pyparsing / pyparsing

Python library for creating PEG parsers

QuotedString unquote_results doesn't understand escaped whitespace

jakeanq opened this issue · comments

It seems that a QuotedString with unquote_results=True (the default) incorrectly expands whitespace escapes whose backslash is itself escaped.

For example:

import pyparsing as pp
print(pp.QuotedString(quoteChar='"', escChar='\\').parse_string(r'"\\n"'))

Actual:

['\\\n']

Expected:

['\\n']

It works fine if I pass unquote_results=False (with the obvious downside of not unquoting the results...):

print(pp.QuotedString(quoteChar='"', escChar='\\', unquote_results=False).parse_string(r'"\\n"'))

gives

['"\\\\n"']

Oops, forgot the version info...

I've replicated this in pyparsing 3.0.9 (Python 3.7/3.10), 3.1.0a1 (Python 3.10) and 2.4.7 (Python 3.7).

I'll look into this before the next release.

Just want to confirm that you are not getting tripped up over the representation of backslashes in the output - that output is a backslash followed by a newline:

>>> bslash = "\\"
>>> nl = "\n"
>>> print(repr(bslash + nl))
'\\\n'

Here is more detail on the string returned from parsing with QuotedString:

>>> import pyparsing as pp
>>> res = pp.QuotedString(quoteChar='"', escChar='\\').parse_string(r'"\\n"')
>>> res[0]
'\\\n'
>>> len(res[0])
2

You can also have more control over this by passing convert_whitespace_escapes=False to the QuotedString constructor.

I'm going to add this unit test to testUnit.py:

    def testQuotedStringUnquotesAndConvertWhitespaceEscapes(self):
        test_string = r'"\\n"'
        for test_parameters in (
                (True, True, ['\\\n'], 2, '\\', '\n'),
                (True, False, ['\\n'], 2, '\\', 'n'),
                (False, False, ['"\\\\n"'], 5, '"', '\\'),
        ):
            unquote_results, convert_ws_escapes, expected_list, expected_len, exp0, exp1 = test_parameters
            with self.subTest(f"Testing with parameters {test_parameters}"):
                qs_expr = pp.QuotedString(
                        quoteChar='"',
                        escChar='\\',
                        unquote_results=unquote_results,
                        convert_whitespace_escapes=convert_ws_escapes
                    )
                self.assertParseAndCheckList(
                    qs_expr,
                    test_string,
                    expected_list
                )

                result = qs_expr.parse_string(test_string)
                # display individual characters
                print(list(result[0]))

                self.assertEqual(expected_len, len(result[0]))
                self.assertEqual(exp0, result[0][0])
                self.assertEqual(exp1, result[0][1])
                print()

which currently gives these results:

['\\\n']
['\\', '\n']

['\\n']
['\\', 'n']

['"\\\\n"']
['"', '\\', '\\', 'n', '"']

I'm pretty sure these are the desired results.

To confirm, I was expecting that parsing the string "\\n" with convert_whitespace_escapes=True and unquote_results=True would give a single backslash followed by an n character. That test case doesn't cover this; it would correspond to a test parameters entry of

(True, True, ['\\n'], 2, '\\', 'n')

EDIT: I messed up the backslashes the first time around...

I've redone the test to make the expected results for each case clearer, and added two other test strings. I've made the input strings as explicit as I could by using f-strings - you can check that they are equivalent to the r-strings in the respective comments. There are no (False, True) cases because if we are not unquoting, then we don't try to convert the embedded whitespace.

There is no (True, True, test_string_0, [backslash, "n"]) case because that is not how unquoting with whitespace conversion works. To get the behavior you are looking for, you need to pass convert_whitespace_escapes=False, as demonstrated in the (True, False, test_string_0) case.
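As a quick sanity check of the claim that the f-strings match the r-strings in the comments, something like the following can be run on its own. It is only a sketch: the test strings include the surrounding double quotes, so the raw-string equivalents are written here with single-quote delimiters, which is an assumption about what the comments are meant to show.

```python
backslash = chr(92)  # a single backslash

test_string_0 = f'"{backslash}{backslash}n"'
test_string_1 = f'"{backslash}t{backslash}{backslash}n"'
test_string_2 = f'"a{backslash}tb"'

# The raw strings use single-quote delimiters so that the double quotes
# are part of the string, just as they are in the f-strings.
assert test_string_0 == r'"\\n"'
assert test_string_1 == r'"\t\\n"'
assert test_string_2 == r'"a\tb"'

# Show the length and the individual characters of each test string.
for s in (test_string_0, test_string_1, test_string_2):
    print(len(s), list(s))
```

None of these strings contains a real tab or newline character; every backslash is a literal character in the input.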

    def testQuotedStringUnquotesAndConvertWhitespaceEscapes(self):
        #fmt: off
        backslash = chr(92)  # a single backslash
        tab = "\t"
        newline = "\n"
        test_string_0 = f'"{backslash}{backslash}n"'              # r"\\n"
        test_string_1 = f'"{backslash}t{backslash}{backslash}n"'  # r"\t\\n"
        test_string_2 = f'"a{backslash}tb"'                       # r"a\tb"
        T, F = True, False  # these make the test cases format nicely
        for test_parameters in (
                # Parameters are the arguments for creating a QuotedString
                # and the expected parsed list of characters:
                # - unquote_results
                # - convert_whitespace_escapes
                # - test string
                # - expected parsed characters, broken out as separate
                #   list items (all those doubled backslashes make the
                #   output difficult to interpret)
                (T, T, test_string_0, [backslash, newline]),
                (T, F, test_string_0, [backslash, "n"]),
                (F, F, test_string_0, ['"', backslash, backslash, "n", '"']),
                (T, T, test_string_1, [tab, backslash, newline]),
                (T, F, test_string_1, ["t", backslash, "n"]),
                (F, F, test_string_1, ['"', backslash, "t", backslash, backslash, "n", '"']),
                (T, T, test_string_2, ["a", tab, "b"]),
                (T, F, test_string_2, ["a", "t", "b"]),
                (F, F, test_string_2, ['"', "a", backslash, "t", "b", '"']),
        ):
            unquote_results, convert_ws_escapes, test_string, expected_list = test_parameters
            with self.subTest(msg=f"Testing with parameters {test_parameters}"):
                print(f"unquote_results: {unquote_results}"
                      f"\nconvert_whitespace_escapes: {convert_ws_escapes}")
                qs_expr = pp.QuotedString(
                        quoteChar='"',
                        escChar='\\',
                        unquote_results=unquote_results,
                        convert_whitespace_escapes=convert_ws_escapes
                    )
                result = qs_expr.parse_string(test_string)

                # do this instead of assertParseAndCheckList to explicitly
                # check and display the separate items in the list
                print("Results:")
                control_chars = {newline: "<NEWLINE>", backslash: "<BACKSLASH>", tab: "<TAB>"}
                print(f"[{', '.join(control_chars.get(c, repr(c)) for c in result[0])}]")
                self.assertEqual(expected_list, list(result[0]))

                print()
        #fmt: on

With these results:

unquote_results: True
convert_whitespace_escapes: True
Results:
[<BACKSLASH>, <NEWLINE>]

unquote_results: True
convert_whitespace_escapes: False
Results:
[<BACKSLASH>, 'n']

unquote_results: False
convert_whitespace_escapes: False
Results:
['"', <BACKSLASH>, <BACKSLASH>, 'n', '"']

unquote_results: True
convert_whitespace_escapes: True
Results:
[<TAB>, <BACKSLASH>, <NEWLINE>]

unquote_results: True
convert_whitespace_escapes: False
Results:
['t', <BACKSLASH>, 'n']

unquote_results: False
convert_whitespace_escapes: False
Results:
['"', <BACKSLASH>, 't', <BACKSLASH>, <BACKSLASH>, 'n', '"']

unquote_results: True
convert_whitespace_escapes: True
Results:
['a', <TAB>, 'b']

unquote_results: True
convert_whitespace_escapes: False
Results:
['a', 't', 'b']

unquote_results: False
convert_whitespace_escapes: False
Results:
['"', 'a', <BACKSLASH>, 't', 'b', '"']

That's a good idea making constants, very clear now!

For this case:

(T, T, test_string_0, [backslash, newline]),

I would still expect the correct results to be [backslash, "n"] (identical to the convert_whitespace_escapes=False case of test_string_0) as the backslash is escaped so there is no newline in the input.

Similarly for:

(T, T, test_string_1, [tab, backslash, newline]),

I would expect the output to be [tab, backslash, "n"] for the same reason.
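One way to illustrate this reading is to split the quoted contents, left to right, into escape pairs and plain characters. The helper below is a hypothetical sketch for discussion (escape_tokens is not part of pyparsing); it only shows which character each backslash is attached to.

```python
import re


def escape_tokens(body, esc="\\"):
    """Split quoted-string contents into escape pairs and plain characters.

    Hypothetical helper for illustration only, not pyparsing code.
    """
    esc_re = re.escape(esc)
    # Either the escape character plus whatever follows it, or any
    # single character that is not the escape character.
    pattern = esc_re + r".|[^" + esc_re + r"]"
    return re.findall(pattern, body, flags=re.DOTALL)


# Contents of test_string_0, test_string_1 and test_string_2 (quotes removed).
for body in (r"\\n", r"\t\\n", r"a\tb"):
    print(escape_tokens(body))
```

For the contents of test_string_0 the tokens are an escaped backslash followed by a plain n, so under this reading no backslash-n pair is left to convert into a newline.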

Ok, I'm coming around to these changes. Here is the new set of tests:

    def testQuotedStringUnquotesAndConvertWhitespaceEscapes(self):
        #fmt: off
        backslash = chr(92)  # a single backslash
        tab = "\t"
        newline = "\n"
        test_string_0 = f'"{backslash}{backslash}n"'              # r"\\n"
        test_string_1 = f'"{backslash}t{backslash}{backslash}n"'  # r"\t\\n"
        test_string_2 = f'"a{backslash}tb"'                       # r"a\tb"
        test_string_3 = f'"{backslash}{backslash}{backslash}n"'   # r"\\\n"
        T, F = True, False  # these make the test cases format nicely
        for test_parameters in (
                # Parameters are the arguments for creating a QuotedString
                # and the expected parsed list of characters:
                # - unquote_results
                # - convert_whitespace_escapes
                # - test string
                # - expected parsed characters, broken out as separate
                #   list items (all those doubled backslashes make the
                #   output difficult to interpret)
                (T, T, test_string_0, [backslash, "n"]),
                (T, F, test_string_0, [backslash, "n"]),
                (F, F, test_string_0, ['"', backslash, backslash, "n", '"']),
                (T, T, test_string_1, [tab, backslash, "n"]),
                (T, F, test_string_1, ["t", backslash, "n"]),
                (F, F, test_string_1, ['"', backslash, "t", backslash, backslash, "n", '"']),
                (T, T, test_string_2, ["a", tab, "b"]),
                (T, F, test_string_2, ["a", "t", "b"]),
                (F, F, test_string_2, ['"', "a", backslash, "t", "b", '"']),
                (T, T, test_string_3, [backslash, newline]),
                (T, F, test_string_3, [backslash, "n"]),
                (F, F, test_string_3, ['"', backslash, backslash, backslash, "n", '"']),
        ):

with these results:

Testing with parameters (True, True, '"\\\\n"', ['\\', 'n'])
unquote_results: True
convert_whitespace_escapes: True
Results:
[<BACKSLASH>, 'n']

Testing with parameters (True, False, '"\\\\n"', ['\\', 'n'])
unquote_results: True
convert_whitespace_escapes: False
Results:
[<BACKSLASH>, 'n']

Testing with parameters (False, False, '"\\\\n"', ['"', '\\', '\\', 'n', '"'])
unquote_results: False
convert_whitespace_escapes: False
Results:
['"', <BACKSLASH>, <BACKSLASH>, 'n', '"']

Testing with parameters (True, True, '"\\t\\\\n"', ['\t', '\\', 'n'])
unquote_results: True
convert_whitespace_escapes: True
Results:
[<TAB>, <BACKSLASH>, 'n']

Testing with parameters (True, False, '"\\t\\\\n"', ['t', '\\', 'n'])
unquote_results: True
convert_whitespace_escapes: False
Results:
['t', <BACKSLASH>, 'n']

Testing with parameters (False, False, '"\\t\\\\n"', ['"', '\\', 't', '\\', '\\', 'n', '"'])
unquote_results: False
convert_whitespace_escapes: False
Results:
['"', <BACKSLASH>, 't', <BACKSLASH>, <BACKSLASH>, 'n', '"']

Testing with parameters (True, True, '"a\\tb"', ['a', '\t', 'b'])
unquote_results: True
convert_whitespace_escapes: True
Results:
['a', <TAB>, 'b']

Testing with parameters (True, False, '"a\\tb"', ['a', 't', 'b'])
unquote_results: True
convert_whitespace_escapes: False
Results:
['a', 't', 'b']

Testing with parameters (False, False, '"a\\tb"', ['"', 'a', '\\', 't', 'b', '"'])
unquote_results: False
convert_whitespace_escapes: False
Results:
['"', 'a', <BACKSLASH>, 't', 'b', '"']

Testing with parameters (True, True, '"\\\\\\n"', ['\\', '\n'])
unquote_results: True
convert_whitespace_escapes: True
Results:
[<BACKSLASH>, <NEWLINE>]

Testing with parameters (True, False, '"\\\\\\n"', ['\\', 'n'])
unquote_results: True
convert_whitespace_escapes: False
Results:
[<BACKSLASH>, 'n']

Testing with parameters (False, False, '"\\\\\\n"', ['"', '\\', '\\', '\\', 'n', '"'])
unquote_results: False
convert_whitespace_escapes: False
Results:
['"', <BACKSLASH>, <BACKSLASH>, <BACKSLASH>, 'n', '"']

This is a slightly breaking change, but I feel that this logic is more intuitive. Instead of converting whitespace markers first and then going back to process escapes, the code now works left to right through the quoted string contents, using a small state machine to handle each backslash and whatever character follows it.
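To make the difference between the two orderings concrete, here is a minimal standalone sketch. It is not the actual pyparsing code: two_pass_unescape and left_to_right_unescape are hypothetical names, the set of whitespace escapes (t, n, f, r) is an assumption, and both functions work on the contents between the quotes only.

```python
import re

# Assumed set of whitespace escapes, for illustration only.
WS_ESCAPES = {"t": "\t", "n": "\n", "f": "\f", "r": "\r"}


def two_pass_unescape(body, esc="\\", convert_ws=True):
    """Model of the old order: whitespace markers first, then escapes."""
    if convert_ws:
        for letter, ws_char in WS_ESCAPES.items():
            body = body.replace(esc + letter, ws_char)
    # '.' does not match a newline, so a backslash sitting in front of a
    # newline inserted by the first pass survives this second pass.
    return re.sub(re.escape(esc) + "(.)", r"\1", body)


def left_to_right_unescape(body, esc="\\", convert_ws=True):
    """Model of the new order: one left-to-right pass with a small state machine."""
    out = []
    escaped = False  # True when the previous character was an escape char
    for ch in body:
        if escaped:
            # The escape applies to this character only.
            out.append(WS_ESCAPES.get(ch, ch) if convert_ws else ch)
            escaped = False
        elif ch == esc:
            escaped = True
        else:
            out.append(ch)
    return "".join(out)  # a trailing lone escape char is dropped in this sketch


body = chr(92) * 2 + "n"  # contents of test_string_0: backslash, backslash, n
print(list(two_pass_unescape(body)))       # backslash, newline
print(list(left_to_right_unescape(body)))  # backslash, 'n'
```

With the contents of test_string_0, the two-pass model reproduces the earlier [backslash, newline] result, while the left-to-right model gives [backslash, "n"], matching the new expected values above.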

Nice! As far as I can see that all looks like what I'd expect :)