QuotedString unquote_results doesn't understand escaped whitespace

Question

jakeanq opened this issue a year ago

It seems that when using a QuotedString with unquote_results=True (the default), an escaped backslash followed by a whitespace letter such as n is incorrectly expanded into a whitespace character.

For example:

import pyparsing as pp
print(pp.QuotedString(quoteChar='"', escChar='\\').parse_string(r'"\\n"'))

Actual:

['\\\n']

Expected:

['\\n']

It works fine if I pass unquote_results=False (with the obvious downside of not unquoting the results...):

print(pp.QuotedString(quoteChar='"', escChar='\\', unquote_results=False).parse_string(r'"\\n"'))

gives

['"\\\\n"']

jakeanq · Answer 1 · Mon Mar 20 2023 11:57:23 GMT+0800 (China Standard Time)

Ooops, forgot the version info...

I've replicated this in pyparsing 3.0.9 (Python 3.7/3.10), 3.1.0a1 (Python 3.10) and 2.4.7 (Python 3.7).

Paul McGuire · Answer 2 · Wed Mar 22 2023 12:29:52 GMT+0800 (China Standard Time)

I'll look into this before the next release.

Paul McGuire · Answer 3 · Sat Mar 25 2023 23:28:14 GMT+0800 (China Standard Time)

Just want to confirm that you are not getting tripped up over the representation of backslashes in the output - that output is a backslash followed by a newline:

>>> bslash = "\\"
>>> nl = "\n"
>>> print(repr(bslash + nl))
'\\\n'

Here is more detail on the string returned from parsing with QuotedString:

>>> import pyparsing as pp
>>> res = pp.QuotedString(quoteChar='"', escChar='\\').parse_string(r'"\\n"')
>>> res[0]
'\\\n'
>>> len(res[0])
2

You can also have more control over this by passing convert_whitespace_escapes=False to the QuotedString constructor.
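To make the difference between a string's repr and its actual contents concrete, here is a small standalone sketch in plain Python (no pyparsing needed). The helper name `char_names` is made up for illustration. The three strings are the text between the quotes in the input, the value reported under "Actual", and the value given under "Expected".

```python
def char_names(s):
    """Return the characters of s as a list, naming the hard-to-read ones."""
    names = {"\\": "<BACKSLASH>", "\n": "<NEWLINE>", "\t": "<TAB>"}
    return [names.get(c, c) for c in s]


quoted_body = r"\\n"  # text between the double quotes in the input: 3 characters
reported = "\\\n"     # value shown under "Actual": backslash, newline
expected = "\\n"      # value shown under "Expected": backslash, letter n

for label, value in [
    ("input body", quoted_body),
    ("reported", reported),
    ("expected", expected),
]:
    # repr() doubles every backslash; len() and the character list do not
    print(f"{label:>10}: repr={value!r} len={len(value)} chars={char_names(value)}")
```

Both result strings have length 2, so the disagreement is only about the second character: a real newline versus the letter n.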

Paul McGuire · Answer 4 · Sun Mar 26 2023 06:07:53 GMT+0800 (China Standard Time)

I'm going to add this unit test to testUnit.py:

    def testQuotedStringUnquotesAndConvertWhitespaceEscapes(self):
        test_string = r'"\\n"'
        for test_parameters in (
                (True, True, ['\\\n'], 2, '\\', '\n'),
                (True, False, ['\\n'], 2, '\\', 'n'),
                (False, False, ['"\\\\n"'], 5, '"', '\\'),
        ):
            unquote_results, convert_ws_escapes, expected_list, expected_len, exp0, exp1 = test_parameters
            with self.subTest(f"Testing with parameters {test_parameters}"):
                qs_expr = pp.QuotedString(
                        quoteChar='"',
                        escChar='\\',
                        unquote_results=unquote_results,
                        convert_whitespace_escapes=convert_ws_escapes
                    )
                self.assertParseAndCheckList(
                    qs_expr,
                    test_string,
                    expected_list
                )

                result = qs_expr.parse_string(test_string)
                # display individual characters
                print(list(result[0]))

                self.assertEqual(expected_len, len(result[0]))
                self.assertEqual(exp0, result[0][0])
                self.assertEqual(exp1, result[0][1])
                print()

which currently gives these results:

['\\\n']
['\\', '\n']

['\\n']
['\\', 'n']

['"\\\\n"']
['"', '\\', '\\', 'n', '"']

I'm pretty sure these are the desired results.

jakeanq · Answer 5 · Sun Mar 26 2023 10:21:25 GMT+0800 (China Standard Time)

To confirm: with convert_whitespace_escapes=True and unquote_results=True, I was expecting the string "\\n" to parse to a single backslash followed by an n character. That case isn't covered by the test; it would correspond to a test parameters entry of

(True, True, ['\\n'], 2, '\\', 'n')

EDIT: I messed up the backslashes the first time around...

Paul McGuire · Answer 6 · Sun Mar 26 2023 22:37:45 GMT+0800 (China Standard Time)

I've redone the test to make the expected results for each case clearer, and added two other test strings. I've made the input strings as explicit as I could by using f-strings - you can check that they are equivalent to the r-strings in the respective comments. There are no (False, True) cases because if we are not unquoting, then we don't try to convert the embedded whitespace.

There is no (True, True, test_string_0, [backslash, "n"]) case because that is not how unquoting with whitespace conversion works. To get the behavior you are looking for, you need to pass convert_whitespace_escapes=False, as demonstrated in the (True, False, test_string_0) case.

    def testQuotedStringUnquotesAndConvertWhitespaceEscapes(self):
        #fmt: off
        backslash = chr(92)  # a single backslash
        tab = "\t"
        newline = "\n"
        test_string_0 = f'"{backslash}{backslash}n"'              # r"\\n"
        test_string_1 = f'"{backslash}t{backslash}{backslash}n"'  # r"\t\\n"
        test_string_2 = f'"a{backslash}tb"'                       # r"a\tb"
        T, F = True, False  # these make the test cases format nicely
        for test_parameters in (
                # Parameters are the arguments for creating a QuotedString
                # and the expected parsed list of characters:
                # - unquote_results
                # - convert_whitespace_escapes
                # - test string
                # - expected parsed characters, broken out as separate
                #   list items (all those doubled backslashes make it
                #   difficult to interpret the output)
                (T, T, test_string_0, [backslash, newline]),
                (T, F, test_string_0, [backslash, "n"]),
                (F, F, test_string_0, ['"', backslash, backslash, "n", '"']),
                (T, T, test_string_1, [tab, backslash, newline]),
                (T, F, test_string_1, ["t", backslash, "n"]),
                (F, F, test_string_1, ['"', backslash, "t", backslash, backslash, "n", '"']),
                (T, T, test_string_2, ["a", tab, "b"]),
                (T, F, test_string_2, ["a", "t", "b"]),
                (F, F, test_string_2, ['"', "a", backslash, "t", "b", '"']),
        ):
            unquote_results, convert_ws_escapes, test_string, expected_list = test_parameters
            with self.subTest(msg=f"Testing with parameters {test_parameters}"):
                print(f"unquote_results: {unquote_results}"
                      f"\nconvert_whitespace_escapes: {convert_ws_escapes}")
                qs_expr = pp.QuotedString(
                        quoteChar='"',
                        escChar='\\',
                        unquote_results=unquote_results,
                        convert_whitespace_escapes=convert_ws_escapes
                    )
                result = qs_expr.parse_string(test_string)

                # do this instead of assertParseAndCheckList to explicitly
                # check and display the separate items in the list
                print("Results:")
                control_chars = {newline: "<NEWLINE>", backslash: "<BACKSLASH>", tab: "<TAB>"}
                print(f"[{', '.join(control_chars.get(c, repr(c)) for c in result[0])}]")
                self.assertEqual(expected_list, list(result[0]))

                print()
        #fmt: on

With these results:

unquote_results: True
convert_whitespace_escapes: True
Results:
[<BACKSLASH>, <NEWLINE>]

unquote_results: True
convert_whitespace_escapes: False
Results:
[<BACKSLASH>, 'n']

unquote_results: False
convert_whitespace_escapes: False
Results:
['"', <BACKSLASH>, <BACKSLASH>, 'n', '"']

unquote_results: True
convert_whitespace_escapes: True
Results:
[<TAB>, <BACKSLASH>, <NEWLINE>]

unquote_results: True
convert_whitespace_escapes: False
Results:
['t', <BACKSLASH>, 'n']

unquote_results: False
convert_whitespace_escapes: False
Results:
['"', <BACKSLASH>, 't', <BACKSLASH>, <BACKSLASH>, 'n', '"']

unquote_results: True
convert_whitespace_escapes: True
Results:
['a', <TAB>, 'b']

unquote_results: True
convert_whitespace_escapes: False
Results:
['a', 't', 'b']

unquote_results: False
convert_whitespace_escapes: False
Results:
['"', 'a', <BACKSLASH>, 't', 'b', '"']
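One way to read the (True, True) rows above is as the outcome of two separate passes over the unquoted text: whitespace escapes are replaced first, and escape characters are stripped afterwards. The sketch below is not pyparsing's code. `two_pass_unquote` and `WS_ESCAPES` are hypothetical names, and the regular expression is an assumption chosen so that the sketch reproduces the results listed above.

```python
import re

# Two-character escape literals and the whitespace characters they stand for
WS_ESCAPES = ((r"\t", "\t"), (r"\n", "\n"), (r"\f", "\f"), (r"\r", "\r"))


def two_pass_unquote(body, convert_whitespace_escapes=True):
    """Unquote `body` (the text between the quotes) in two passes."""
    if convert_whitespace_escapes:
        # Pass 1: blind text replacement. It cannot tell whether the
        # backslash in front of the letter was itself escaped.
        for literal, char in WS_ESCAPES:
            body = body.replace(literal, char)
    # Pass 2: drop each escape backslash and keep the character after it.
    # "." does not match a newline, so a backslash that now sits in front
    # of a converted newline is left in place.
    return re.sub(r"\\(.)", r"\1", body)


print(list(two_pass_unquote(r"\\n")))         # backslash, newline
print(list(two_pass_unquote(r"\\n", False)))  # backslash, letter n
```

Under these assumptions, the input \\n loses its escaped-backslash meaning in the first pass, because the replacement sees the second backslash and the n as a newline escape.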

jakeanq · Answer 7 · Mon Mar 27 2023 11:35:05 GMT+0800 (China Standard Time)

That's a good idea making constants, very clear now!

For this case:

(T, T, test_string_0, [backslash, newline]),

I would still expect the correct results to be [backslash, "n"] (identical to the convert_whitespace_escapes=False case of test_string_0) as the backslash is escaped so there is no newline in the input.

Similarly for:

(T, T, test_string_1, [tab, backslash, newline]),

I would expect the output to be [tab, backslash, "n"] for the same reason.

Paul McGuire · Answer 8 · Tue Mar 28 2023 14:23:51 GMT+0800 (China Standard Time)

Ok, I'm coming around to these changes. Here is the new set of tests:

    def testQuotedStringUnquotesAndConvertWhitespaceEscapes(self):
        #fmt: off
        backslash = chr(92)  # a single backslash
        tab = "\t"
        newline = "\n"
        test_string_0 = f'"{backslash}{backslash}n"'              # r"\\n"
        test_string_1 = f'"{backslash}t{backslash}{backslash}n"'  # r"\t\\n"
        test_string_2 = f'"a{backslash}tb"'                       # r"a\tb"
        test_string_3 = f'"{backslash}{backslash}{backslash}n"'   # r"\\\n"
        T, F = True, False  # these make the test cases format nicely
        for test_parameters in (
                # Parameters are the arguments for creating a QuotedString
                # and the expected parsed list of characters:
                # - unquote_results
                # - convert_whitespace_escapes
                # - test string
                # - expected parsed characters, broken out as separate
                #   list items (all those doubled backslashes make it
                #   difficult to interpret the output)
                (T, T, test_string_0, [backslash, "n"]),
                (T, F, test_string_0, [backslash, "n"]),
                (F, F, test_string_0, ['"', backslash, backslash, "n", '"']),
                (T, T, test_string_1, [tab, backslash, "n"]),
                (T, F, test_string_1, ["t", backslash, "n"]),
                (F, F, test_string_1, ['"', backslash, "t", backslash, backslash, "n", '"']),
                (T, T, test_string_2, ["a", tab, "b"]),
                (T, F, test_string_2, ["a", "t", "b"]),
                (F, F, test_string_2, ['"', "a", backslash, "t", "b", '"']),
                (T, T, test_string_3, [backslash, newline]),
                (T, F, test_string_3, [backslash, "n"]),
                (F, F, test_string_3, ['"', backslash, backslash, backslash, "n", '"']),
        ):

with these results:

Testing with parameters (True, True, '"\\\\n"', ['\\', 'n'])
unquote_results: True
convert_whitespace_escapes: True
Results:
[<BACKSLASH>, 'n']

Testing with parameters (True, False, '"\\\\n"', ['\\', 'n'])
unquote_results: True
convert_whitespace_escapes: False
Results:
[<BACKSLASH>, 'n']

Testing with parameters (False, False, '"\\\\n"', ['"', '\\', '\\', 'n', '"'])
unquote_results: False
convert_whitespace_escapes: False
Results:
['"', <BACKSLASH>, <BACKSLASH>, 'n', '"']

Testing with parameters (True, True, '"\\t\\\\n"', ['\t', '\\', 'n'])
unquote_results: True
convert_whitespace_escapes: True
Results:
[<TAB>, <BACKSLASH>, 'n']

Testing with parameters (True, False, '"\\t\\\\n"', ['t', '\\', 'n'])
unquote_results: True
convert_whitespace_escapes: False
Results:
['t', <BACKSLASH>, 'n']

Testing with parameters (False, False, '"\\t\\\\n"', ['"', '\\', 't', '\\', '\\', 'n', '"'])
unquote_results: False
convert_whitespace_escapes: False
Results:
['"', <BACKSLASH>, 't', <BACKSLASH>, <BACKSLASH>, 'n', '"']

Testing with parameters (True, True, '"a\\tb"', ['a', '\t', 'b'])
unquote_results: True
convert_whitespace_escapes: True
Results:
['a', <TAB>, 'b']

Testing with parameters (True, False, '"a\\tb"', ['a', 't', 'b'])
unquote_results: True
convert_whitespace_escapes: False
Results:
['a', 't', 'b']

Testing with parameters (False, False, '"a\\tb"', ['"', 'a', '\\', 't', 'b', '"'])
unquote_results: False
convert_whitespace_escapes: False
Results:
['"', 'a', <BACKSLASH>, 't', 'b', '"']

Testing with parameters (True, True, '"\\\\\\n"', ['\\', '\n'])
unquote_results: True
convert_whitespace_escapes: True
Results:
[<BACKSLASH>, <NEWLINE>]

Testing with parameters (True, False, '"\\\\\\n"', ['\\', 'n'])
unquote_results: True
convert_whitespace_escapes: False
Results:
[<BACKSLASH>, 'n']

Testing with parameters (False, False, '"\\\\\\n"', ['"', '\\', '\\', '\\', 'n', '"'])
unquote_results: False
convert_whitespace_escapes: False
Results:
['"', <BACKSLASH>, <BACKSLASH>, <BACKSLASH>, 'n', '"']

This is a slightly breaking change, but I feel that this logic is more intuitive. Instead of converting whitespace markers first and then going back to process escapes, the code now works left to right through the quoted string contents, using a little state machine to process each backslash and whatever character follows it.
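The left-to-right processing described above can be sketched as a single loop with one flag. This is a minimal illustration and not pyparsing's implementation. `left_to_right_unquote` and `WS_LETTERS` are hypothetical names, and tying whitespace conversion to the escape character is an assumption made to keep the sketch short.

```python
# Letters that form a whitespace escape when they follow the escape character
WS_LETTERS = {"t": "\t", "n": "\n", "f": "\f", "r": "\r"}


def left_to_right_unquote(body, esc_char="\\", convert_whitespace_escapes=True):
    """Unquote `body` (the text between the quotes) in one left-to-right pass."""
    out = []
    escaped = False  # True when the previous character was an unconsumed escape
    for ch in body:
        if escaped:
            if convert_whitespace_escapes and ch in WS_LETTERS:
                out.append(WS_LETTERS[ch])  # e.g. escape + "n" becomes a newline
            else:
                out.append(ch)  # any other escaped character stands for itself
            escaped = False
        elif ch == esc_char:
            escaped = True  # consume the escape and wait for the next character
        else:
            out.append(ch)
    if escaped:
        # Assumption: a dangling escape at the very end is kept literally
        out.append(esc_char)
    return "".join(out)


print(list(left_to_right_unquote(r"\\n")))    # backslash, letter n
print(list(left_to_right_unquote(r"\\\n")))   # backslash, newline
```

Because each backslash is consumed together with the character after it, an escaped backslash can no longer combine with a following n, which matches the expected lists in the revised test cases.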

jakeanq · Answer 9 · Tue Mar 28 2023 16:20:08 GMT+0800 (China Standard Time)

Nice! As far as I can see that all looks like what I'd expect :)