[Bug]: Strings containing unescaped quotes followed by commas are incorrectly truncated
bwest2397 opened this issue
Version of the library
0.19.2
Describe the bug
When a string contains an unescaped quoted phrase and a comma appears later in the same string, the string is truncated after the second " character of that quoted phrase.
If the string is the last value in the JSON object and is not immediately followed by } (i.e. it is followed by whitespace or, e.g., a comma), the final word of the string is additionally parsed as a key with an empty string value.
This seems related to #44, but the attempted fix for that report does not appear to have fully resolved it.
How to reproduce
(Note: I've formatted the recovered JSON output to make it more readable.)
For
>>> repair_json('{"lorem": "Lorem "ipsum" excepteur sint, suntid est laborum"}')
the recovered JSON is:
{
"lorem": "Lorem \"ipsum"
}
For any of the following examples
>>> repair_json('{"lorem": "Lorem "ipsum" excepteur sint, suntid est laborum" }')
>>> repair_json('{"lorem": "Lorem "ipsum" excepteur sint, suntid est laborum"\n}')
>>> repair_json('{"lorem": "Lorem "ipsum" excepteur sint, suntid est laborum",}')
the recovered JSON is:
{
"lorem": "Lorem \"ipsum",
"laborum": ""
}
With the comma removed, the output matches what we'd expect:
>>> repair_json('{"lorem": "Lorem "ipsum" excepteur sint suntid est laborum"}')
>>> repair_json('{"lorem": "Lorem "ipsum" excepteur sint suntid est laborum" }')
yields
{
"lorem": "Lorem \"ipsum\" excepteur sint suntid est laborum"
}
Expected behavior
>>> print(repair_json('{"lorem": "Lorem "ipsum" excepteur sint, suntid est laborum"}'))
{"lorem": "Lorem \"ipsum\" excepteur sint, suntid est laborum"}
>>> print(repair_json('{"lorem": "Lorem "ipsum" excepteur sint, suntid est laborum" }'))
{"lorem": "Lorem \"ipsum\" excepteur sint, suntid est laborum"}
This was tough because the library is actually behaving as designed. I found a workaround, which I am releasing now, but it is an unstable equilibrium when it comes to wrong delimiters, because there are a million corner cases that can go wrong. Nonetheless, the solution I found seems to work and passes all tests.
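To illustrate why wrong delimiters have so many corner cases, here is a deliberately naive lookahead rule. It is not the library's implementation; `looks_like_closing_quote` and `quote_guesses` are hypothetical helpers, and the rule itself (a quote closes the string if the next non-whitespace character is structural) is an assumption chosen for the example.

```python
def looks_like_closing_quote(text: str, i: int) -> bool:
    """Guess whether the double quote at text[i] ends a string value.

    Hypothetical rule: the quote is treated as closing if the next
    non-whitespace character is structural (, } ] :) or the input ends.
    """
    rest = text[i + 1:].lstrip()
    return rest == "" or rest[0] in ",}]:"


def quote_guesses(text: str) -> list[tuple[int, bool]]:
    """Apply the guess to every double quote in text, in order."""
    return [
        (i, looks_like_closing_quote(text, i))
        for i, ch in enumerate(text)
        if ch == '"'
    ]


# The rule copes with the string from this report: only the last quote
# is followed by a structural character.
report_value = '"Lorem "ipsum" excepteur sint, suntid est laborum"}'
print([closing for _, closing in quote_guesses(report_value)])

# It misfires as soon as an inner quote is directly followed by a comma:
# the quote after hi is guessed to be the end of the string.
tricky_value = '"say "hi", then leave"}'
print([closing for _, closing in quote_guesses(tricky_value)])
```

Any fixed rule of this kind trades one class of misreadings for another, which is the sense in which a workaround for wrong delimiters stays fragile.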
Awesome, thanks! I tested the new release with some samples I had and they seem to work 👍