mangiucugna / json_repair

A python module to repair invalid JSON, commonly used to parse the output of LLMs

Home Page:https://pypi.org/project/json-repair/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

repair_json incorrectly truncates JSON strings with escaped quotes and commas

mlxyz opened this issue · comments

Describe the bug
The valid JSON {"foo": "bar \"foo\", baz"} gets turned into the broken JSON {"foo": "bar \\"foo"} when using repair_json.
I think this is related to the escaped quotes and comma.

To Reproduce
Steps to reproduce the behavior:

  1. Call repair_json('{"foo": "bar \"foo\", baz"}')
  2. Check the output {"foo": "bar \\"foo"}

Expected behavior
Correct output {"foo": "bar \"foo\", baz"}

so this is a tough one because if you pass that string to python without using r"" the string passed is: {"foo": "bar "foo", baz"} and the result is correct because it's impossible to know in advance if that comma closes the key/value pair or is part of the string.

if you call repair_json(r'{"foo": "bar \"foo\", baz"}') the result is correctly {"foo": "bar \"foo\", baz"}

Ah, got it - thanks!

I thought my problem was related to escaped quotes but apparently it is not and now i can't seem to figure it out:
I would expect {"foo": "bar \"e\", .", "g": ["h:"]]} to get repaired to {"foo": "bar \"e\", .", "g": ["h:"]} but repair_json(r'{"foo": "bar \"e\", .", "g": ["h:"]]}') returns {"foo": "bar \\\"e\\", ",": ": [\"h:"}.

Any idea why?

I hate escaping problems, I will push a new version in a few minutes with what is hopefully a definitive fix