noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model

Empty list cannot be closed after a newline

AriX opened this issue

Thanks for your awesome work on this project!

I'm seeing an issue with JSON like the following:

{
  "num" : 1,
  "list_of_strings" : [
  ]
}

In particular, if a newline is generated after the array's opening [, lm-format-enforcer will not allow the list to be closed with a ]. It appears that:

  • When the [ character is parsed, a UnionParser is added to the stack with a StringParsingState and a ForceStopParser
  • When the newline character is parsed, the UnionParser decides that only StringParsingState can accept newlines, and therefore dissolves itself, leaving only the StringParsingState on the stack and dropping the ForceStopParser
  • Without ForceStopParser on the stack, JsonSchemaParser's allowedCharacters implementation does not evaluate any parsers on the stack above StringParsingState, because StringParsingState returns False for canEnd()
  • Therefore, allowedCharacters does not include ] and the list cannot be closed
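
To make the sequence above concrete, here is a minimal toy model of it. It does not use lm-format-enforcer itself: the classes ToyStringParser, ToyForceStop and ToyUnion, and the helper allowed_after_list_open, are hypothetical stand-ins I made up to mirror the behavior described in the bullets, so the real implementation may differ in detail.

```python
WHITESPACE = " \t\n\r"


class ToyStringParser:
    """Stand-in for StringParsingState: wants a quoted string, cannot end yet."""

    def allowed(self):
        return set('"' + WHITESPACE)

    def can_end(self):
        return False


class ToyForceStop:
    """Stand-in for ForceStopParser: accepts no characters, can end at once."""

    def allowed(self):
        return set()

    def can_end(self):
        return True


class ToyUnion:
    """Stand-in for UnionParser: tracks several alternatives at once."""

    def __init__(self, parsers):
        self.parsers = parsers

    def allowed(self):
        return set().union(*(p.allowed() for p in self.parsers))

    def can_end(self):
        return any(p.can_end() for p in self.parsers)

    def add_character(self, ch):
        survivors = [p for p in self.parsers if ch in p.allowed()]
        # With exactly one survivor, the union "dissolves" into it.
        return survivors[0] if len(survivors) == 1 else ToyUnion(survivors)


def allowed_after_list_open(top):
    """Characters allowed while `top` sits on the stack above an open list."""
    chars = set(top.allowed())
    if top.can_end():
        # The enclosing list is only consulted when the top parser can end.
        chars.add("]")
    return chars


top = ToyUnion([ToyStringParser(), ToyForceStop()])
before = allowed_after_list_open(top)   # state right after "["
top = top.add_character("\n")           # the union dissolves here
after = allowed_after_list_open(top)    # state right after "[\n"

print("]" in before, "]" in after)      # prints: True False
```

In this sketch the "]" disappears from the allowed set as soon as the newline removes the only alternative that could end, which is the failure mode described above.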

This can be verified by adding this test to test_jsonschemaparser.py:

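# Imports the test relies on (test_jsonschemaparser.py may already have them).
from typing import List, Optional
from pydantic import BaseModel, Field

def test_empty_list_with_newline():
    class EmptyListOKModel(BaseModel):
        num: int
        list_of_strings: Optional[List[str]] = Field(None, min_length=0, max_length=1)

    # An empty list with a newline between "[" and "]" is valid JSON and should be accepted.
    no_strings = '{"num":1,"list_of_strings":[\n]}'
    _test_json_schema_parsing_with_string(no_strings, EmptyListOKModel.model_json_schema(), True)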
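
As a sanity check that the input string itself is well-formed, the snippet below parses it with Python's standard json module. It only shows that JSON permits whitespace (including a newline) inside an empty array; it does not exercise lm-format-enforcer.

```python
import json

# Same string as in the test above: a newline between "[" and "]".
no_strings = '{"num":1,"list_of_strings":[\n]}'

parsed = json.loads(no_strings)
print(parsed)  # prints: {'num': 1, 'list_of_strings': []}
```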

I'm not sure what the best solution here is, but some ideas I have are:

  • Have ForceStopParser allow newlines/whitespace
  • Prevent UnionParser from dissolving itself if one of its parsers is a ForceStopParser
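
For illustration, here is a rough sketch of how the first idea might play out in a toy model. Everything in it is hypothetical: ToyStringParser, ToyForceStop, ToyUnion, the allow_whitespace flag and can_close_list_after are made-up names, not the library's API, and I don't know which approach the eventual fix took.

```python
WHITESPACE = " \t\n\r"


class ToyStringParser:
    """Stand-in for StringParsingState: wants a quoted string, cannot end yet."""

    def allowed(self):
        return set('"' + WHITESPACE)

    def can_end(self):
        return False

    def add_character(self, ch):
        return self  # the toy only ever feeds it leading whitespace


class ToyForceStop:
    """Stand-in for ForceStopParser, with an optional whitespace allowance."""

    def __init__(self, allow_whitespace=False):
        self.allow_whitespace = allow_whitespace

    def allowed(self):
        return set(WHITESPACE) if self.allow_whitespace else set()

    def can_end(self):
        return True

    def add_character(self, ch):
        return self


class ToyUnion:
    """Stand-in for UnionParser: dissolves when one alternative survives."""

    def __init__(self, parsers):
        self.parsers = parsers

    def allowed(self):
        return set().union(*(p.allowed() for p in self.parsers))

    def can_end(self):
        return any(p.can_end() for p in self.parsers)

    def add_character(self, ch):
        survivors = [p.add_character(ch) for p in self.parsers if ch in p.allowed()]
        return survivors[0] if len(survivors) == 1 else ToyUnion(survivors)


def can_close_list_after(text, force_stop):
    """Whether "]" would still be reachable after feeding `text` following "["."""
    top = ToyUnion([ToyStringParser(), force_stop])
    for ch in text:
        top = top.add_character(ch)
    return top.can_end()


print(can_close_list_after("\n", ToyForceStop()))                       # prints: False
print(can_close_list_after("\n", ToyForceStop(allow_whitespace=True)))  # prints: True
```

In this toy, letting the force-stop alternative accept whitespace keeps it alive inside the union, so the list can still be closed after any amount of whitespace. The second idea would instead change ToyUnion.add_character so it never drops a force-stop alternative.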

Any input on what solution would be most idiomatic would be greatly appreciated.

Thanks for the report! I hope to fix this in the near future.

Solved in v0.8.3