jgm / commonmark-hs

Pure Haskell commonmark parsing library, designed to be flexible and extensible

[fuzz result] inline processing instructions can't parse more than once in a block?

notriddle opened this issue:

There's something wrong with the way this sample is parsed.

this works wrong <?xml?> <?xml?>

this works fine <xml> <xml>

pandoc try

The second paragraph has two tags, but the first, for some reason, does not:

<p>this works wrong <?xml?> &lt;?xml?&gt;</p>

commonmark.js does what I'd expect, turning both processing instructions into inline HTML tags. The spec also doesn't seem to say anything that would make processing instructions behave differently from tags.
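For reference, the CommonMark spec defines a processing instruction as the string `<?`, a string of characters not including `?>`, and the string `?>`. A minimal sketch of that rule in Python shows why both instructions would be expected to come out as inline HTML. The function name `split_processing_instructions` and the event tuples are made up for illustration; this is not how commonmark.js or commonmark-hs are implemented, and it ignores every other inline construct.

```python
def split_processing_instructions(text):
    """Split text into ("text", s) and ("inline_html", s) events.

    Hypothetical helper: only processing instructions are recognized,
    i.e. "<?", then characters not containing "?>", then "?>".
    """
    events = []
    pos = 0
    while pos < len(text):
        start = text.find("<?", pos)
        if start == -1:
            break
        end = text.find("?>", start + 2)
        if end == -1:
            # No closer anywhere later, so the rest is plain text.
            break
        if start > pos:
            events.append(("text", text[pos:start]))
        events.append(("inline_html", text[start:end + 2]))
        pos = end + 2
    if pos < len(text):
        events.append(("text", text[pos:]))
    return events


if __name__ == "__main__":
    for event in split_processing_instructions("this works wrong <?xml?> <?xml?>"):
        print(event)
```

Under this reading there is nothing that would let the first instruction in a paragraph affect the second one.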

Events from pulldown-cmark:

"} <??><??>\n" -> [
  Start(Paragraph)
    Text(Borrowed("} "))
    InlineHtml(Borrowed("<??>"))
    InlineHtml(Borrowed("<??>"))
  End(Paragraph)
]

Events from pandoc:

"} <??><??>\n" -> [
  Start(Paragraph)
    Text(Boxed("} "))
    InlineHtml(Boxed("<??>"))
    Text(Boxed("<??>"))
  End(Paragraph)
]

Events from commonmark.js:

"} <??><??>\n" -> [
  Start(Paragraph)
    Text(Boxed("} "))
    InlineHtml(Boxed("<??>"))
    InlineHtml(Boxed("<??>"))
  End(Paragraph)
]

This is undoubtedly because of the scannedForProcessingInstruction stuff in https://github.com/jgm/commonmark-hs/blob/master/commonmark/src/Commonmark/Tag.hs#L160-L177
I think that was added to avoid some pathological parsing cases, but I can't see how the code was ever supposed to work! There is similar code for declarations. Probably both should be removed, but I need to remember why they were added.

Oh, I see -- this is nonbacktracking state. Well, then it should be easily fixed.
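
As a rough illustration of how a "scanned already" flag kept in non-backtracking state could produce this symptom, here is a toy model in Python. It is an assumption about the general shape of the bug, not a transcription of Tag.hs: the class `Scanner`, the flag `scanned_for_pi`, and the `mark_only_on_failure` switch are all hypothetical. The idea is that the flag exists to avoid rescanning the rest of the block for `?>` after a search has already failed; if it is set on every scan instead of only on failed ones, a second processing instruction is never attempted.

```python
class Scanner:
    """Toy inline scanner that only knows about processing instructions."""

    def __init__(self, text, mark_only_on_failure):
        self.text = text
        self.mark_only_on_failure = mark_only_on_failure
        # Survives across parse attempts, like non-backtracking state.
        self.scanned_for_pi = False

    def try_pi(self, pos):
        """Return the end index of a PI starting at pos, or None."""
        if self.scanned_for_pi or not self.text.startswith("<?", pos):
            return None
        end = self.text.find("?>", pos + 2)
        found = end != -1
        # Hypothetical policies: remember every scan, or only failed ones.
        if not found or not self.mark_only_on_failure:
            self.scanned_for_pi = True
        return end + 2 if found else None

    def events(self):
        out, buf, pos = [], "", 0
        while pos < len(self.text):
            end = self.try_pi(pos)
            if end is None:
                # Not a PI here: treat the character as plain text.
                buf += self.text[pos]
                pos += 1
            else:
                if buf:
                    out.append(("text", buf))
                    buf = ""
                out.append(("inline_html", self.text[pos:end]))
                pos = end
        if buf:
            out.append(("text", buf))
        return out


if __name__ == "__main__":
    print(Scanner("} <??><??>", mark_only_on_failure=False).events())
    print(Scanner("} <??><??>", mark_only_on_failure=True).events())
```

With `mark_only_on_failure=False` the toy reproduces the shape of the pandoc events above (inline HTML followed by text); with `mark_only_on_failure=True` both instructions are recognized, and the shortcut still applies once a search for `?>` has come up empty, since no later `<?` in the same text could find a closer either.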