jgm / commonmark-hs

Pure Haskell commonmark parsing library, designed to be flexible and extensible

Autolinks extension should ignore URIs inside link descriptions

kukimik opened this issue

Calling:

commonmark-cli -x autolinks <<EOF
[https://www.website.com](https://www.website.com#something)

[nobody@example.com](mailto:nobody@example.com?subject=Some%20subject)

[A website similar to https://www.foo.com and https://www.bar.com](https://www.baz.com)
EOF

results in (note the nested <a> tags):

<p><a href="https://www.website.com#something"><a href="https://www.website.com">https://www.website.com</a></a></p>
<p><a href="mailto:nobody@example.com?subject=Some%20subject"><a href="mailto:nobody@example.com">nobody@example.com</a></a></p>
<p><a href="https://www.baz.com">A website similar to <a href="https://www.foo.com">https://www.foo.com</a> and <a href="https://www.bar.com">https://www.bar.com</a></a></p>

while I would expect

<p><a href="https://www.website.com#something">https://www.website.com</a></p>
<p><a href="mailto:nobody@example.com?subject=Some%20subject">nobody@example.com</a></p>
<p><a href="https://www.baz.com">A website similar to https://www.foo.com and https://www.bar.com</a></p>

One reason is that nested links are invalid in both HTML4 and HTML5.

This bit me in srid/emanote#349.

Related issue about explicit autolinks: commonmark/commonmark-spec#719

Actually this may be a bit hard to achieve, given the architecture used in this library. If we were parsing to an AST, we could simply replace any links inside a link description with their link text. But this library allows you to parse directly to an output format, so this isn't possible in general. Moreover, we don't know whether a bit of text is part of a link description until AFTER we've parsed it as an autolink, since the matching of brackets takes place at a later stage.
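To illustrate the constraint, here is a toy sketch of parsing "directly to an output format". The class `InlineOut` and the type `Html` are hypothetical stand-ins invented for this example, not the library's actual API. The point is only that once the inner autolink has been rendered, the outer link constructor receives an opaque value and cannot look inside it.

```haskell
-- Hypothetical, simplified output interface (NOT the library's real class).
class Monoid a => InlineOut a where
  str  :: String -> a        -- plain text
  link :: String -> a -> a   -- destination URL, already-built description

-- One possible output format: a rendered HTML string.
newtype Html = Html String
  deriving (Eq, Show)

instance Semigroup Html where
  Html a <> Html b = Html (a ++ b)

instance Monoid Html where
  mempty = Html ""

instance InlineOut Html where
  str = Html  -- no HTML escaping in this toy example
  -- The description arrives as finished HTML text; there is no
  -- structure left to inspect for nested links.
  link url (Html desc) =
    Html ("<a href=\"" ++ url ++ "\">" ++ desc ++ "</a>")

-- The autolink is built first; the enclosing link is only wrapped
-- around it later, once the brackets have been matched.
nested :: Html
nested = link "https://www.baz.com"
              (link "https://www.foo.com" (str "https://www.foo.com"))
```

Here `nested` ends up containing one `<a>` element inside another, which is the shape of output reported above.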

If you parse to an AST (which is possible, just not required, with this library), then you can always walk the document after parsing and remove links inside links.
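A minimal sketch of such a post-parse walk, written against a toy inline AST. The `Inline` type and the functions `stripLinks` and `unnestLinks` are hypothetical names made up for this illustration; with a real AST (for example a Pandoc document) the same idea would apply to that AST's own link constructor.

```haskell
-- Toy inline AST; hypothetical, not the library's own types.
data Inline
  = Str String
  | Emph [Inline]
  | Link String [Inline]  -- destination URL, description
  deriving (Eq, Show)

-- Replace every link, at any depth, by its description.
stripLinks :: [Inline] -> [Inline]
stripLinks = concatMap go
  where
    go (Link _ desc) = stripLinks desc
    go (Emph xs)     = [Emph (stripLinks xs)]
    go x             = [x]

-- Keep each outermost link, but remove links nested in its description.
unnestLinks :: [Inline] -> [Inline]
unnestLinks = map go
  where
    go (Link url desc) = Link url (stripLinks desc)
    go (Emph xs)       = Emph (unnestLinks xs)
    go x               = x

-- The third example from the report, as it would look after parsing
-- with the autolinks extension enabled.
example :: [Inline]
example =
  [ Link "https://www.baz.com"
      [ Str "A website similar to "
      , Link "https://www.foo.com" [Str "https://www.foo.com"]
      , Str " and "
      , Link "https://www.bar.com" [Str "https://www.bar.com"]
      ]
  ]
```

Applying `unnestLinks` to `example` keeps the outer link to `https://www.baz.com` and turns the two inner autolinks back into plain text. Adjacent `Str` nodes are left unmerged in this sketch.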