thephpleague / html-to-markdown

Convert HTML to Markdown with PHP

Escape characters incorrectly added in front of valid markdown bullets.

deetergp opened this issue · comments

Version(s) affected

5.1

Description

Given a string that contains a combination of HTML line breaks and Markdown bullets, when the HTML is converted, the bullets are escaped. For example:

String

"List of stuff:<br />- List item one<br />- List <a href="http://foo.com" target="_blank" rel="noreferrer noopener">item</a> two<br />* List item [three] with braces"

Expected Result

List of stuff:
- List item one
- List [item](http://foo.com) two
* List item [three] with braces

Actual Result

List of stuff:
\- List item one
\- List [item](http://foo.com) two
\* List item \[three\] with braces

How to reproduce

See description.

Hmm, so the change is happening here but I don't really understand why. Happy to propose a fix once I understand why that method exists at all.

I don't believe this is an issue. The purpose of this library is to function as a general-purpose converter from one specific data type (HTML) to Markdown. In your example, the input string mixes HTML with Markdown, with the expectation that the converter will be aware of this and handle each data type accordingly. This is a misconception.

The purpose of the method you've identified is to avoid formatting problems when converting what it sees as basic paragraph text. The resulting strings may contain characters which can be erroneously parsed as Markdown by an interpreter further up the stack. Since its job is to return basic paragraph text, it intentionally escapes those characters.

Correct usage would have your list presented as well-formed HTML using <ul> and <li> tags for the converter to then turn into appropriate markdown.
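As a rough sketch of what that could look like when you can't change the source of the HTML: the helper below rewrites "<br />- item" pseudo-lists into real <ul>/<li> markup before conversion. The function name brBulletsToHtmlList is hypothetical and not part of the library, and wrapping the non-list segments in <p> tags is an assumption on my part that you may want to change.

```php
<?php
// Hypothetical pre-processing helper (NOT part of league/html-to-markdown):
// turns "<br />- item" / "<br />* item" pseudo-lists into <ul>/<li> markup.
function brBulletsToHtmlList(string $html): string
{
    // Split the input on <br>, <br/> or <br /> (case-insensitive).
    $segments = preg_split('/<br\s*\/?>/i', $html);

    $out = '';
    $inList = false;
    foreach ($segments as $segment) {
        // A segment starting with "- " or "* " is treated as a list item.
        if (preg_match('/^\s*[-*]\s+(.*)$/s', $segment, $m)) {
            if (!$inList) {
                $out .= '<ul>';
                $inList = true;
            }
            $out .= '<li>' . $m[1] . '</li>';
        } else {
            if ($inList) {
                $out .= '</ul>';
                $inList = false;
            }
            // Assumption: every non-list segment becomes its own paragraph.
            $out .= '<p>' . $segment . '</p>';
        }
    }
    if ($inList) {
        $out .= '</ul>';
    }
    return $out;
}

$html = 'List of stuff:<br />- List item one<br />- List <a href="http://foo.com" target="_blank" rel="noreferrer noopener">item</a> two<br />* List item [three] with braces';
$wellFormed = brBulletsToHtmlList($html);
```

$wellFormed can then be handed to $converter->convert() in place of the original string. Note that the square brackets inside a list item are still ordinary text to the converter, so they will most likely still be escaped.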

You can perform your conversion in multiple passes to get around this problem. First run your code as-is to get the Actual Result, then run that string through another method that locates the desired escaped characters and un-escapes them as needed until it returns your Expected Result.

For example (this is unexecuted pseudo-code):

// run converter
$markdown = $converter->convert( $html );

// regex to un-escape a leading escaped asterisk or hyphen on each line.
// alter or expand as needed for other characters (numbered lists, etc.)
// note: matching one literal backslash takes four backslashes in a single-quoted PHP pattern.
$markdown = preg_replace( '/^\\\\([-*]\s)/m', '$1', $markdown );
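To make the second pass concrete, here is a self-contained sketch run against the Actual Result from the report. The function name unescapeListMarkers and its $unescapeBrackets flag are made up for illustration. I'm also assuming each bullet sits at the very start of a line, as in the report.

```php
<?php
// Hypothetical post-processing helper (NOT part of league/html-to-markdown).
function unescapeListMarkers(string $markdown, bool $unescapeBrackets = false): string
{
    // '\\\\' in a single-quoted PHP string is the regex \\ (one literal backslash).
    // Un-escape "\- " or "\* " only when it starts a line.
    $markdown = preg_replace('/^\\\\([-*]\s)/m', '$1', $markdown);

    if ($unescapeBrackets) {
        // Optional and riskier: un-escape every "\[" and "\]" in the text.
        $markdown = preg_replace('/\\\\([\[\]])/', '$1', $markdown);
    }
    return $markdown;
}

// The "Actual Result" from the issue, written as a PHP string.
$actual = "List of stuff:\n"
    . "\\- List item one\n"
    . "\\- List [item](http://foo.com) two\n"
    . "\\* List item \\[three\\] with braces";

$bulletsOnly = unescapeListMarkers($actual);
$withBrackets = unescapeListMarkers($actual, true);

echo $withBrackets, "\n";
```

With the flag off, the bullets come back but \[three\] stays escaped. With it on, the output matches the Expected Result. Be careful with the bracket pass: un-escaping brackets everywhere can turn literal text into link syntax if it happens to be followed by parentheses.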

Sorry for not seeing the original issue report! I agree with @pandymic, this is expected behavior.

Ideally, the resulting Markdown should return the same HTML when you run it through a Markdown parser. The escape characters are necessary to make this happen, and are valid Markdown, and thus this is correct. Plug your actual and expected results into https://spec.commonmark.org/dingus/ to see :)