erusev / parsedown

Better Markdown Parser in PHP

Home Page: https://parsedown.org

Parsedown: get all image links

MarkMessa opened this issue

Is it possible to get all image links parsed by Parsedown?
I'm considering something like:

$Parsedown = new Parsedown();
$file = file_get_contents('filename.txt');
echo $Parsedown->text($file);

# output
image1.png
image2.png

filename.txt

![][image1]
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam porttitor nulla id luctus hendrerit.

![](image2.png)
Integer sed ultricies ante, sed mattis mauris. Donec et nisl sapien. 

[image1]: image1.png

Hook into the inlineImage method and capture every src value in a public property:

class ParsedownGetImageSrc extends Parsedown {
    public $imageSrcData = [];
    public function inlineImage($Excerpt) {
        if ($Inline = parent::inlineImage($Excerpt)) {
            if (isset($Inline['element']['attributes']['src'])) {
                $this->imageSrcData[] = $Inline['element']['attributes']['src'];
            }
        }
        return $Inline;
    }
}

$parser = new ParsedownGetImageSrc;
$text = $parser->text(' ... ');

# All image `src` data now stored in `imageSrcData`
echo json_encode($parser->imageSrcData);

OK, seems to work fine. Thanks!

php > require 'Parsedown.php';
php > require 'ParsedownGetImageSrc.php';
php > $parser = new ParsedownGetImageSrc;
php > 
php > $parser->text('Lorem ![](filename1.ext) ipsum.
php ' Dolor ![][image] sit amet.
php ' [image]: filename2.ext');
php > 
php > echo json_encode($parser->imageSrcData);
["filename1.ext","filename2.ext"]

@tovic

Considering that your extension requires the overhead of executing the whole Parsedown, I was considering a lighter alternative such as regex:

  • \!\[.*\]\((\S+)\s*.*\) to match ![alt](filename.ext 'title')
  • \[.+\]\:\s(\S+)(?:\s".*")? to match [image1]: image1.png "some title"

Any comment?

Your regex will give wrong results on input like this:

![a](b)

aaa ![a](b) bbbb

    ![a](b)

~~~
![a](b)
~~~

aaa `![a](b)` bbb
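
To make the failure concrete, here is a minimal sketch that runs the inline-image pattern proposed above over a small sample. The sample text and file names are made up for illustration; only the first line is a real image, while the other two are an escaped reference and a code span.

```php
<?php
// The inline-image pattern proposed earlier in this thread, wrapped in delimiters.
$pattern = '/\!\[.*\]\((\S+)\s*.*\)/';

// Hypothetical sample: one real image, one escaped reference, one code span.
$markdown = implode("\n", [
    'aaa ![a](real.png) bbb',
    '',
    '\![a](escaped.png) ccc',
    '',
    'aaa `![a](code.png)` bbb',
]);

// Capture group 1 holds whatever the pattern takes to be the image path.
preg_match_all($pattern, $markdown, $matches);
$found = $matches[1];

print_r($found);
```

All three paths end up in $found, although only real.png would be rendered as an image. The pattern has no notion of escapes or code context, which is the whole problem.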

It also fails with escaped references:

![a](b)

![c](d)

\![a](b)

~~~
![a](b)
~~~

`![a](b)`

Any idea how to fix that?

Not possible without parsing it. The other solution is to parse the Markdown syntax to HTML and search for <img> tags with DOMDocument and such. That way you don’t need to extend the Parsedown class.
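
A rough sketch of that DOMDocument route, assuming the DOM extension is enabled. The function name extractImageSources is made up for this example, and the $html string is written by hand as a stand-in for what $Parsedown->text() would return, so the snippet runs without Parsedown installed.

```php
<?php
// Collect the src attribute of every <img> element in an HTML fragment.
// extractImageSources is a hypothetical helper, not part of Parsedown.
function extractImageSources(string $html): array
{
    // loadHTML() rejects an empty string, so return early.
    if ($html === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            $sources[] = $img->getAttribute('src');
        }
    }
    return $sources;
}

// Hand-written stand-in for rendered output: an image, a code block, a link.
$html = '<p>Lorem <img src="filename1.ext" alt="" /> ipsum.</p>'
      . '<pre><code>![a](b)</code></pre>'
      . '<p><a href="photo.png">link</a></p>';

echo json_encode(extractImageSources($html));
```

Only the real <img> is picked up. The image syntax inside the code block stays plain text in the HTML, and the <a> link is ignored because the lookup is by tag name.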

Not possible without parsing it.

Parsing the document against the full Parsedown syntax to get just the image links is somewhat inefficient.

The other solution is to parse the Markdown syntax to HTML and search for <img> tags with DOMDocument and such.

Again, this seems inefficient. There is a lot of overhead in generating the full HTML and then searching it for tags. It would be better to search for image links directly in the Markdown source.

Then just match every image URL. You should be able to find a suitable pattern somewhere on the internet, for example:

/^https?:\/\/\S+\.(?:gif|jpe?g|png|svg)$/

This way you will get the URLs from <img>, but also from <a>, which is not what I want.
Besides, it will fail in the following cases:

# local file path instead of url
![title](filename.ext)

# escaped reference
\![a](b)

# code block
~~~
![a](b)
~~~

# code span
`![a](b)`

Note: The current accepted answer is already fine for me. The overhead concern is just an observation, not an actual bottleneck.