Parsedown: get all image links

Question

MarkMessa opened this issue 5 years ago

Is it possible to get all image links parsed by Parsedown?
I'm considering something like:

$Parsedown = new Parsedown();
$file = file_get_contents('filename.txt');
echo $Parsedown->text($file);

# output
image1.png
image2.png

filename.txt

![][image1]
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam porttitor nulla id luctus hendrerit.

![](image2.png)
Integer sed ultricies ante, sed mattis mauris. Donec et nisl sapien. 

[image1]: image1.png

Taufik Nurrohman · Answer 1 · Tue Sep 17 2019 14:16:57 GMT+0800 (China Standard Time)

Hook into the inlineImage method and capture every src value in a public property:

class ParsedownGetImageSrc extends Parsedown {
    public $imageSrcData = [];
    public function inlineImage($Excerpt) {
        if ($Inline = parent::inlineImage($Excerpt)) {
            if (isset($Inline['element']['attributes']['src'])) {
                $this->imageSrcData[] = $Inline['element']['attributes']['src'];
            }
        }
        return $Inline;
    }
}

$parser = new ParsedownGetImageSrc;
$text = $parser->text(' ... ');

# All image `src` values are now stored in `imageSrcData`
echo json_encode($parser->imageSrcData);

Mark Messa · Answer 2 · Tue Sep 17 2019 15:57:53 GMT+0800 (China Standard Time)

OK, seems to work fine. Thanks!

php > require 'Parsedown.php';
php > require 'ParsedownGetImageSrc.php';
php > $parser = new ParsedownGetImageSrc;
php > 
php > $parser->text('Lorem ![](filename1.ext) ipsum.
php ' Dolor ![][image] sit amet.
php ' [image]: filename2.ext');
php > 
php > echo json_encode($parser->imageSrcData);
["filename1.ext","filename2.ext"]

Mark Messa · Answer 3 · Tue Sep 17 2019 18:24:00 GMT+0800 (China Standard Time)

@tovic

Considering that your extension requires the overhead of executing the whole Parsedown, I was considering a lighter alternative such as regex:

\!\[.*\]\((\S+)\s*.*\) to match ![alt](filename.ext 'title')
\[.+\]\:\s(\S+)(?:\s".*")? to match [image1]: image1.png "some title"

Any comment?

Taufik Nurrohman · Answer 4 · Tue Sep 17 2019 23:48:09 GMT+0800 (China Standard Time)

That will fail in these cases:

![a](b)

aaa ![a](b) bbbb

    ![a](b)

~~~
![a](b)
~~~

aaa `![a](b)` bbb
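
To make the failure concrete, here is a minimal sketch that runs the first pattern proposed above over two of these inputs. The helper name `regexImageTargets` is made up for illustration; only `preg_match_all` is standard PHP.

```php
<?php
// Hypothetical helper: applies the inline-image regex proposed above
// and returns whatever it captures as the image target.
function regexImageTargets(string $markdown): array
{
    preg_match_all('/\!\[.*\]\((\S+)\s*.*\)/', $markdown, $matches);
    return $matches[1];
}

// A real inline image: this capture is wanted.
$real = regexImageTargets('aaa ![a](b) bbbb');

// The same syntax inside a code span: Markdown renders it as literal
// text, but the regex cannot tell the difference and still captures it.
$codeSpan = regexImageTargets('aaa `![a](b)` bbb');

var_dump($real, $codeSpan);
```

Both calls capture `b`, so the pattern reports an image for the code span even though no image would be rendered there.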

Mark Messa · Answer 5 · Wed Sep 18 2019 00:44:18 GMT+0800 (China Standard Time)

It also fails with escaped references (demo):

![a](b)

![c](d)

\![a](b)

~~~
![a](b)
~~~

`![a](b)`

Any idea how to fix that?

Taufik Nurrohman · Answer 6 · Wed Sep 18 2019 00:57:34 GMT+0800 (China Standard Time)

Not possible without parsing it. The other solution is to parse the Markdown syntax to HTML and search for <img> tag with DOMDocument and such. So you don’t need to extend the Parsedown class.
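
A minimal sketch of that second approach, assuming the HTML has already been produced by Parsedown. The helper name `htmlImageSources` is hypothetical, and the HTML string is written by hand as a stand-in for Parsedown's output so the example runs on its own; `DOMDocument` is PHP's bundled DOM extension.

```php
<?php
// Hypothetical helper: collects the src attribute of every <img> element
// found in an HTML fragment.
function htmlImageSources(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects an empty string
    }

    $document = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($document->getElementsByTagName('img') as $image) {
        $sources[] = $image->getAttribute('src');
    }
    return $sources;
}

// Hand-written stand-in for the string returned by $Parsedown->text($file).
$html = '<p><img src="image1.png" alt="" /> Lorem ipsum.</p>'
      . '<p><img src="image2.png" alt="" /> Integer sed.</p>'
      . '<pre><code>![a](b)</code></pre>';

echo json_encode(htmlImageSources($html));
```

For this input the sketch prints `["image1.png","image2.png"]`. The image syntax inside the code block is plain text in the HTML, so it is not picked up.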

Mark Messa · Answer 7 · Wed Sep 18 2019 07:07:16 GMT+0800 (China Standard Time)

> Not possible without parsing it.

Parsing the document against the full Parsedown syntax to get just the image links is somewhat inefficient.

> The other solution is to parse the Markdown syntax to HTML and search for <img> tag with DOMDocument and such.

Again, this seems inefficient. There is a lot of overhead in creating a full HTML version and then searching it for tags. It would be better to search for image links directly in the Markdown source.

Taufik Nurrohman · Answer 8 · Wed Sep 18 2019 07:15:16 GMT+0800 (China Standard Time)

Then just match every image URL. You should be able to find a suitable pattern online, for example:

/^https?:\/\/\S+\.(?:gif|jpe?g|png|svg)$/

Mark Messa · Answer 9 · Wed Sep 18 2019 08:05:41 GMT+0800 (China Standard Time)

This way you will get the URLs from <img>, but also from <a>, which is not what I want.
Besides, it will fail in the following cases:

# local file path instead of url
![title](filename.ext)

# escaped reference
\![a](b)

# code block
~~~
![a](b)
~~~

# code span
`![a](b)`

Note: The current accepted answer is already fine for me. The overhead is just an observation, not an actual bottleneck.