JohannesKaufmann / html-to-markdown

⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules.

🐛 Bug Can not handle img

youngjuning opened this issue

Describe the bug
A clear and concise description of what the bug is.

HTML Input

<figure><img class="lazyload inited loaded" data-src="https://i.loli.net/2020/08/13/cVomW7L9YOTw2uA.png" data-width="800" data-height="600" src="https://i.loli.net/2020/08/13/cVomW7L9YOTw2uA.png"><figcaption></figcaption></figure>

Generated Markdown

<img class="lazyload inited loaded" data-src="https://i.loli.net/2020/08/13/cVomW7L9YOTw2uA.png" data-width="800" data-height="600" src="https://i.loli.net/2020/08/13/cVomW7L9YOTw2uA.png">

Expected Markdown

nothing

I assume that you meant the following HTML:

<figure>
    <img
        class="lazyload inited loaded"
        data-src="https://i.loli.net/2020/08/13/cVomW7L9YOTw2uA.png"
        data-width="800"
        data-height="600"
        src=""     // empty?
        >
    <figcaption></figcaption>
</figure>

The "src" attribute is empty because the image is lazy-loaded.


I have thought about using “data-src” automatically when “src” is empty.

But there are three problems:

  1. "data-src" can be filled with any data. It is not guaranteed to contain the URL.
  2. The image url could also be somewhere else, like “data-lazy-url”.
  3. Some websites display a placeholder (for example “blank.gif”, the colours of the image, or a really low resolution of the image) in the “src”. The library can't really find out which image url is better unless it loads the images...
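If you do know a site's lazy-loading conventions, the first two problems can be narrowed down by checking a site-specific list of attributes and accepting only values that look like real URLs. The following is a minimal sketch of that idea using only the standard library. The function name `pickImageURL` and the candidate list are hypothetical and not part of html-to-markdown. It also assumes image URLs are absolute http(s) URLs, so relative and protocol-relative paths are rejected.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// pickImageURL is a hypothetical helper (not part of html-to-markdown).
// It walks the candidate attribute names in order and returns the first
// value that parses as an absolute http(s) URL, or "" if none qualifies.
func pickImageURL(attrs map[string]string, candidates []string) string {
	for _, name := range candidates {
		raw := strings.TrimSpace(attrs[name])
		if raw == "" {
			continue // attribute missing or empty, e.g. src=""
		}
		u, err := url.Parse(raw)
		if err != nil || u.Host == "" {
			continue // arbitrary data or a relative path such as "blank.gif"
		}
		if u.Scheme != "http" && u.Scheme != "https" {
			continue // assumption: only http(s) image URLs are wanted
		}
		return raw
	}
	return ""
}

func main() {
	// Attributes of the <img> from this issue, with an empty "src".
	attrs := map[string]string{
		"src":      "",
		"data-src": "https://i.loli.net/2020/08/13/cVomW7L9YOTw2uA.png",
	}
	// The candidate order is site-specific. Putting "data-src" before "src"
	// would prefer the lazy-loaded URL over an absolute placeholder URL.
	fmt.Println(pickImageURL(attrs, []string{"src", "data-src", "data-lazy-url"}))
}
```

This does not solve the third problem: when both "src" and "data-src" hold valid URLs, only the order of the candidate list decides which one wins, so that order has to come from knowledge of the website.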

@youngjuning If you know the website and you know what the rules for lazy-loading are, I would recommend the following function:

// The hook-function is called before the rules are run. You can change the html that is passed to the "img" rule.
conv.Before(func(selec *goquery.Selection) {
	selec.Find("img").Each(func(i int, s *goquery.Selection) {
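		// Keep an existing, non-empty "src". Lazy-loaded images often
		// carry src="" (attribute present but empty), so checking only
		// for the attribute's presence would skip them.
		if s.AttrOr("src", "") != "" {
			return
		}

		s.SetAttr("src", s.AttrOr("data-src", ""))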
	})
})

But your solution works as well 👍

@JohannesKaufmann Thanks for your answer, you are so 👍