akosbalasko / yarle

Yarle - The ultimate converter of Evernote notes to Markdown

Home Page: https://github.com/akosbalasko/yarle

Webclips extraction confused by tabs

Tokolino opened this issue

I have a webclipped note with tabs at the beginning of some lines. This makes those lines be interpreted as a code block, so linked images are not recognised as images.

The attached note shows this behavior (in the lower part).
Debug.zip

Well, I'm afraid that by default Obsidian treats leading tabs as code blocks, and images within a code block are not supported by Obsidian at all. So I think this is a feature request, but what would be the desired behavior? What should it look like?

Screenshot 2024-01-04 at 10 34 19

I think the problem is that Yarle writes tabs at the beginning of the line at all. These tabs may come from the HTML source code, but in Obsidian they take on a different (wrong) meaning. So the correct behavior would be to remove or escape the tabs (if possible).
This is not only a problem for image extraction: because the section is treated as a code block, Markdown formatting is not rendered but shown as raw text.
The only reasonable solution is that tabs should appear at the beginning of a line only when the text is intended to be a code block.

Aha. And what about a general "skip recognition of tabs as code blocks (by removing them)" toggle in the configuration panel?
Regarding your last sentence, though, the question is how Yarle should recognize that "the text is intended to be a code block". Any ideas are welcome.

The only indicators I know of (and I am not an HTML expert) are the tags <pre> and <code>. Concerning notes other than webclips:
Since leading tabs in notes are not an indication of code blocks, one idea is to replace them with a certain number of spaces when they are at the beginning of a line. Another is to add an additional space at the beginning of the line to prevent interpretation as a code block.
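The first idea above could be sketched roughly as follows. This is only an illustration: `replaceLeadingTabs` and its `spacesPerTab` parameter are hypothetical names, not part of Yarle. One caveat worth noting is that CommonMark also treats four or more leading spaces as an indented code block, so the sketch caps the replacement at three spaces.

```typescript
// Hypothetical helper (not part of Yarle): replaces tabs at the start of
// each line with spaces, never emitting more than three spaces, because
// four or more leading spaces would again form an indented code block.
const MAX_SAFE_INDENT = 3;

function replaceLeadingTabs(text: string, spacesPerTab: number = 2): string {
  return text
    .split("\n")
    .map((line) => {
      const match = /^\t+/.exec(line);
      if (match === null) {
        return line; // no leading tabs, keep the line untouched
      }
      const tabCount = match[0].length;
      const width = Math.min(tabCount * spacesPerTab, MAX_SAFE_INDENT);
      return " ".repeat(width) + line.slice(tabCount);
    })
    .join("\n");
}
```

Tabs that appear later in a line are left alone, since only leading whitespace affects code block detection.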

Yes, that's true, but unfortunately Evernote's ENEX content is not clean, standard HTML. I have now tested it in the latest version, and the code block is stored as follows:

<div style="--en-codeblock:true;  ...

But as far as I remember, this kind of new format was introduced in v10, and it was stored differently in v7. As I don't have that version, could you please create a simple code block note in v7 and send me its enex export from v7?

Thanks a lot!

No problem.
Debug.zip

Thank you! Okay, the old one has almost the same div + style but a different attribute: -en-codeblock:true.
So the solution would be to convert code blocks only if these settings are found in the note, and to trim tabs from the beginning of the lines to prevent them from being recognized as code blocks in Markdown.
Sounds good?
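A minimal sketch of the detection step described above, assuming the code block marker is always written into the inline `style` attribute as either `--en-codeblock:true` (v10) or `-en-codeblock:true` (v7), as seen in the two exports. The function name `hasEvernoteCodeblockMarker` is hypothetical and this is not Yarle's actual implementation.

```typescript
// Matches "--en-codeblock:true" and "-en-codeblock:true" at the start of
// the style string or after a ";" or whitespace, tolerating spaces around ":".
const CODEBLOCK_MARKER = /(?:^|[;\s])--?en-codeblock\s*:\s*true\b/i;

// Hypothetical helper: returns true when an element's inline style
// carries one of the two Evernote code block markers.
function hasEvernoteCodeblockMarker(style: string | null | undefined): boolean {
  if (!style) {
    return false;
  }
  return CODEBLOCK_MARKER.test(style);
}
```

The pattern has no `g` flag, so repeated calls to `test` do not carry state between elements.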

What happens when you webclip this html page?

<html>
  <body>
    This is not a code block.
	<pre>
	  This is a code block.
	</pre>
    This is not a codeblock
  </body>
 </html>

I made it here: https://amethyst-juliana-94.tiiny.site/
Then I exported it in all of the meaningful variations (Fullpage, Article, SimplifiedArticle, Selection); here are the results:
In the case of Fullpage and Article, it keeps the pre tags but slightly transforms the rest of the page, like this:

<div style="min-height: 93px; font-size: 16px; display: block; min-width: 100%; position: relative;"> <div><div><span>
    This is not a code block.
	</span><pre>	  This is a code block.
	</pre>
    This is not a codeblock
  
 </div></div></div>

In SimplifiedArticle it replaces pre with div and adds the attribute <div style="--en-codeblock:true; :

<div style="--en-codeblock:true; --en-lineWrapping:false;box-sizing: border-box; padding: 8px; font-family: Monaco, Menlo, Consolas, &quot;Courier New&quot;, monospace; font-size: 12px; color: rgb(51, 51, 51); border-top-left-radius: 4px; border-top-right-radius: 4px; border-bottom-right-radius: 4px; border-bottom-left-radius: 4px; background-color: rgb(251, 250, 248); border: 1px solid rgba(0, 0, 0, 0.14902); background-position: initial initial; background-repeat: initial initial;"><div>      This is a code block.</div><div>    </div></div><div>     This is not a codeblock     </div>

and in Multiple Selection it replaces pre with div, adds the attribute <div style="--en-codeblock:true; and replaces some other elements with span:

<div style="--en-codeblock:true; --en-lineWrapping:false;box-sizing: border-box; padding: 8px; font-family: Monaco, Menlo, Consolas, &quot;Courier New&quot;, monospace; font-size: 12px; color: rgb(51, 51, 51); border-top-left-radius: 4px; border-top-right-radius: 4px; border-bottom-right-radius: 4px; border-bottom-left-radius: 4px; background-color: rgb(251, 250, 248); border: 1px solid rgba(0, 0, 0, 0.14902); background-position: initial initial; background-repeat: initial initial;"><div>      This is a code block.</div><div>    </div></div><div><span style="font-size: 16px;">     This is not a codeblock     </span></div>

The variations differ in layout because of the inline styles added to the divs; see the screenshots:
Screenshot 2024-01-04 at 15 34 34
Screenshot 2024-01-04 at 15 34 38
Screenshot 2024-01-04 at 15 34 42
Screenshot 2024-01-04 at 15 34 47

So, long story short: Yarle has to be prepared for all of the possibilities, i.e. handle the attribute and the "pre" tag as well, plus offer a toggle to trim the tabs from the beginning of the lines.
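Put together, the plan above might look something like the following sketch. Everything here is an assumption made for illustration: the `ElementInfo` and `TabOptions` shapes, the `trimLeadingTabs` option name, and the `renderText` function are hypothetical and do not reflect Yarle's real code or configuration keys.

```typescript
// Hypothetical, simplified view of an element taken from the ENEX content.
interface ElementInfo {
  tagName: string;
  style?: string;
}

// Hypothetical toggle; the real option name in Yarle may differ.
interface TabOptions {
  trimLeadingTabs: boolean;
}

const FENCE = "`".repeat(3);

// An element counts as code if it is a <pre> or carries the Evernote
// marker ("--en-codeblock:true" in v10, "-en-codeblock:true" in v7).
function isCodeElement(el: ElementInfo): boolean {
  const style = el.style ?? "";
  return el.tagName.toLowerCase() === "pre" || style.includes("en-codeblock:true");
}

function renderText(el: ElementInfo, text: string, options: TabOptions): string {
  if (isCodeElement(el)) {
    // Real code: keep the indentation and fence it explicitly.
    return FENCE + "\n" + text + "\n" + FENCE;
  }
  // Not code: optionally drop leading tabs so Markdown does not
  // mistake the text for an indented code block.
  return options.trimLeadingTabs ? text.replace(/^\t+/gm, "") : text;
}
```

Fencing detected code explicitly means its indentation no longer matters for recognition, which is what makes trimming tabs elsewhere safe.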

You should also check what happens if a plain text note just contains tabs for formatting reasons.

Eh... okay, I tested the code block handling and everything looks fine (I added some extra tests).
Then I switched to the tab issue, but I realized that in Obsidian images work well and are shown even if they are on an indented line (one with a tab at the beginning).
Then I checked your enex file, and I think the problem is not related to the tabs at all, but to the fact that the images are GIFs, which are not shown in Obsidian at all. I'm not sure whether that is a bug in Obsidian or not, but I think it is not related to the conversion.

I think you are not quite correct. I checked again with the note I provided in the original post. This is how a part of it looks in reading mode:
image
This is how it looks in edit mode:
image
And this is how it looks in edit mode with source view:
image
The image which is linked in this section has no back references:
image
(the image seems to be broken, but this is not the issue here)

Now I manually removed the leading tabs in this section, and this changed the situation. How it looks in edit mode with source view:
image
In edit mode:
image
And in reading mode:
image
And the image has its backlink:
image

All this change was only due to the removal of the leading tabs.

However, I can reproduce the behaviour that the link is displayed (and linked) in a simple test note that has leading tabs. Still, as shown above, in the other note the tabs clearly have an effect on the linking. Currently I have no idea what the reason is...

What I also notice: When I use tabs in simple text notes, then the text is not displayed in monospace, meaning it is not considered as code block. And of course, if a section is not considered as code, then images are linked and pictures are displayed.

Maybe editing the line that contains a link triggers a reload of the references, mentions, backlinks, etc.

Nope. When I just change text in this line then nothing changes. When I remove the tab, then the backlink appears. When I enter the tab again, then the backlink is gone again. You can try yourself with the note I provided. My case is around line 209.

Well, yeah, this shows exactly that Obsidian gives tabs (i.e. indentation) "more precedence" and interprets them differently from normal text: after one or more leading tabs it loses the backlinks.
So I still think that the problem lies in Obsidian, not in Yarle, and I hardly see anything to fix on Yarle's side.
I no longer want to implement an option that removes leading indentation, because it can easily lead to data loss (I mean loss of indentation where it has real meaning).
I don't want to implement workarounds either just because Obsidian interprets the produced Markdown differently.
The only way I can see to offer a controlled solution is a config option that removes the indentation IF the line contains a link AND the note is a webclip.
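The restricted option proposed above could be sketched like this. All names are hypothetical (`removeIndentOnLinkLines`, `isWebClip`, `unindentLinkLines`), and how a note is identified as a webclip is left out and simply passed in as a flag. The link pattern is also an assumption: it covers standard Markdown links and images as well as Obsidian-style wikilinks.

```typescript
// Hypothetical shapes; not Yarle's real types or option names.
interface NoteContext {
  isWebClip: boolean;
}

interface IndentOptions {
  removeIndentOnLinkLines: boolean;
}

// Wikilinks/embeds like [[note]] or ![[image.gif]], and standard
// Markdown links/images like [text](url) or ![alt](url).
const LINK_PATTERN = /!?\[\[[^\]]+\]\]|!?\[[^\]]*\]\([^)]+\)/;

function unindentLinkLines(
  markdown: string,
  note: NoteContext,
  options: IndentOptions
): string {
  // Both conditions must hold, otherwise the text is returned unchanged.
  if (!options.removeIndentOnLinkLines || !note.isWebClip) {
    return markdown;
  }
  return markdown
    .split("\n")
    .map((line) => (LINK_PATTERN.test(line) ? line.replace(/^[ \t]+/, "") : line))
    .join("\n");
}
```

Lines without a link keep their indentation, which limits the data loss concern to lines where the indentation is known to break linking.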

My current impression is:
If there is a non-indented line of text directly above the indented line, then the indentation does not create a code block. If there is no such line, then it creates a code block.
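This observation matches the CommonMark rule that an indented code block cannot interrupt a paragraph. The sketch below models only that rule, assuming Obsidian follows CommonMark here; it ignores lists, headings, fences, and mixed space/tab indentation, and `classifyLines` is a hypothetical name used for illustration.

```typescript
type LineKind = "blank" | "code" | "text";

// Simplified model: a line indented by a tab or four spaces starts or
// continues a code block unless it directly follows paragraph text, in
// which case it is just a continuation of that paragraph.
function classifyLines(lines: string[]): LineKind[] {
  const kinds: LineKind[] = [];
  let inParagraph = false;
  for (const line of lines) {
    if (line.trim() === "") {
      kinds.push("blank");
      inParagraph = false; // a blank line ends the paragraph
    } else if (/^(\t| {4})/.test(line) && !inParagraph) {
      kinds.push("code");
    } else {
      kinds.push("text");
      inParagraph = true;
    }
  }
  return kinds;
}
```

Under this model an indented line directly below normal text is classified as text, while the same line after a blank line or after another code line is classified as code.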

I think the solution with the additional webclip option is best...

It's also mentioned in the help vault (source code view):
image

From my experience: if the line above the tab-indented line is also a code block, then the line is formatted as a code block. So if I remove the blank lines between the two blocks, it is still treated as a code block, because the block above is a code block too. If I just write normal (non-indented) text above it, then it is formatted as indented text.

Thank you! But I think it is not working in all cases. It does in a lot of cases, but not all of them... In the attached enex-file check for the image before "Je m’appelle Alex" in the note "Voyage en Guyane _ Mes Conseils et Mon Itinéraire Idéal de 15 jours"

The attachment should be here:

No problem. Debug.zip