martanne / vis

A vi-like editor based on Plan 9's structural regular expressions

Byte order marker (BOM) is displayed as empty cell

njhanley opened this issue

The byte order mark (BOM) is a zero width no-break space character (ZWNBSP, U+FEFF) placed at the start of a file to indicate the byte order of a UTF-16 or UTF-32 encoding. It serves no such purpose in UTF-8, but it is legal there and is occasionally used as a signature to indicate UTF-8 encoding.
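For readers who want to see the bytes involved, here is a minimal Python sketch of the encodings described above. The helper name `has_utf8_bom` is made up for illustration and is not part of vis.

```python
import codecs

# U+FEFF encoded in UTF-8 is the three-byte sequence EF BB BF.
BOM = "\ufeff"
assert BOM.encode("utf-8") == codecs.BOM_UTF8 == b"\xef\xbb\xbf"


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 signature."""
    return data.startswith(codecs.BOM_UTF8)


# In UTF-16 the same code point reveals the byte order of the file.
print(BOM.encode("utf-16-be").hex())  # feff
print(BOM.encode("utf-16-le").hex())  # fffe

print(has_utf8_bom(b"\xef\xbb\xbfHello"))  # True
print(has_utf8_bom(b"Hello"))              # False
```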

Consider this file: bom.txt
When opened in vis, the BOM is visible as a blank cell when it should be invisible. Interestingly, ZWNBSP is correctly displayed (or rather not displayed) when part of the rest of the file.

https://unicode.org/faq/utf_bom.html#BOM

With reference to https://github.com/martanne/vis/wiki/FAQ#how-should-i-edit-files-in-legacy-encodings I would suggest WONTFIX here. vis (in contrast to vim) doesn’t go into the business of dealing with encodings (or CRLF vs. LF); it is just a plain text editor. If anybody wants to get rid of the BOM, there are ways to do it. Also, if you are dealing with text files originating from that platform, you may well know that dos2unix removes the BOM as well.

Yes, a BOM in UTF-8 is an abomination of lesser platforms (so-called “operating systems”), which punish everybody else for their unfortunate decision to use a double-byte encoding for text. UTF-8 doesn’t need a BOM, but that whole business should be kept outside of vis in my opinion.

The issue isn't that vis should interpret or remove BOMs; it's that a ZWNBSP at the start of a file (a BOM) is currently rendered differently from a ZWNBSP elsewhere in the file. See zwnbsp.txt. The ZWNBSP between 'H' and 'e' is correctly rendered as invisible.

Cannot reproduce here, with vis v0.8-git +curses +lua +tre +acl +selinux I get

[screenshot: screenshot-2023-05-06_22-05-1683406371]

That was the point. If you open bom.txt, vis consumes the cursor and the window renders incorrectly. In zwnbsp.txt the same bytes are present between 'H' and 'e', but vis correctly renders them as invisible and it doesn't affect the rest of the UI. You will have to use something like od to see the bytes, e.g. od -t x1 bom.txt
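If od is not at hand, a rough Python equivalent can show the same thing. This is only a sketch: the sample strings below are assumed stand-ins for the attached bom.txt and zwnbsp.txt, whose exact contents are not reproduced in this thread, and `hex_bytes` is a made-up helper.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated hex pairs, similar to `od -t x1`."""
    return " ".join(f"{b:02x}" for b in data)


# Assumed sample contents: the same U+FEFF at the start vs. in the middle.
start = "\ufeffHello\n".encode("utf-8")
middle = "H\ufeffello\n".encode("utf-8")

print(hex_bytes(start))   # ef bb bf 48 65 6c 6c 6f 0a
print(hex_bytes(middle))  # 48 ef bb bf 65 6c 6c 6f 0a
```

In both cases the three bytes ef bb bf are identical; only their position in the file differs.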

I have noticed this problem before but usually I just press x and delete the character if the file has it at the start because I really don't care about the file being compatible with where it came from.

The same behavior can be seen with other zero width characters such as zero-width space (ZWSP) and word joiner (WJ).

zwsp-start.txt vs zwsp-middle.txt
wj-start.txt vs wj-middle.txt
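As a quick cross-check of what these three characters have in common, the following Python sketch prints their Unicode names, general categories, and UTF-8 bytes using the standard unicodedata module. It says nothing about how vis itself classifies them.

```python
import unicodedata

# The three zero-width characters mentioned in this thread.
ZERO_WIDTH = ["\ufeff", "\u200b", "\u2060"]

for ch in ZERO_WIDTH:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),    # "Cf" means a format character
        ch.encode("utf-8").hex(" "),  # bytes.hex(sep) needs Python 3.8+
    )
# U+FEFF ZERO WIDTH NO-BREAK SPACE Cf ef bb bf
# U+200B ZERO WIDTH SPACE Cf e2 80 8b
# U+2060 WORD JOINER Cf e2 81 a0
```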

I still believe that the principle matters: all shenanigans with incorrectly encoded files (and yes, a file with a BOM is an incorrectly encoded one) should stay outside of vis and by definition are NOT a vis problem.

I agree with the principle, but I also don't like that the UI gets garbled by files like bom.txt. I suspect that it's a one- or two-line fix to stop that from happening. If such a patch is presented, I would see no issue with including it.

Sure, if it is so, then I guess, “SHOW ME THE PATCH!”. Also, what should happen with the content of the file? Should the BOM be just hidden but left untouched in the file, or should it really be eliminated?

Leave it untouched like what happens when the bytes appear in the middle of the file.

I'll look into it later if I have time, but I suspect what is happening is that vis is decrementing the index of where the next character is supposed to be drawn by one cell too many when it's the first character in the line. Then everything is off by one for the rest of the window.
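To make the expected behavior concrete, here is a toy Python model of assigning characters to screen columns. It is not vis's C rendering code and does not reproduce the suspected bug; the `layout` function and the `ZERO_WIDTH` set are made up for illustration, and every visible character is assumed to be exactly one cell wide.

```python
# Hypothetical set of characters that should occupy no screen cell.
ZERO_WIDTH = {"\ufeff", "\u200b", "\u2060"}


def layout(line: str) -> list[tuple[int, str]]:
    """Assign each visible character a column; zero-width ones take none."""
    cells = []
    col = 0
    for ch in line:
        if ch in ZERO_WIDTH:
            continue  # no cell consumed, whether at column 0 or later
        cells.append((col, ch))
        col += 1
    return cells


# A zero-width character at the start and in the middle of a line
# should leave the visible characters in the same columns.
print(layout("\ufeffHi"))  # [(0, 'H'), (1, 'i')]
print(layout("H\ufeffi"))  # [(0, 'H'), (1, 'i')]
```

The file contents stay untouched in this model; only the column assignment skips the zero-width characters.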