Normalize encoding of source files

Question

josesimoes opened this issue 8 years ago · comments

It seems that the files in the repo don't have a consistent encoding.
This causes problems when submitting PRs, for example, because GitHub reports differences where there are no real ones, only characters that differ (or have no counterpart) from one code page to another.
I suggest this gets normalized and that a note about it be added to the coding standards.

Steve Maillet · Answer 1 · Wed Aug 10 2016 00:15:31 GMT+0800 (China Standard Time)

This will definitely be a part of the next release, which is taking a "clean slate" approach (see the newly created orphan branch). I'm a bit uneasy with the idea of changing encodings etc. in the current branches though. The last time we tried something like that, normalizing line endings in the Llilum project, all hell broke loose.

José Simões · Answer 2 · Wed Aug 10 2016 00:26:49 GMT+0800 (China Standard Time)

@smaillet-ms fine with me. Just saying.
You were the one who pointed out this situation yesterday.
I was the one to blame because my editor, for some reason, chose to save the file in UTF-8 encoding.

When going through the code everything looks OK until you submit the PR and see the diff in GitHub. Then one needs to go and change the encoding of that file. It's not efficient and can be rather tedious...

Steve Maillet · Answer 3 · Wed Aug 10 2016 00:37:36 GMT+0800 (China Standard Time)

Yes, not really blaming you. You got victimized here. Since I want to treat the next version as a clean slate, bringing code over piece by piece and making it conform to newer designs and style guides, the encoding can be standardized at that point fairly readily. Manually editing all the files just for this seems a bit tedious and not very helpful in the end.

As to what encoding we should use, it's going to depend on the file type, as not all language tools understand the various encodings (e.g. I can't remember whether C++ officially acknowledges the encoding of source files or leaves that as an implementation detail, in which case some compilers might not recognize a UTF BOM and would fail to compile the code...).

Jan Kučera · Answer 4 · Wed Aug 10 2016 12:11:09 GMT+0800 (China Standard Time)

Well, in the end the standard does not really matter, does it? (In the sense that if any of the tools does not follow it, having it mentioned in the standard does not help fix the problem.)

I don't know either; a quick search brings up e.g. this opinion that it's an implementation detail, although most "modern" C++ compilers can do it. I am more worried about the embedded third-party tools.

Also, it should probably be mentioned that supporting UTF and recognizing a BOM are two different things. UTF-8 without a BOM still constitutes valid data in all 8-bit code pages, but not vice versa.
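The asymmetry can be checked directly. The following is a minimal Python sketch, not something posted in the thread. Latin-1 stands in here for "an 8-bit code page" because it assigns a character to every byte value; some code pages, Windows-1252 among them, leave a few byte values undefined.

```python
import codecs

text = "© 2016"

# UTF-8 without BOM: the bytes still decode under an 8-bit code page,
# but the non-ASCII character comes out as mojibake.
utf8_bytes = text.encode("utf-8")
as_latin1 = utf8_bytes.decode("latin-1")

# The reverse does not hold: a lone 0xA9 byte is not valid UTF-8.
cp1252_bytes = text.encode("cp1252")
try:
    cp1252_bytes.decode("utf-8")
    reverse_ok = True
except UnicodeDecodeError:
    reverse_ok = False

# A BOM is just three extra bytes in front of otherwise ordinary UTF-8 data.
with_bom = codecs.BOM_UTF8 + utf8_bytes
has_bom = with_bom.startswith(codecs.BOM_UTF8)
```

A tool can therefore accept UTF-8 content without understanding the BOM, and a tool that does not expect a BOM may treat those three leading bytes as part of the source.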

I think the problem only gets worse as people have different default code pages. It is impossible to detect which one was intended, and I can't imagine people would consistently save (and open) files in the selected encoding, especially in Visual Studio.

So I would think the order of preference should be:

1. UTF-8 with BOM (probably quite radical); if not possible
2. UTF-8 without BOM (as good as any codepage); if not possible
3. avoiding non-ASCII content

The current state, with © in the files, relies on implementation details of the tools anyway.
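As a rough illustration of the first preference, here is a minimal Python sketch of re-encoding a file's bytes as UTF-8 with BOM. The function name `to_utf8_with_bom` and the `cp1252` fallback are assumptions made for this example. As noted above, the intended code page cannot be detected, so the fallback is only a guess that someone has to confirm per file.

```python
import codecs

def to_utf8_with_bom(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return the content of `raw` re-encoded as UTF-8 with a leading BOM."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw  # already in the preferred form
    try:
        # Pure ASCII and BOM-less UTF-8 both take this path.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the guessed legacy code page.
        # A wrong guess usually decodes without error, so review the result.
        text = raw.decode(fallback)
    return codecs.BOM_UTF8 + text.encode("utf-8")
```

Applied to a file, this could look like `path.write_bytes(to_utf8_with_bom(path.read_bytes()))` with a `pathlib.Path`. Working on bytes avoids any newline translation, so line endings are left untouched.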

Jan Kučera · Answer 5 · Wed Aug 10 2016 12:37:59 GMT+0800 (China Standard Time)

Btw, a quick check of the cpp, c, h and asm files in the repo:

Files with BOM: 0
A9 ©: 6

Application\MicroBooter\MicroBooter.cpp
DeviceCode\pal\OpenSSL\OpenSSL_1_0_0\crypto\rand\rand_win.cpp
ProjectTemplates\MicroBooter\MicroBooterExt.cpp
DeviceCode\include\MicroBooter_decl.h
DeviceCode\include\TinyCLR_Endian.h
ProjectTemplates\MicroBooter\MicroBooter.h

B5 µ: 4

DeviceCode\Drivers\BlockStorage\SD\SD_BL_driver.cpp
DeviceCode\Drivers\BlockStorage\SD\SD_BL_driver.cpp
DeviceCode\Drivers\BlockStorage\SD\SD_BL_driver.cpp
DeviceCode\pal\OpenSSL\OpenSSL_1_0_0\crypto\bn\asm\x86_64-gcc.c

F6 ö: 2

DeviceCode\pal\OpenSSL\OpenSSL_1_0_0\crypto\x509v3\v3_pci.cpp
DeviceCode\pal\OpenSSL\OpenSSL_1_0_0\crypto\x509v3\v3_pcia.cpp

92 ʼ: 1

CLR\Tools\Include\CorError.h

EC μ: 1 (not 1252)

DeviceCode\Drivers\Ethernet\enc28j60\enc28j60.cpp

In addition to that,

Solutions\MCBSTM32F400\MicroBooter\MicroBooter.h and DeviceCode\pal\OpenSSL\OpenSSL_1_0_0\crypto\rand\rand_win.cpp are already UTF-8 without BOM, and no tool seems to have been bothered by that.
DeviceCode\Drivers\Display\TextFonts\Font8x15\font8x15.cpp contains like every byte from 0x80 to 0xDE
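A check like this can be reproduced with a short script. The following Python sketch is one possible approach, not the method actually used for the numbers above; the names `survey_bytes` and `survey_tree` are made up for the example, and the suffix list simply mirrors the file types mentioned.

```python
import codecs
from collections import Counter
from pathlib import Path

# File types covered by the check above.
SOURCE_SUFFIXES = {".cpp", ".c", ".h", ".asm"}

def survey_bytes(raw: bytes):
    """Return (has_bom, Counter of non-ASCII byte values) for one file's bytes."""
    has_bom = raw.startswith(codecs.BOM_UTF8)
    body = raw[len(codecs.BOM_UTF8):] if has_bom else raw
    return has_bom, Counter(b for b in body if b >= 0x80)

def survey_tree(root: str):
    """Yield (path, has_bom, counts) for source files with a BOM or non-ASCII bytes."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in SOURCE_SUFFIXES:
            has_bom, counts = survey_bytes(path.read_bytes())
            if has_bom or counts:
                yield path, has_bom, counts
```

Iterating over `survey_tree(".")` from the repository root and printing each byte value as two hex digits gives a listing comparable to the one above. Counting raw bytes rather than decoded characters is deliberate, since the point is that the encoding of each file is not known in advance.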