microsoft / language-server-protocol

Defines a common protocol for language servers.

Home page: https://microsoft.github.io/language-server-protocol/


Change character units from UTF-16 code unit to Unicode codepoint

MaskRay opened this issue

Text document offsets are based on a UTF-16 string representation. This is odd, given that text contents are transmitted in UTF-8.

Text Documents
… The offsets are based on a UTF-16 string representation.

Here in TextDocumentContentChangeEvent, range is specified in UTF-16 column offsets while text is transmitted in UTF-8.

interface TextDocumentContentChangeEvent {
	range?: Range;
	rangeLength?: number;
	text: string;
}

Is it more reasonable to unify these, remove UTF-16 from the wording, and use UTF-8 as the sole encoding? Line/character can be measured in units of Unicode codepoints instead of UTF-16 code units.
Lines are rarely very long, so the extra computation to find the Nth Unicode codepoint would not place too much of a burden on editors and language servers.
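
To make the difference concrete (an illustrative example added here, not part of the original issue; TypeScript/Node is used only for convenience): consider a line containing one character outside the Basic Multilingual Plane. The character value before the final letter differs under each counting scheme.

// Offsets before the "b" in the line "a𐐀b" (U+0061, U+10400, U+0062).
const line = "a\u{10400}b";
console.log(line.indexOf("b"));                            // 3 UTF-16 code units
console.log([...line].indexOf("b"));                       // 2 Unicode codepoints
console.log(new TextEncoder().encode(line).indexOf(0x62)); // 5 UTF-8 bytes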

jacobdufault/cquery#57

Survey: counting method of Position.character offsets supported by language servers/clients
https://docs.google.com/spreadsheets/d/168jSz68po0R09lO0xFK4OmDsQukLzSPCXqB6-728PXQ/edit#gid=0

I would suggest going even one step further. Why should editors and servers need to know which bytes form a Unicode codepoint? Right now the specification states it supports only UTF-8 encoding, but with the Content-Type header I guess there is an idea of supporting other encodings in the future too. It would then be even better to count bytes instead of UTF-16 code units or Unicode codepoints.

@MaskRay we need to distinguish this from the encoding used to transfer the JSON-RPC message. We currently use UTF-8 here, but as the header indicates this can be changed to any encoding, assuming the encoding is supported by all libraries (for example, Node by default only supports a limited set of encodings).

The column offset in a document assumes that, after the JSON-RPC message has been decoded, the document content is stored as a UTF-16 string. We chose UTF-16 encoding here since most languages store strings in memory as UTF-16, not UTF-8. To save one encoding pass we could transfer the JSON-RPC message in UTF-16 instead, which is easy to support.

If we want to support UTF-8 for the internal text document representation and line offsets, this would either be a breaking change or need to be a capability the client announces.

Regarding byte offsets: there was another discussion about whether the protocol should be offset based. However, the protocol was designed to support tools and their UI; for example, a reference match in a file could not be rendered in a list using byte offsets. The client would need to read the content of the file and convert the offset into line / column. We decided to let the server do this since the server has very likely read the file before anyway.

We chose UTF-16 encoding here since most languages store strings in memory as UTF-16, not UTF-8.

Source? Isn't the only reason for this that Java/JavaScript/C# use UTF-16 as their string representation? I'd say there is a good case to be made that (in hindsight) UTF-16 was a poor choice of string type in those languages as well, which makes it dubious to optimize for that case. The source code itself is usually UTF-8 (or just ASCII), and as has been said this is also the case when transferring over JSON-RPC, so I'd say the case is pretty strong for assuming UTF-8 instead of UTF-16.

We chose UTF-16 encoding here since most languages store strings in memory as UTF-16, not UTF-8. To save one encoding pass we could transfer the JSON-RPC message in UTF-16 instead, which is easy to support.

Citation needed? ;)

Of the 7 downstream language completers we support in ycmd:

  • 1 uses byte offsets (libclang)
  • 6 use unicode code points (gocode, tern, tsserver*, jedi, racer, omnisharp*)
  • 0 use utf 16 code units

* full disclosure, I think these use code points, else we have a bug!

The last is a bit of a fib, because we're integrating the Language Server API for Java.

However, as we receive byte offsets from the client and internally use Unicode code points, we have to re-encode the file as UTF-16, do a bunch of hackery to count the code units, then send the file, encoded as UTF-8, over to the language server, with offsets in UTF-16 code units.

Of the client implementations of ycmd (there are about 8 I think), all of them are able to provide line-byte offsets. I don't know for certain about all of them, but certainly the main one (Vim) is not able to provide UTF-16 code units; they would have to be calculated.

Anyway, the point is that it might not be as simple as originally thought :D Though I appreciate that a specification is such, and changing it would be breaking. Just my 2p

Not that SO is particularly reliable, but it happens to support my point, so I'm shamelessly going to quote from: https://stackoverflow.com/questions/30775689/python-length-of-unicode-string-confusion

You have 5 codepoints. One of those codepoints is outside of the Basic Multilingual Plane which means the UTF-16 encoding for those codepoints has to use two code units for the character.

In other words, the client is relying on an implementation detail, and is doing something wrong. They should be counting codepoints, not codeunits. There are several platforms where this happens quite regularly; Python 2 UCS2 builds are one such, but Java developers often forget about the difference, as do Windows APIs.

Emacs uses some extended UTF-8 and its functions return numbers in units of Unicode codepoints.

https://github.com/emacs-lsp/lsp-mode/blob/master/lsp-methods.el#L657

@vibhavp for Emacs lsp-mode internal representation

I am sorry in advance if I am telling something stupid right now. I have a question to you guys.

My thought process is: if there is a file in an encoding other than any UTF, and we use an encoding other than UTF in JSON-RPC (which could happen in the future), then why would the client and server need to know what Unicode is at all?

Of the client implementations of ycmd (there are about 8 I think), all of them are able to provide line-byte offsets.

Exactly. It is easy to provide line-byte offsets. So why would it be better to use Unicode codepoints instead of bytes?

Let's say, for example, we have a file encoded in ISO-8859-1 and we use the same encoding for the JSON-RPC communication. There is a character ä (0xE4) that can be represented in at least two ways in Unicode: U+00E4 (ä) or U+0061 (a) followed by U+0308 (combining diaeresis). The former is one Unicode codepoint, the latter is two, and both are equally correct. If the client uses one and the server the other, we have a problem. Simply using line-byte offsets here would avoid these problems.
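
To illustrate that ambiguity (a small sketch added for clarity, not part of the thread; TypeScript is used only for convenience):

// The visually identical "ä" can be one or two codepoints depending on normalization.
const precomposed = "\u00E4";  // U+00E4 (NFC form): 1 codepoint
const decomposed = "a\u0308";  // U+0061 + U+0308 (NFD form): 2 codepoints
console.log([...precomposed].length);  // 1
console.log([...decomposed].length);   // 2
console.log(precomposed.normalize("NFC") === decomposed.normalize("NFC"));  // true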

@dbaeumer I think we misunderstood each other, or at least I did. I didn't mean using a byte offset from the beginning of the file, which would require the client to convert it, but still using a {line, column} pair, just with the column counted in bytes instead of UTF-16 code units or Unicode codepoints.


We chose UTF-16 encoding here since most languages store strings in memory as UTF-16, not UTF-8.

If we want to support UTF-8 for the internal text document representation and line offsets, this would either be a breaking change or need to be a capability the client announces.

Are you serious? UTF-16 is one of the worst choices of the old days, made for lack of alternatives. Now we have UTF-8, and to choose UTF-16 you need a really good reason rather than a very brave assumption about the implementation details of every piece of software in the world, especially if we consider future software.

This assumption holds mainly on Microsoft platforms, which will never consider UTF-8. I think some bias toward Microsoft is unavoidable as the leadership of this project is from Microsoft, but this is too much. It reminds me of the embrace, extend, and extinguish strategy. If that is the case, it is reason enough for me to boycott LSP, because we are going to see this kind of Microsoft-ish nonsense decision making forever.

Just to be clear, I don't work for Microsoft, and generally haven't been a big fan of them (being a Linux user myself). But I feel compelled to defend the LSP / vscode team here. I really don't think there's a big conspiracy theory here. From where I stand, it looks to me like Vscode and LSP teams are doing their very best to be inclusive and open.

The UTF-8 vs UTF-16 choice may seem like a big and important point to some, but to others, including myself, the choice probably seems somewhat arbitrary. For decisions like these, it is natural to write into the spec something that conforms to your current prototype implementation, and I think that is perfectly reasonable.

Some may think that is a mistake. As this is an open spec and subject to change / revision / discussion, everyone is free to voice their opinion and argue what choice is right and whether it should be changed... but I think such discussions should stick to technical arguments; there's no need to resort to insinuations of a Microsoft conspiracy (moreover, these insinuations are really unwarranted here, in my opinion).


I apologize for bringing my political views into my comment. I was over-sensitive due to traumatic memories of Microsoft in the old days. Now I see this spec is in progress and subject to change.

I didn't mention technical reasons because they are mainly repetitions of other people's opinions or well known. Anyway, I list my technical reasons here.

  • IMO, UTF-8 is the present and future, and UTF-16 is a legacy to avoid. The reason is here.
  • By requiring a dependency on UTF-16, LSP effectively forces implementations to involve that legacy.
  • Simplicity is better than extra complexity and dependencies. One encoding everywhere is better.
  • More complexity and dependencies increase the amount of implementation work a lot.
  • AFAIK, converting indices between different Unicode encodings is very expensive.
  • LSP is a new protocol. There is no reason to involve a bad legacy. The only benefit here is a potential gain for specific platforms with native UTF-16 strings.
  • For now, the only reason to require UTF-16 is to give such a benefit to specific implementations.
  • Other platforms won't be very happy due to the increased complexity and potential performance penalty in their implementations.
  • Such an unfair benefit is likely to split the community.

... or needs to be a capability the client announces

I think this is fine: an optional field which designates the encoding mode of indices alongside the index numbers. If the encoding mode is set to utf-8, interpret the numbers as UTF-8 code units; if it is utf-16, interpret them as UTF-16 code units. If the field is missing, fall back to UTF-16 for legacy compatibility.

This is causing us some implementation difficulty in clangd, which needs to interop with external indexes.
Using UTF-16 on network protocols is rare, so requiring indexes to provide UTF-16 column numbers is a major yak-shave and breaks abstractions.

Yup, same problem here working on reproto/reproto#34.

This would be straight forward if "Line/character can be measured in units of Unicode codepoints" as stated in the original description.

As mentioned in one of my first comments, this needs to be backwards compatible if introduced. An idea would be:

  • the client announces which position encodings it supports.
  • the server picks an encoding to use.

If no common encoding can be found, the server will not function with that client. So in the end such a change will force clients to support the union of the commonly used encodings. Given this, I am actually not sure the LSP server ecosystem will profit from such a change (a server using an encoding not widely adopted by clients is of limited use from an ecosystem perspective). On the other hand, we only have a limited number of clients compared to a large number of servers, so it might not be too difficult for clients to do the adoption.

I would appreciate a PR for this that for example does the following:


What about using byte indices directly? Using codepoints still requires going through every single character.

@jclc using byte indices is not a bad idea, but I want to outline the implications of such a choice:

Either servers or clients need to communicate which encoding ranges are sent in, and one of them needs to adapt to the other's requirements. Since clients are less numerous, it would seem the more economical choice for this responsibility to fall on them.
In order to be backwards compatible, the exchange has to be bi-directional. All servers have to support UTF-16 and fall back to it when the client indicates that this is their only capability, at least until a new major revision of the LSP has been widely adopted and the old one deprecated.

Using codepoints still requires going through every single character.

This depends a bit on the language, but rows are generally unambiguous. They can be stored in such a way that we don't have to decode all characters up until that row (e.g. when using a specialized rope structure). With this approach we only have to decode the content of the addressed rows. Some transcoding work will happen unless the internal encoding of both server and client matches.

Edit: The reason I have a preference for codepoints over bytes is that they are inherently unambiguous. All languages dealing with Unicode must have ways of traversing strings and relating the number of codepoints to indexes, regardless of which specific encodings are well supported.
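
As a rough sketch of the row-addressing point above (illustrative only, not tied to any implementation mentioned in the thread): if the document is indexed by line starts once, converting a position only requires decoding the single addressed line.

// Index line starts once; later, only decode the addressed line.
function lineStartOffsets(docUtf8: Uint8Array): number[] {
	const starts = [0];
	for (let i = 0; i < docUtf8.length; i++) {
		if (docUtf8[i] === 0x0a /* '\n' */) starts.push(i + 1);
	}
	return starts;
}

function lineText(docUtf8: Uint8Array, starts: number[], line: number): string {
	const end = line + 1 < starts.length ? starts[line + 1] : docUtf8.length;
	// Only this slice is decoded; earlier lines are never touched.
	return new TextDecoder("utf-8").decode(docUtf8.subarray(starts[line], end));
}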


I think all of these problems arise from the lack of a precise definition of "character" in the LSP spec. The term "character" is used everywhere in the spec, but it is not actually well defined on its own.

Anyway, the LSP 3 spec defines "character offset" in terms of UTF-16 code units, which means it implicitly defines the term "character" as a UTF-16 code unit as well. This is (1) nonsense, as a code unit is not intended to be a character, and (2) inconsistent with the other parts of the protocol, which are UTF-8 based.

In my opinion, the first thing we have to do is define the term "character" precisely, or replace it with something else. The lack of a precise definition of "character" increases ambiguity and the potential for bugs.


As far as I know, Unicode defines three concepts worth considering for addressing text:

  • Code Unit
  • Code Point
  • Grapheme Cluster

The closest concept to a human-perceived "character" is the grapheme cluster, as it counts glyphs rather than code values.

As @udoprog pointed out, the transcoding cost is negligible, so accept the cost and choose the logically ideal option: grapheme cluster counting. This is better than code points and less ambiguous in my opinion.

Furthermore, grapheme cluster counts are very likely already tracked by code editors to provide precise line/column (or character offset) information to end users, so tracking them wouldn't be a problem.

There would be two distinct position/offset counting modes: (1) legacy-compatible and (2) grapheme-cluster counting.

  • legacy-compatible mode is the same as today: it defines "character" as a UTF-16 code unit.
  • grapheme-cluster-counting mode defines "character" as a grapheme cluster and uses the count of grapheme clusters as the position offset.

In LSP 3, servers should support both the legacy-compatible (deprecated but default) and grapheme-cluster-counting modes.
In LSP 4, grapheme-cluster counting would be the only counting method.


If grapheme cluster counting is unacceptable, UTF-8 code unit counting (i.e. encoded byte count) can be considered instead. The character offset becomes an irregular index, but it would be consistent with the rest of the spec.

@eonil Regarding grapheme clusters:

The exact composition of clusters is permitted to vary across (human) languages and locales (Tailored grapheme clusters). They naturally vary from one revision to another of the unicode spec as new clusters are added. Finally, iterating over grapheme clusters is not commonly found in standard libraries in my experience.


@udoprog I see. If grapheme clusters are unstable across implementations, I think they should not be used.

Sorry to say but this is some kind of WCHAR legacy at full force.

From my point of view, using grapheme clusters is much better than relying on any specific encoding, even if clusters are not very stable:

We are talking about source code; use of "advanced" or "unstable" grapheme clusters is rare there.

Conversion from UTF-16 code unit indices to grapheme clusters has to be done by editors to report messages anyway. It requires having the source code loaded and processing it to compute the user-visible position of the grapheme cluster. It complicates client code.

Some compilers already have basic support for reporting grapheme clusters rather than any kind of representation-specific indices.

Grapheme clusters are impractical. They depend on the Unicode version, and I can imagine things would get very confusing when the client and server support different Unicode versions (in fact, the behaviour of grapheme clusters did change in Unicode 12.0, the most recent version at the time I wrote this comment). Additionally, chances are most server implementations would get lazy and simply not bother to support grapheme clusters, as most programming languages don't make it easy, and not supporting them won't cause issues in 99.999% of cases.


Concerning those LSP implementers who are unhappy with the situation:

The most practical solution is to vote with your feet and send document offsets in the UTF-8 representation.

That would be a direct violation of the protocol unless both servers and clients implement custom ways to negotiate UTF-8.


That would be a direct violation of the protocol

Yes, that's basically the point. :-)

I think it's completely unrealistic to believe that any LSP implementation will get easier from having to support both encodings, implement encoding negotiation, and support UTF-16 offsets until the end of time because of some odd editor that can't be bothered.

The LSP implementers complaining here can resolve this issue by migrating their implementation to codepoints and declaring that they won't support the legacy UTF-16 offsets.

It's either this, status quo, or the worst option: supporting both.

We definitely need a way to negotiate UTF-8. For vim-lsp the performance was really bad. Benchmarks are in the PR prabirshrestha/vim-lsp#284

Vim is one of the most popular editors, but vimscript is also one of the slowest languages in the world.

That would be a direct violation of the protocol

Yes, that's basically the point. :-)

That's also disregarding that most servers only test against vscode. That's going to be a problem for every client that is not vscode.

For vim-lsp the performance was really bad.

Ycmd didn't measure the performance impact of counting UTF-16 offsets. We just went with it. Though I doubt it's as drastic in python as it is in vimscript.


@bstaletic I appreciate your concern. To make clear where I am coming from:

I found this issue because I was investigating how to write an LSP implementation for a language in the coming months.
As I'm doing this on my own time, my budget for accommodating legacy cruft is roughly zero.

Therefore, my definitive statement on this matter: My implementation will use codepoints, and will support neither UTF-16 codeunits nor any kind of encoding negotiation; and I invite everyone who is interested in resolving the current situation to do the same.

@soc I understand where you are coming from, but I really doubt that you are going to get anywhere with this kind of obstructionist approach.

For example, as a server author I really don't much care what you do. As long as our server works with vscode, Eclipse, Atom and maybe soon IntelliJ... we are happy. These are the LSP clients that we care about... pretty much. They implement the standard (or at least they try to :-). And you are making it hard for our servers to work correctly with your client. If you think that this way you can force the issue... you are wrong. Your client is not on the list that we care about. It does not affect us. And on the off chance that somebody actually raises a bug with us to say our server doesn't work properly with your client... guess what... we will just point the finger right back at you and move on with supporting the clients we actually care about.

@soc I've commented on the theme of forking before, there's some discussion on other issues related to it here.


@kdvolder Thanks for your kind words. I have to disagree with the characterization of the approach as "obstructionist" though; I would consider it a results-driven approach.

One of 4 things will happen:

  1. Nothing will change; people are unhappy.
  2. Negotiation will be introduced, forcing developers to implement support for both Unicode codepoints and UTF-16 code units. Lots of implementation complexity, lots of finger-pointing and arguing whose job it is to implement. People will be even more unhappy.
  3. LSP implementers migrate to codepoints on their own. Problem gets resolved within weeks.
  4. After a long discussion, everyone agrees to switch to codepoints. Problem gets resolved in years.

Due to my limited time, I'm forced to pick number 3. I could wait for number 4, but that would incur some delays which aren't strictly necessary.

From my point of view, number 3 is the best approach to resolve this issue, especially as clangd and rls are already considering this too.

If there are other approaches I may have missed, I would be happy to learn about them. Thanks!

It is obvious that in a perfect world option 4 would be best, but apparently it is not an option that will ever happen.

This issue is over a year old. Since the day it was created, the number of open issues on this project has gone from ~70 to 154 today. The idea of a universal protocol is great on paper, but it looks like the execution of this idea was done without universal thought. Instead it appears that the driving force is ease of implementation for Microsoft tools, and since they are happy enough with the protocol, its development has slowed down.

Option 2 is IMO the worst. It is better to do nothing than to introduce negotiation and in the end need to support both UTF-8 and UTF-16.

And so, in the imperfect world we live in, it looks like the best option we have is option 3 (or forking), and if enough people follow, it will work out better for everyone.

Therefore, my definitive statement on this matter: My implementation will use UTF-8, and will support neither UTF-16 nor any kind of encoding negotiation; and I invite everyone who is interested in resolving the current situation to do the same.

I suspect then, that you'll just have a server that nobody uses. Or one that occasionally breaks in unfortunate ways. Client implementers (like me) are not going to write to a non-conforming server implementation, for the same reasons: we're doing this in our own time, and we don't have the time to write and test server-specific code for a technically broken implementation. We just won't support it, or our users will just have a bad experience and blame us.

Having a specification that is clear, albeit not ideal, is better than having two competing specifications. I agree that UTF-16 is unfortunate, and that in ycmd we had to write a bunch of fiddly code to support it and a bunch of fiddly tests to test it. But at least we only have to do that once. That's the real power of LSP (with the caveat that most implementations are somewhat nonstandard and the protocol itself includes the requirement for server-specific knowledge in the client: commands).

  1. LSP implementers migrate to UTF-8 on their own. Problem gets resolved within weeks.

It sounds a tad optimistic to assume that within weeks... all existing clients and servers will adopt UTF-8. Especially considering that it goes against the standard. Maybe it helps you get on with things, so I can understand you might just do that (and hey, it probably doesn't matter unless the user starts typing their code with some bizarre Unicode characters rather than typical plain ASCII), but it hardly 'resolves' the issue, does it now.

@puremourning Many clients use UTF-8 or codepoints already. Most people don't notice because astral chars are uncommon in source code, and ranges are only incorrect when an astral char is in the line you're using.

It would be interesting to survey known clients and servers to see what they are actually using.
Edit: I'm making a survey at lsp-range-unit-survey

You said "uncommon", I said "occasionally". I think they are equivalent.

However you interpret them, the result is still that you get bad user experience when it happens. The user doesn't care that their code contains "uncommon" symbols, just that their experience with the product was bad.

Moreover, we have the test cases and bug reports that prove that "uncommon" is not the same as "never".

@puremourning Absolutely, an "occasional"/uncommon issue is still a problem. That's why I think this issue should be resolved in relatively short order. I emphasized "uncommon" to imply that changing units is not a very bad breaking change compared to the situation I have observed in the implementations I've used (~4 UTF-16, ~4 UTF-8, ~2 codepoints). Hence I am making a survey to know definitively where we currently stand with compliance.


You said "uncommon", I said "occasionally". I think they are equivalent. However you interpret them, the result is still that you get bad user experience when it happens.

This is already the case: as @Avi-D-coder mentioned, more than half of the implementations he checked ignore the spec in this regard.

Having a specification that is clear, albeit not ideal, is better than having 2 competing specifications.

I think then the solution that makes everyone happy is clear: Update the specification!

@soc My little count is anecdotal and from memory. That's why I made github.com/Avi-D-coder/lsp-range-unit-survey. Please help by sending PRs.

My implementation will use UTF-8, and will support neither UTF-16 nor any kind of encoding negotiation; and I invite everyone who is interested in resolving the current situation to do the same.

I've done the same in my language server ccls: it only implements UTF-8. This is not a big issue in practice because people rarely use non-ASCII characters in C/C++ code. When they do (in string literals, which doesn't affect characters in other lines, and nearly never in identifiers), it is not a problem: the existing Emacs/Vim language clients support UTF-8.

@Avi-D-coder thanks for doing the survey. It definitely helps us make a more informed decision.

Like others, I am not a fan of making this negotiable on both ends since it doesn't help in any way. I am neither a fan of simply starting to break things. We tried hard so far to avoid any breakage in the protocol.

If we really come to the conclusion that an additional format is necessary, then the only reasonable way forward for me would be the following:

  • we come to a conclusion on which other formats/encodings should be supported. I am absolutely in favor of adding only one.
  • there are a lot more servers than clients (https://microsoft.github.io/language-server-protocol/implementors/servers/), so we only allow servers to pick an encoding / format. Clients need to support ALL encodings / formats. This limits the implementation effort.
  • we vote by PR and not by feet :-). This means that people pushing for this help client implementations to support the additional format (e.g. by providing corresponding PRs).
  • as soon as the additional encoding / format is supported in the clients, we add a capability to the protocol that tells servers that they can choose between the two formats.

Actually I am not so sure about that approach anymore. It would force the client to open the document to do the conversion even if it is not presented in the editor. The server usually has already read the content of the files for which it reports results.

I'm about to land support for UTF-8 in clangd.

clangd is UTF-8 internally and abides by the protocol by transcoding. However, many clients only support UTF-8, and we want to work with them.

We've got a backwards-compatible protocol extension for negotiating encoding:
https://clangd.github.io/extensions.html?#utf-8-offsets

For clients/servers that only support one encoding, this is very simple to implement: just drop in a single static property on ClientCapabilities/InitializeResponse.
I'd suggest clients/servers that care about this problem also implement this extension.

clangd will also support a -offset-encoding=utf-8 flag as a user-accessible workaround for clients that only support UTF-8 and don't implement this extension.
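
For illustration, the negotiation roughly takes this shape (the offsetEncoding field name is taken from the linked extension page; treat the exact structures below as an assumption rather than spec text):

// Client advertises the encodings it can produce, in preference order.
interface ClientCapabilitiesWithOffsetEncoding {
	offsetEncoding?: string[];  // e.g. ["utf-8"]
	// ...standard ClientCapabilities fields...
}

// Server answers with the single encoding chosen for the session.
interface InitializeResultWithOffsetEncoding {
	offsetEncoding?: string;    // e.g. "utf-8"; absent means the default UTF-16
	// ...standard InitializeResult fields...
}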

@dbaeumer I'm happy to send a pull request for a protocol change if that seems useful to you. I've implemented this in a server and will likely also add it to clients. I'm not likely to send a PR to the nodejs client/server though (others are of course free to do so).

EDIT: the clangd implementation for reference: https://reviews.llvm.org/D58275
(This is nontrivial because clangd will support both utf-16 and utf-8)

@sam-mccall thanks for offering your help here. But what we need (if we want to do this at all) is support for a different encoding in clients. Updating the protocol spec is trivial in comparison :-)

As far as ycmd (a client) is concerned, we're in the same boat as clangd. We do everything in UTF-8 and then have some piece of code to calculate UTF-16 offsets, so going back to UTF-8 would be easy.

Perhaps having the extension in the protocol specification would help clients adopt it.

@dbaeumer Agreed we need implementations, though specifying this may encourage them as @bstaletic says.

Server implementations of multi-encoding support might be just as valuable as client ones: if servers support multiple encodings, we get concrete interop wins (with utf-8 only clients) by having the clients blindly request UTF-8 (which is a trivial change).

While servers outnumber clients, they are also often written in fast (or fast-ish) languages with good library access for transcoding, vs clients that are often written in slower languages with limited libraries.


@dbaeumer

we come to a conclusion which other formats/encoding should be supported. I am in absolute favor to only add one.
So we only allow servers to pick an encoding / format. Clients need to support ALL encodings / formats. This limits the implementation effort.

Isn't that pretty much the worst case scenario detailed above?

I am neither a fan of simply starting to break things. We tried hard so far to avoid any breakage in the protocol.

Me too, if I can avoid it. So what exactly is preventing us from keeping LSP 3 as-is (UTF-16 offsets) and releasing LSP 4 with, for example, UTF-8 offsets?

Compared to the approach of implementing negotiation machinery and forcing servers to support both, releasing an updated spec version means that there is a definite EOL for UTF-16 offsets, instead of having to support them forever:

As soon as server implementers feel that the clients they care about all implement LSP 4, they can drop whatever workarounds they have for UTF-16 and move on, without being weighed down by legacy baggage in the protocol or the implementation.

Isn't the version of the protocol arbitrary? There's no version identifier in the init exchanges.

As far as ycmd (a client) is concerned, we're in the same boat as clangd. We do everything in UTF-8 and then have some piece of code to calculate UTF-16 offsets, so going back to UTF-8 would be easy.

Perhaps having the extension in the protocol specification would help clients adopt it.

https://github.com/Valloric/ycmd/blob/master/ycmd/completers/language_server/language_server_protocol.py#L496-L532 is the code.


I am neither a fan of simply starting to break things. We tried hard so far to avoid any breakage in the protocol.

While I agree with you that not sending the protocol version during initialization is a major oversight, the comment regarding LSP 3 vs. LSP 4 is less about mechanical protocol negotiation, and more about client devs declaring "we upgraded to LSP 4" and server devs changing their implementation accordingly.

It's largely a matter of having something like "LSP 4" as a short marker of the change instead of "we changed the encoding of the values inside some nested structure", especially when it comes to communicating the fix to users, who might inquire about the status of the fix in the client they are using.

I would like to point out that there is a difference between a byte and a UTF-8 code unit. Whenever you talk about UTF-8, I am not sure which one you mean.

If a source file were encoded in ISO-8859-1, for example, would all of those LSP implementations using "UTF-8" actually convert it to UTF-8 and use UTF-8 code units, or would they use raw bytes? For example, the letter ä (0xE4) encoded in ISO-8859-1 would be one byte and one codepoint but two UTF-8 code units.

@Avi-D-coder Maybe it would be a good idea to distinguish between the two in the survey?

That's a great point. Like I said, ycmd does all internal work in UTF-8 and only converts offsets to/from UTF-16 when talking to an LSP server. That means an ISO-8859-1 encoded file will result in a UnicodeDecodeError exception and ycmd will stop working.


I'm referring to UTF-8 codepoints. The advantage compared to counting bytes is that, except with characters above the BMP, it doesn't break existing users.

We know this, because according to the survey half the implementations we know about already do this. The world hasn't ended.

Let's move the other half over and be done.

There is no such thing as a UTF-8 codepoint. There are Unicode codepoints, UTF-16 code units (currently LSP uses these), UTF-8 code units, and bytes. Unless I am mistaken, that is; I am not a member of the Unicode standards committee.

And since the survey says UTF-8, it is not at all clear, at least to me, whether it means UTF-8 code units or bytes.

I have a separate category for Unicode codepoints. I was assuming people understood UTF-8 to mean UTF-8 code units.
RLS, for instance, uses codepoints (Rust's char). It may be a good idea to clarify this.


@szatanjl A code unit in UTF-8 is 8 bits; counting code units is equivalent to counting bytes.
A codepoint is 21 bits; only its encoding differs between transfer formats.

You are correct that "UTF-8 codepoints" doesn't make much sense because a "UTF-8" codepoint == "UTF-16" codepoint == ...

Unicode codepoints are what I failed to express properly. They have the benefit that queries to the language server return the same values as before.

@szatanjl A code unit in UTF-8 is 8 bits; counting code units is equivalent to counting bytes.

Maybe I didn't make myself clear. By "bytes" I meant bytes in the encoded file, not bytes of the file in memory that at this point might be converted to UTF-8.

@soc And with the above in mind, you are mistaken. A UTF-8 code unit is equivalent to a byte if and only if the file is encoded in UTF-8. If the file is encoded in ISO-8859-1, for example, then counting bytes and counting UTF-8 code units are not the same. See my comment above for an example.


@szatanjl The offset values refer to data sent as UTF-8, and that data is sent as UTF-8 regardless of the encoding of the original file.

@soc The offset values refer to the document, which doesn't have to be UTF-8 encoded.

Citation from specification, section "Text Documents" > "Position":

> interface Position {
> ...
>	/**
>	 * Character offset on a line in a document (zero-based). Assuming that the line is
>	 * represented as a string, the `character` value represents the gap between the
>	 * `character` and `character + 1`.
> ...

It states "Character offset on a line in a document". Not in a sent data.


@szatanjl Sorry, I got confused. You are correct.

Do we have any understanding whether or how a switch from UTF-16 codeunits to Unicode codepoints would impact other legacy encodings such as ISO-8859-1?

Do we have any understanding whether or how a switch from UTF-16 codeunits to Unicode codepoints would impact other legacy encodings such as ISO-8859-1?

@soc As far as I know, it would depend on the clients handling the text. All the servers I know of re-encode internally.

The biggest problem with the current UTF-16 code unit requirement is that many clients and editors can't or won't conform. Mandating UTF-16 is equivalent to requiring the client to keep an extra copy of the text just to handle astral chars. Most non-vscode-based clients that I know of will never do this. While there are certainly more servers than clients, servers are in a better position to handle complexity. Clients want to be as thin as possible.

I will survey the clients today.

At this point I will not be opening any more issues for the survey. If people want to add more data points, open an issue or ask a question on an LSP repo using the template in all those issues, and send the survey a PR.
I will continue to update the survey as results come in.


If the LSP spec were fully based on UTF-8, everything could be far simpler. Therefore, there must be a clear and strong benefit to introducing UTF-16 to justify the extra abstraction and implementation cost.

The only known benefit of involving UTF-16 is eliminating an extra offset transformation in some systems based on UTF-16. I still don't understand why these UTF-16-based systems deserve such a subtle extra optimization that pulls unnecessary extra dependencies into the protocol. Isn't it best to keep such dependencies local to each machine? Why do we need to make the protocol far more complex to benefit some platforms? LSP is already mostly based on UTF-8, and involving UTF-16 requires dealing with far more details such as endianness, BOM, UCS-2, surrogate pairs, etc. Why does everyone have to pay this cost for those special platforms?

In my opinion, if grapheme clusters are not acceptable, the next best option would be something UTF-8 based. I really can't find any reason to deal with UTF-16 at the protocol level.

And encoding negotiation would make the situation worse, because every implementation has to implement an extra abstraction/transformation layer or accept less compatibility.

Using Unicode code points may be a good starting point for all clients and servers. Grapheme clusters are much better, but... they are a little bit more complicated. In this case it is not important which encoding the document uses. It can be UTF-8, KOI8-R or CP1251; the first one uses 1/2/3-byte sequences to represent the characters of the last two.

Most client/server developers don't want to support the full complexity of character encodings because it provides no value in their eyes. That is fine as long as ASCII source code is sufficient. Outside the ASCII character set, the computation is more complicated and, most importantly for developers, adding such support requires rewriting a lot of code. Each developer can choose their own way, but the LSP specification should be useful for processing documents in any encoding, and thus needs to use some encoding-independent way of addressing the positions of user-visible characters.

Here's my attempt to summarize the technical issues and alternatives.
I'd encourage people to add the offsetEncoding extension to their client/server, especially if it uses a fixed encoding that's not UTF-16 (in which case it's trivial).

Servers get open file content via LSP, and non-open files (e.g. imported) from disk. So servers always need to be aware of on-disk file encodings, in order to communicate about offsets consistently.

Everyone's using Unicode; the encodings used internally and on disk vary. Unicode codepoints are the common representation (everything else is one hop away).

The alternatives we've discussed:

  • byte offset in file (regardless of encoding) - I think this basically doesn't work: we don't have this info for content sent over LSP. Also in practice, this doesn't fit well with the APIs available to editor plugins (clients) and parser libraries (servers).
  • codepoint (i.e. UTF-32 code unit) - neutral option that ensures just two conversions in the worst case (one on client and one on server). Easy to understand and reasonably consistent with the LSP protocol. Fewer illegal cases to consider (e.g. splitting surrogate pairs).
  • UTF-16 code units - status quo. Easy for UTF-16-native clients/servers, very hard for others. Worst case is 4 conversions needed (client-native -> codepoints -> utf-16 -> codepoints -> server-native). Inconsistent with rest of protocol, which uses unicode and UTF-8.
  • UTF-8 code units (bytes) - in the abstract similar to UTF-16: easy for some clients/servers, hard for others. More common interchange encoding, consistent with the rest of LSP.
  • grapheme clusters - Compatibility issues across unicode versions. Hard to implement without libraries. No illegal cases to consider.
  • dynamic negotiation - Allows correctness when one side is multi-encoding aware and the other side doesn't support UTF-16. (Most commonly, UTF-8 only clients or servers). Improves performance when client/server share a native representation that is not the standard UTF-16.

Based on this I'm coming around to the idea that codepoints (i.e. "utf-32") might be a sensible compromise. UTF-16-native clients are numerous (JS, Java and C# are everywhere) and dealing with UTF-8 is almost as annoying for them as UTF-16 is for UTF-8-native clients. Counting codepoints is pretty easy in both representations.
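
As a sketch of that last claim (added for illustration; the helper names are made up), counting codepoints needs only a linear scan in either native representation:

// UTF-16-native string (e.g. JS/Java): count units that are not low surrogates.
function codepointsUpToUtf16(s: string, utf16End: number): number {
	let n = 0;
	for (let i = 0; i < utf16End; i++) {
		const c = s.charCodeAt(i);
		if (c < 0xdc00 || c > 0xdfff) n++;  // skip the second half of surrogate pairs
	}
	return n;
}

// UTF-8-native buffer: count bytes that are not continuation bytes (0b10xxxxxx).
function codepointsUpToUtf8(bytes: Uint8Array, byteEnd: number): number {
	let n = 0;
	for (let i = 0; i < byteEnd; i++) {
		if ((bytes[i] & 0xc0) !== 0x80) n++;
	}
	return n;
}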

I would encourage the use and eventual standardization of the negotiation extension to help get out of the current mess.


@sam-mccall I'm not really seeing how introducing negotiation and forcing implementers to support multiple encoding variations can be considered getting out of this mess.

From my point of view, this is making the mess even bigger than just giving up and accepting status quo.

forcing implementers to support multiple encoding variations ... is making the mess even bigger

Thanks for raising that, I agree!

I'm not proposing anyone be required to support multiple encodings, instead:

  • the proposal allows implementations that only support one encoding to trivially specify that in their capabilities request/response
  • it allows implementers that want to support multiple encodings to choose the appropriate one
  • it doesn't change the current half-working behavior when the client and server use different encodings, but allows either side (or a viewer of the logs) to detect that scenario.

Isn't the end result

lots of finger-pointing and arguing whose job it is to implement

exactly as predicted in #376 (comment)?


The collective time people here spent discussing whose-job-it-is and complicating things with encoding negotiation is probably already close to the time it would have taken to simply fix all clients and servers.

Let's not turn this into some multi-year design-by-concerned-middle-management project, please.

We have already a pretty good list of clients and servers, so let's get this done.

As @dbaeumer said, "we vote by PR and not by feet":

  • I offer to create a PR against the LSP spec and against one or two random LSP implementations to migrate them from whatever $randomThing they are doing to Unicode codepoints.
  • If everyone chips in to at least notify implementers of the fix, I'm sure we can largely be done by next week.

If everyone chips in to at least notify implementers of the fix, I'm sure we can largely be done by next week.

I completely disagree. I think you are hugely underestimating and trivialising the work being created here for compliant implementations.

@puremourning By changing the spec from characters to UTF-16 code units, a lot of work was already created.
At this point I am not at all confident that compliant implementations are the majority.

Implementation counts:

  • UTF-8: 11
  • UTF-16: 10
  • Codepoints: 6
  • grapheme clusters: 0
    note: Multiple implementations in the same repo or derived from a shared dependency are counted once.
    Several compliant implementers would prefer UTF-8.

Given the data so far, code points seem like a decent compromise.

  • Only astral chars break.
  • Codepoints are available in just about every language.
  • They are basically made for the purpose of bridging encodings.
  • Many people originally interpreted "characters" from the spec as codepoints.

OK, but the quoted comment claims that we conforming implementations would change within a week, which is a baseless claim; like most people here we do this in our spare time, and we don't abide by random, baseless deadlines set so that others can avoid implementing the specification.

@puremourning A week is extremely optimistic, but this issue has been open for a year, and the longer it takes to switch, the more implementations switch to UTF-16. At least two are in the process right now.

If the spec is going to be reverted, it should be reverted now.
If the survey data is representative, the spec is broken, not the majority of implementations that use something else. Thus changing to codepoints should not be considered a breaking change; it should be considered fixing an erroneous edit to the spec, or a clarification. The word "character" is not synonymous with UTF-16 code unit. If the majority of implementations are still non-conformant after a year, what does that say about how people originally interpreted the meaning of "character"?

With that being said, I don't know that the majority is non-conformant. It could be that the sample is not representative. I picked it based on stars and ease of inquiry.

Thanks all for the lively discussion. I read through the new posts. I have a couple of comments and clarifications:

Regarding versioning

LSP has per-feature version support (this is why we don't send a global version number). If we add new capabilities to the protocol, they are guarded by a client capability and/or a server capability. If we were to add another encoding, this would be done as follows in LSP:

  • we add a client capability encoding
  • we add a server capability acceptedEncoding.

Assuming we have a client that only supports UTF-8, encoding would be set to utf-8. If the server accepts it, it signals this through acceptedEncoding. If acceptedEncoding is unset, then the client knows it is a standard UTF-16 server that it can't talk to.
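
A minimal sketch of what that could look like (purely illustrative; the encoding / acceptedEncoding names come from the comment above and are not part of the spec):

interface ClientCapabilitiesSketch {
	// Encoding the client uses for Position.character, e.g. "utf-8".
	// Absent means the standard UTF-16 behavior.
	encoding?: string;
}

interface ServerCapabilitiesSketch {
	// Set by the server if it accepts the client's requested encoding.
	// If unset, the client knows it is talking to a standard UTF-16 server.
	acceptedEncoding?: string;
}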

Regarding protocol using UTF-8 & UTF-16

The only UTF-8-specific part of the protocol is how the JSON structure is encoded to bytes when sent over the wire. This encoding is customizable in principle (see Content-Type in the header). Adding another one would be trivial since it is comparable to converting files from disk into strings using an encoding. Almost all programming languages I know of support various encodings. The reason we chose UTF-8 is simply transfer size. IMO it has no extra cost on either the client or the server.

Things are completely different with positions, since they denote a character in a string. The reason we chose UTF-16 when starting with LSP was that the programming languages and editors we looked at and used were using UTF-16 to represent strings internally. Things have changed since then, and new programming languages usually represent strings internally using UTF-8.

@sam-mccall Thanks for this great encoding summary

I am also not sure anymore that it is a smart idea to ask clients to support n encodings. The reason is that to actually convert positions from one encoding to another, the content of that line must be available. For a find-all-references result, that for example means opening all files, reading them into memory, doing the position conversion, and forgetting them again. This might have a bad performance impact, especially when files come from a remote location.

I now agree with @sam-mccall that doing the conversion on the server, although there are more servers than clients, is smarter for the following reasons:

  • as @sam-mccall mentioned they are also often written in fast (or fast-ish) languages.
  • servers usually have the file content in question already in memory.

I also tried to look at this from a different angle. Instead of focusing on the clients and servers, it might be better to focus on the programming languages used to implement them. These usually determine how strings are encoded in memory and how they are indexed (if indexable at all). I came up with the following table so far:

Language: Encoding
JavaScript: UTF-16
TypeScript: UTF-16
.NET (C#): UTF-16
Java: UTF-16
C/C++: byte (UTF-8, UTF-16)
Go: byte (UTF-8)
Python: byte (own format)
Rust: UTF-8 (no indexing)
Ruby: UTF-8 & UTF-16
Lisp: unknown
Haxe: platform-dependent
vimscript: UTF-8

So maybe an approach would be the following: instead of helping clients support an additional encoding besides UTF-16, the LSP community invests in libraries for the common programming languages that do the position conversion into another encoding. Then many servers (or even clients) could simply reuse these libraries.
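
To give an idea of what such a helper library could provide (a hypothetical sketch, not an existing library; the reverse direction is analogous): converting a UTF-16 code-unit column, as used by LSP today, into a UTF-8 byte column for the same line.

function utf16ColToUtf8Col(lineText: string, utf16Col: number): number {
	let bytes = 0;
	for (let i = 0; i < utf16Col; ) {
		const cp = lineText.codePointAt(i)!;
		// A codepoint occupies 1-4 bytes in UTF-8 ...
		bytes += cp <= 0x7f ? 1 : cp <= 0x7ff ? 2 : cp <= 0xffff ? 3 : 4;
		// ... and 1 or 2 code units in UTF-16.
		i += cp > 0xffff ? 2 : 1;
	}
	return bytes;
}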


@dbaeumer I'm not sure I understand the exact intention of the approach:

do the position conversion into another encoding

Could you clarify what conversions you have in mind here?

LSP community invests into libraries [...] Then many servers (or even clients) could simply reuse these libraries.

Isn't such functionality (at least if I understood you correctly) something that can usually be accomplished by a single call to a method that most likely already exists in most standard libraries?

Weeks is not a realistic estimate even if all fixes landed today. For our server, the next release is in 6 months and I wouldn't expect ~all users to be on it for 2 years.

@dbaeumer The protocol you suggest looks OK to me, if we are assuming clients will advertise support for only a single encoding, and therefore ~all servers should implement multi-encoding support. In this case I think servers should be strongly encouraged to support UTF-32 as well as 16 and 8, as some clients will need this. Happy to write a patch to update clangd's negotiation once there's a PR for spec text.

clangd has a decent implementation of the length conversions for a UTF-8-native implementation, in case anyone wants to port them to other languages.
measureUnits and lspLength are the key functions.


Weeks is not a realistic estimate even if all fixes landed today. For our server, the next release is in 6 months and I wouldn't expect ~all users to be on it for 2 years.

How is this an issue?

It will not cause trouble for anyone; otherwise we would already have that trouble today, because almost 60% of the implementations don't follow the spec as we speak. If the world hasn't ended yet, fixing this mess will not end it either.

How is this an issue? It will not cause trouble to anyone.

Your proposal is IIUC to simply change the spec to say unicode codepoints instead of UTF-16 code units, and start fixing clients/servers.

Clients/servers that are spec-compliant (UTF-16) work together today. e.g. clangd 8 and vscode N. If you change the spec and commit fixes, and those are released in clangd 9 and vscode N+1, then e.g. vscode N+1 won't work properly with clangd 8. This situation will persist for years.

I understand some clients/servers are broken today, but your proposal will break some that work today.
(I'm going to leave it at that, because I don't think there's any prospect a change without back-compat will be accepted for the spec)


This situation will persist for years.

The situation has already persisted for years. It's fine. Literally nothing happened. It's such a non-story that even most developers of clients and servers have not realized that something was wrong.

but your proposal will break some that work today

Unless you are trying to argue that every (non-)broken server implementation only happened to be used by a client implementation that was (non-)broken in exactly the same way, by pure chance or magic, it's absolutely clear that clients and servers that implemented the spec differently have already been used together for years, largely without anyone even realizing it.

I agree with @sam-mccall that we can't simply change the spec. This is very unfriendly to everyone who adhered to it until now and something I really try to avoid.

As I also outlined, we should try to avoid clients needing to do position transformations to their internal string representation, since this requires having the content of the file loaded into memory. This is why I tend to agree that servers should support more than one encoding.

IMO to move this forward we need to do the following:

  • agree that putting the burden onto the server is the right thing to do. If we do
  • agree which encodings a server should support
  • provide corresponding helper libraries for the programming languages commonly used in servers. @soc converting the content is trivial and most languages have libraries for this, but converting an index into the string is non-trivial, as @sam-mccall showed with his code.
  • define how a client can pick an encoding and add this to the protocol in a non-breaking way.
  • fix the servers.

@dbaeumer

agree that putting the burden onto the server is the right thing to do. If we do

I think we can all support this.

agree which encodings a server should support

While I would prefer a single codepoint API, all editors must deal with emoji, so they all have codepoint indexing capabilities built in.
If multiple encodings are going to be supported, UTF-8, UTF-16 and codepoints should be available. Grapheme clusters are not presently used by any known implementation and are not Unicode-stable, so they should be excluded.


@Avi-D-coder I have trouble understanding how expecting servers to provide and maintain two additional implementations in addition to the existing one is an improvement over the status quo.

agree that putting the burden onto the server is the right thing to do. If we do

I think we can all support this.

No! You don't speak for all of us. As a server author I really don't want to be dealing with trying to support multiple encodings. For crying out loud, just please pick one already! UTF-16 works fine for us. But if you really must, then change it to UTF-8 or whatever... do it. But please... just don't make server implementers support an array of different encodings.

I agree with @sam-mccall that we can't simply change the spec.

I think we can. It's a choice we can make.

This is very unfriendly to everyone who adhered to it until now and something I really try to avoid.

Right, I would sort of agree. On the other hand making us support multiple encodings isn't exactly 'friendly' either.

Personally, our language servers are implemented in Java, so I think they 'accidentally' adhere to the spec, and the status quo works fine for us. But I can honestly say that I'd rather deal with changing our language servers to support UTF-8 (or whatever is chosen) than support UTF-8, Unicode codepoints and UTF-16 all at once. One encoding is really enough.

@soc I don't expect most servers will maintain multiple formats. I am for Codepoints being the official and only format. UTF-16 should be deprecated and phased out slowly on the clients that support it.

However, if UTF-16 is not removed, I fail to see why UTF-8 implementers would choose to sacrifice performance and convenience in return for no compatibility guarantee. I see supporting all three units as recognizing the broken state of the spec in this luckily rather insignificant aspect.

If you take a look at who conforms to UTF-16 and who won't conform, it goes mostly along the traditional line: "enterprise" vs "modern" languages and editors.
Editors/clients almost all have the ability to use codepoints. Codepoints are the clear compromise, but the compromise seems to have been rejected, so let Darwin figure it out. Without the incentive of compatibility with all servers, I am not confident many UTF-8 devs will switch.

As for codepoints I still have hope people will slowly adopt it, if it's an option.

@kdvolder fair enough, but putting the burden onto the server does not necessarily mean multiple competing formats; I would consider it a general principle. Clients are often written in slow or crippled languages like vimscript or elisp, and in one case even some sh. If the spec demands a heavy client, the spec will not be followed (this is already happening).

it goes mostly along the traditional line: "enterprise" vs "modern" languages and editors.

I really didn't understand this.

Clients are often in slow or crippled languages like vimScript or elisp

OK... easy! Many servers are written in old crippled languages like JavaScript too, but I don't really see why that is relevant to conformance (or otherwise) with a specification.

But seriously, this has nothing to do with the age or crippledness of a language implementation, other than precisely what @dbaeumer already said: a number of implementations internally use UTF-16 code units for the "string length" operation and for "string indexing".

The language-du-jour is likely not the bottleneck for the performance issue, which, as has been accurately reported, is that to convert offsets you need the line of the file in a known encoding. That is an I/O operation, which is more likely to be the performance-determining issue than any reasonable runtime language, even an interpreted one (or an old and crippled one, if you prefer to throw around such terms).

For what it's worth, I don't even personally subscribe to the argument that servers are more likely to have the file in memory. If the client is sending a request for a URI, it probably has that file in memory. It probably sent the contents to the server in the first place. If not, it's probably just about to open the file to do something with the offset anyway. But that's neither here nor there in my opinion.

Aaaand finally, I do agree that just making a decision (rather than decision by committee) is probably better. I think any option that doesn't involve multiple encodings is a good enough option, i.e. either:

  • Do nothing and call any currently non-conforming implementation broken
  • Change the spec (per the PR), apologise to the good citizens who implemented the spec, and move on.

Both options will require some servers in some languages or runtimes to have to do conversion, but at least we have a clear and unambiguous specification (like we do now), and a likelihood that it will be followed (like we don't now, allegedly).

That would mean, for example, that to render a find-all-references result the client has to open all files, read them into memory, do the position conversion and forget them again. This might have a bad performance impact, especially when files come from a remote location.

@dbaeumer But isn't it necessary to open the files anyways for "find all references"? What does a client do with the result? Typically you need to display some context, so at least the line that contains the symbol. So I don't see how "open up files unnecessarily" is an issue.

I say keep it as is, it's done. Though if it is changed then make it code points.

the protocol was design to support tools and their UI

Which is why I don't really understand the benefit of bringing encodings into this at all. It just increases complexity of the protocol and favours one language over another.

Change it to another encoding and it benefits client/server implementation language X at the expense of client/server implementation language Y. Language Y devs and CPU have to do more work.

Change it to many different encodings and it benefits client language X at the expense of server implementation language X, Y, Z. All server devs and CPU have to do more work.

So if it must change, keep it fair -- code points -- so that everyone suffers equally ;)

@haferburg Most LSP clients are implemented against an extension API (this is even the case for the VS Code LSP client), so the conversion usually happens before the data actually reaches the editor / tool. This extension API normally only surfaces a position API tailored to the string representation used in the editor. So if the editor internally uses UTF-16, the API is UTF-16 based, and even if the editor later on opens the file the LSP client still has to do the same conversion. Things might be different if we can convince editor API owners to support multiple encodings (which I doubt will happen).
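
As a rough sketch of what such a client-side shim would have to do if the protocol offered code point offsets while the editor API stayed UTF-16 based (the function is hypothetical, not part of any existing client library, and it assumes the client has the line text at hand):

```typescript
// Convert a character offset measured in Unicode code points (a hypothetical
// protocol unit) into the UTF-16 code unit offset that a VS Code-style
// extension API expects.
function codePointColumnToUtf16(lineText: string, codePointColumn: number): number {
  let utf16Column = 0;
  let seen = 0;
  for (const ch of lineText) {       // one iteration per code point
    if (seen === codePointColumn) {
      break;
    }
    utf16Column += ch.length;        // 1 for BMP characters, 2 for surrogate pairs
    seen++;
  }
  return utf16Column;
}
```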


@dbaeumer I think the crucial point is not that people don't know how to do it, but that they don't want to.

Which is why I don't really understand the benefit of bringing encodings into this at all. It just increases complexity of the protocol and favours one language over another.

Except that there are good reasons to pick UTF-8 over UTF-16:

  • UTF-8 is the most popular Unicode encoding;
  • the misconception that UTF-16 is a fixed-length encoding. Lots of software using that encoding doesn't properly support Unicode because of that assumption. I wouldn't be surprised if VS Code were one of them. UTF-8 is well known to be a variable-length encoding;
  • UTF-8 is endianness independent. By the way, the specs don't mention which endianness is used for UTF-16: little or big endian?
  • UTF-8 is taking less space than UTF-16 in this context since code is mostly written in ASCII;
  • UTF-8 is already used to transmit the data.

If we agree on changing the encoding to UTF-8 in the spec then (and only then) we should discuss the offset to use. There are two reasonable choices with that encoding (as opposed to three with UTF-16; another reason to use UTF-8):

  • code point offset;
  • byte offset (or code unit offset if you prefer).

I am in favor of using byte offsets because they directly represent the index of the encoded string while a Unicode representation is needed for code point offsets.
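
To illustrate that difference (the line content is invented, and `TextEncoder`/`TextDecoder` are the standard Web/Node APIs):

```typescript
const encoder = new TextEncoder();
const decoder = new TextDecoder();
const lineBytes = encoder.encode("naïve 🚀 code");   // the UTF-8 bytes of one line

// A byte offset indexes straight into the encoded data — no decoding needed.
const tail = lineBytes.subarray(7);                  // byte 7 is where "🚀" starts

// A code point offset first requires decoding the bytes and walking the text.
function byteOffsetOfCodePoint(codePointOffset: number): number {
  const text = decoder.decode(lineBytes);
  let byteOffset = 0;
  let seen = 0;
  for (const ch of text) {                           // one iteration per code point
    if (seen === codePointOffset) break;
    byteOffset += encoder.encode(ch).length;
    seen++;
  }
  return byteOffset;
}
```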

@soc Hmm, but I don't see how this is feasible since different programming languages use different encodings. So someone needs to convert (see #376 (comment))

@micbou Some comments regarding your post:

UTF-8 is the most popular Unicode encoding;

I agree when it comes to storing text in files but not representing text in memory in programming languages (see #376 (comment))

the misconception that UTF-16 is a fixed-length encoding

VS Code does handle surrogate pairs correctly :-)

UTF-8 is endianness independent.

whether the programming language uses LE or BE to store the string in UTF-16 in memory has no impact on the position information (character index). This is why it is not mentioned in the spec.

UTF-8 is taking less space

Agree in regards to space. But programming languages usually come with one fixed internal representation. I disagree with the statement that code is mostly written in ASCII. Especially if we take Asia into account.

UTF-8 is already used to transmit the data

Yes, and this is for size reasons. As I tried to explain here #376 (comment), these are two orthogonal issues. It is like how Java can read the content of a UTF-8-encoded file into memory although its internal string representation is UTF-16.

UTF-8 is the most popular Unicode encoding;

I agree when it comes to storing text in files but not representing text in memory in programming languages (see #376 (comment))

Except that comment is biased towards "JavaScript-like" (TypeScript and JavaScript) and "Java-like" (C# and Java) languages. Also, calling C/C++ "UTF-8/UTF-16" is wrong, because they are both completely encoding-agnostic and any non-ASCII encoding needs to be handled by a library or hand-written code, not to mention that C doesn't really have a string type.

I'm sure the bias was just a result of familiarity with different languages, but every language I have actually had a chance to work with used bytes (I'm not counting Java, because I have barely touched that language). Also, the list is missing Vimscript, which, I believe, has at least 4 clients written in it, and which uses UTF-8.

UTF-8 is taking less space

Agree in regards to space. But programming languages usually come with one fixed internal representation. I disagree with the statement that code is mostly written in ASCII. Especially if we take Asia into account.

If ASCII were not the vast majority, using LSP would have been a complete mess because, according to the above survey, if you pick a server and a client at random, chances are that they won't be "talking" about the same encoding offsets. And yet people are largely unaware of the mess we actually have today.

Not only are users of clients and servers unaware of this mess, but some client/server implementers are also unaware.

If ASCII were not the vast majority, using LSP would have been a complete mess

I think that's really a bit of an exaggeration. Even if you did use some esoteric characters in your code, you will most likely not experience a complete breakdown of the tooling. Instead, more or less the worst thing that happens is that things like the positions of error markers will be off by a few characters occasionally. The tools will, for the most part, be perfectly usable.


dbaeumer: Hmm, but I don't see how this is feasible since different programming languages use different encodings. So someone needs to convert.

This is correct, but converting to UTF-16 code units is not going to happen.

@dbaeumer

I agree when it comes to storing text in files but not representing text in memory in programming languages (see #376 (comment))

The issue is that you are considering languages that chose UTF-16 because people thought that 16 bits would be enough to store a Unicode code point (Java and JavaScript), languages that are/were targeting a platform using UTF-16 (C#), and extensions of a language using UTF-16 (TypeScript). As @bstaletic said, you can't consider that C and C++ are using UTF-8, UTF-16, or any other encoding. In the Ruby case, according to this article, UTF-8 is more popular than other encodings supported by the language, in particular UTF-16. If we look at what recent languages are doing, we see that they tend to pick UTF-8 (e.g. Go and Rust) or UTF-32 (e.g. Python 3). Anyway, I don't think any of this is relevant to the discussion. We are talking about the encoding to use in a protocol, not the best way to represent internally a string in a programming language.

VS Code does handle surrogate pairs correctly :-)

I am not convinced when I see issues like microsoft/vscode#62286.

whether the programming language uses LE or BE to store the string in UTF-16 in memory has no impact on the position information (character index). This is why it is not mentioned in the spec.

Sure but that's only because the encoding used to transfer the data is not consistent with the offset one.

But programming languages usually come with one fixed internal representation.

That's not the case for recent languages like Go and Rust. More importantly, language developers chose a fixed internal representation like UTF-32 (or UTF-16 back when it was still enough to store a Unicode character) to efficiently do operations like computing the length of a string or going through a string character by character, without realizing that a character is not necessarily a single Unicode code point, which makes the optimization worthless (shout-out to the Python developers).
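
A small illustration of that point, i.e. that what a user perceives as one character can span several code points regardless of the encoding chosen (plain TypeScript):

```typescript
const flag = "🇩🇪";              // one visible character: two regional indicator code points
const accented = "e\u0301";      // one visible character: "e" + combining acute accent

console.log([...flag].length);      // 2 code points
console.log(flag.length);           // 4 UTF-16 code units
console.log([...accented].length);  // 2 code points
console.log(accented.length);       // 2 UTF-16 code units
```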

I disagree with the statement that code is mostly written in ASCII. Especially if we take Asia into account.

I'd be interested to see the code base of a popular piece of software (Asian or not) with more than 50% non-ASCII characters.

Yes, and this is for size reasons.

But why is UTF-8 better than UTF-16 in that regard? Because code is mostly written in ASCII.

As I tried to explain here #376 (comment) these are two orthogonal issues. It is like Java can read the content of a file in memory that is encoded in UTF-8 although its internal string representation is UTF-16.

They are not orthogonal issues. Text is still text whether it's stored in a file or in memory (and the internal string representation of a programming language is just the data stored in memory). Using two different encodings for the same data is inconsistent.

Rather than writing more, I would just leave this good article on why UTF-16 should be abandoned everywhere, even in the Windows world: http://utf8everywhere.org/

Even if you did use some esoteric characters in your code, you will most likely not experience a complete breakdown of the tooling. Instead, more or less the worst thing that happens is that things like the positions of error markers will be off by a few characters occasionally.

With TextDocumentSyncKind.Incremental you can end up with the server having the wrong idea of the content, and then reporting completely inaccurate diagnostics rather than just putting them at the wrong position.

With TextEdit you can insert incorrect content.
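
To make that concrete, here is a small made-up example of a single edit going wrong when client and server disagree on the unit (the document content and the edit are invented for illustration):

```typescript
// The client's line is "const 😀 = 1;" and the user replaces the literal "1" with "2".
// Counting in UTF-16 code units (the emoji occupies two), "1" spans characters 11..12,
// so that is the range the client sends in the change event.
//
// A server that interprets `character` as a code point index lands one position
// too far to the right and replaces the ";" instead:
const serverLine = [..."const 😀 = 1;"];   // server-side line, indexed by code point
serverLine.splice(11, 1, "2");             // applies the client's UTF-16-based range
console.log(serverLine.join(""));          // "const 😀 = 12" — the ";" is gone, "1" survives
```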

@natebosch

You are of course right, in theory it is possible. But I think you'd be hard pressed to come up with a real example where you can make the tooling go completely off the rails this way. There is kind of a 'limit' on how wrong the positions can be, as the 'errors' reset at the beginning of every new line. They don't accumulate throughout the file.

Anyhow... clearly this is a real issue and needs to be settled/specced properly somehow, but I hardly think it's as big a deal as the size of this thread would make one believe.

This thread is hard to read. Big yak. So perhaps we need a new thread to discuss how and/or which implementations support UTF-8. The 'why' has already been done; it is the only format that works for everyone. Looks like clangd already supports UTF-8 LSP. I suggest a list. Don't ask me to do it, I'm just here to give advice.

PS. BTW, this is hilarious, UTF-16 in this day and age. LOL! Too true to be funny! Like @micbou said, UTF-32 would make sense, but this!? I literally burst out laughing when I found out! Zombie land!

Totally agree. Make a list of various LSP servers and clients, split into those supporting UTF-8 and those that don't. Something like what I did for true-color support in various terminal emulators: https://github.com/termstandard/colors