clangd / clangd

clangd language server

Home Page: https://clangd.llvm.org

UTF-8 mode

sam-mccall opened this issue

clangd works natively in UTF-8, and many editors also work in UTF-8.

LSP expresses character ranges in UTF-16 code units (microsoft/language-server-protocol#376), which forces server and client into pointless complexity and busywork converting back and forth.
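To give a sense of the conversion busywork: because the two unit counts diverge on any non-ASCII character, every position has to be translated by walking the line. A minimal sketch (TypeScript, with a hypothetical helper name) of mapping an LSP UTF-16 column to a byte offset in a UTF-8 buffer:

```typescript
// Sketch: translate an LSP UTF-16 column into a byte offset into the
// UTF-8 encoding of `lineText`. Hypothetical helper, not clangd's code.
function utf16ColToUtf8Offset(lineText: string, utf16Col: number): number {
  let bytes = 0;
  let units = 0;
  for (const ch of lineText) {            // iterates codepoint by codepoint
    if (units >= utf16Col) break;
    const cp = ch.codePointAt(0)!;
    units += cp > 0xffff ? 2 : 1;         // astral-plane chars take 2 UTF-16 units
    bytes += cp <= 0x7f ? 1 : cp <= 0x7ff ? 2 : cp <= 0xffff ? 3 : 4;
  }
  return bytes;
}
```

And the reverse walk is needed for every position the server sends back.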

This needs to be opt-in. It could be a -utf-8 flag or some capabilities-based negotiation.

I would even go as far as pushing for LSP client/server tools to disregard the spec concerning UTF-16 and just use byte offsets (or codepoints); many implementations are adding needless complexity to compute UTF-16 coordinates when neither the client nor the server uses UTF-16 in the first place.

Using codepoints would break VSCode and UTF-16 editors only on lines containing astral-plane characters before the given coordinate, which should be relatively rare (I would expect those to appear only in comments, which in most programming languages extend to the end of the line).
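A small illustration of how the three conventions diverge (any astral-plane character behaves the same way):

```typescript
const line = "😀x";  // U+1F600 GRINNING FACE, then 'x'
// Position of 'x' under each convention:
//   UTF-8 bytes:    4  (the emoji encodes as 4 bytes)
//   UTF-16 units:   2  (the emoji is a surrogate pair)
//   codepoints:     1
// Codepoint and UTF-16 counts differ only after the emoji.
```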

Using byte offsets would be the simplest, but would break more use cases.

The spec uses UTF-16 only because it matched some VSCode implementation details, and it seems wrong to push that unnecessary complexity onto many projects when Microsoft has more than enough resources to fix it in VSCode.

I agree with the thrust of what you're saying, but ripping out UTF-16 support from clangd doesn't make sense and would break VSCode. Rather, we should make it work well with clients that don't want to use UTF-16.

Proposed protocol:

ClientCapabilities gains a field offsetEncoding, of type string[]. It lists the encodings the client supports, in preference order. It SHOULD include "utf-16". If not present, it is assumed to be ["utf-16"].

InitializeResponse gains a field offsetEncoding of type string. The character field of Position objects counts units in this encoding. It SHOULD be one of the requested offsetEncodings, or "utf-16" if none are supported. If not present, it is assumed to be "utf-16".

Well-known encodings are:

  • utf-8: character counts bytes
  • utf-16: character counts code units
  • utf-32: character counts codepoints
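Under this proposal, a client that prefers UTF-8 would negotiate roughly as follows (a sketch of the two payloads, with field placement as described above):

```typescript
// Client -> server, inside the initialize request's capabilities:
const clientCapabilities = {
  offsetEncoding: ["utf-8", "utf-16"],  // preference order; SHOULD include "utf-16"
};

// Server -> client, in the initialize response:
const initializeResult = {
  offsetEncoding: "utf-8",  // from here on, Position.character counts UTF-8 bytes
  capabilities: { /* ...the usual server capabilities... */ },
};
```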

In practice this means:

  • standard-compliant clients/servers will keep using UTF-16 if either side is unaware of this extension
  • clients and servers that both prefer utf-8 can negotiate to use it, if both sides use this extension
  • clients and servers that only support utf-8 can indicate this in the protocol very cheaply (no logic needed), allowing the other side to recover
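The negotiation logic on the server side is correspondingly tiny; a sketch:

```typescript
// Pick the first client-preferred encoding the server supports,
// falling back to the spec default of "utf-16".
function negotiateOffsetEncoding(
  clientPrefs: string[] | undefined,   // ClientCapabilities.offsetEncoding
  serverSupported: Set<string>,        // e.g. new Set(["utf-8", "utf-16"])
): string {
  for (const enc of clientPrefs ?? ["utf-16"]) {
    if (serverSupported.has(enc)) {
      return enc;
    }
  }
  return "utf-16";
}
```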

I agree removing UTF-16 support now would not make sense. My point is that a specification only has value in that it allows the tools implementing it to interoperate: we do not have to follow the spec "because it is the spec"; we have to follow it because it is what allows our tools to communicate with each other. That means we can, and I suggest we should, deviate from the spec as long as we agree on how we deviate.

I am saying that the VSCode team has little incentive to change things in the spec as long as it gets followed and tools are interoperable with their product.

It seems that most existing LSP implementations already do not respect the specification on this point: some just use codepoint or byte counts (rust-lang/rls#1113, kakoune-lsp/kakoune-lsp#98, haskell/lsp#70, ...), while others try to infer the UTF-16 column count from the UTF-8 byte count (jacobdufault/cquery#57), which is not a complete solution either.

So, seeing that

  • The spec is unlikely to change in the short term, and even if it does, it will likely mandate UTF-16 support for backwards compatibility.
  • The spec is not the result of a concerted process; it just encodes in a document whatever VSCode happens to do.
  • The spec is not widely respected on this point, either because implementations do not (yet) know that it mandates UTF-16 code units, or because computing UTF-16 code units is hard when your buffer is encoded in UTF-8 (which is what LSP mandates when transmitting the buffer content).
  • Not respecting the spec will only break lines containing non-ASCII characters (if using byte coordinates) or lines containing characters requiring 2 UTF-16 code units (if using codepoint coordinates), when interacting with implementations that insist on following the spec on this point.

I think agreeing to use either byte or codepoint coordinates as the default, and making UTF-16 the complex opt-in choice (possibly with this extra negotiation support), would be a better course of action.

In any case, I am all for adding a UTF-8 mode to clangd; it would set a nice precedent. I suggest exposing it as a command-line argument and letting the client side (which is usually responsible for starting the server) decide that it doesn't want to deal with UTF-16. In other words, I would recommend against the in-protocol negotiation, which would enshrine UTF-16 as a default.

I agree the way VSCode drives the spec is a problem. Fixing that is outside the scope of this bug, and probably of clangd. The focus here is improving interop without creating new interop failure modes (which changing the default would do).

I'd suggest clients/servers that only support utf-8 signal that with this extension, which will allow counterparts that support both utf-8 and utf-16 to interop with them.

Supporting a clangd command-line flag (e.g. -offset-encoding=utf8) would be useful (see the sketch after this list), as it addresses the common case where:

  • client always speaks utf-8
  • client isn't aware of the extension
  • user can be made aware of this and set command-line args
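Such a client could then opt in at launch time; a sketch assuming the flag lands with the spelling suggested above (the final spelling may differ):

```typescript
import { spawn } from "child_process";

// Launch clangd with the suggested (not yet final) offset-encoding flag;
// LSP traffic flows over the child's stdin/stdout.
const server = spawn("clangd", ["-offset-encoding=utf8"], {
  stdio: ["pipe", "pipe", "inherit"],
});
```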

> Not respecting the spec will only break lines containing non-ASCII characters (if using byte coordinates) or lines containing characters requiring 2 UTF-16 code units (if using codepoint coordinates), when interacting with implementations that insist on following the spec on this point.

This understates the problem: when sending incremental edits, as soon as an invalid edit (bad/misinterpreted offsets) is sent, the client and the server have an out-of-sync model of the document, and the server has no way to recover.
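To make the failure mode concrete, a worked example (not from the thread):

```typescript
// The line as both sides initially agree on it:
const line = "héllo";  // 'é' = U+00E9: one UTF-16 unit, two UTF-8 bytes (0xC3 0xA9)

// Client edit: delete "llo", sent as the range {character: 2}..{character: 5},
// counted in UTF-16 code units per the spec.
//
//   Correct (UTF-16) reading:   remove units [2, 5)  ->  "hé"
//   Wrong (UTF-8 byte) reading: remove bytes [2, 5)  ->  0x68 0xC3 0x6F,
//                               an ill-formed buffer that differs from the client's
//
// Every later incremental edit is now applied to silently diverged documents.
```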

What about the index? Which encoding do we use when building the static index? Since the static index is built offline, we don't have the flexibility to switch between UTF-8 and UTF-16; my guess would be UTF-8.

Yes, an index built by an external process (including auto-index by a differently-configured clangd) is the main catch here.
Increasing the complexity and index size by storing both doesn't seem great.

My initial feeling is that since clangd already always has to deal gracefully with the index being out of date, index locations being off by a little bit actually has fairly tame consequences: go-to-definition is off by one within the line, etc. It's not like the client/server case, where a mismatch can lead to a desync of document content.

In the common cases we'll get this right: dynamic index, and auto-index with a single primary editor. For a local static index, I think the builder should default to utf-16, with a flag to change it. For niche cases like a shared hosted static index, we'd have to make a call. (I'd vote for UTF-16, so editors that follow the official spec get perfect results.)

That protocol could be forked (with backwards compatibility) to have more open governance, RFCs, etc. It would be so much better if that option at least existed.

Extension added in r357102.

Documentation is at https://clangd.github.io/extensions.html#utf-8-offsets

https://reviews.llvm.org/D59927 will add codepoint (i.e. utf-32) support.

@sam-mccall could you advertise this extension on the original LSP issue (microsoft/language-server-protocol#376)? I would have implemented this extension for rust-analyzer ages ago if I had known it existed.

Sure! I did mention it in microsoft/language-server-protocol#376 (comment), but it got buried in the flamewar :-) I'll post another message.

On the client-side, it looks like nvim's LSP module and kak-lsp have some support for it. Another server implementation would be great!