universal-ctags / ctags

A maintained ctags implementation

Home Page:https://ctags.io

Unicode file names aren't recognized

vToMy opened this issue · comments

Running ctags on Unicode file names fails to open them.

Example
For a file called:
こんにちは世界.txt
Running:
ctags --options=NONE *
Will produce:

ctags: Notice: No options will be read from files or environment
ctags: Warning: cannot open input file "???????.txt" : No such file or directory

Works for me:

$ ls
こんにちは世界.adoc

$ ../ctags --options=NONE -f - *
ctags: Notice: No options will be read from files or environment
Chapter 1 (Level 0)	こんにちは世界.adoc	/^= Chapter 1 (Level 0)$/;"	c
Level 3 Section 1.1.1.1 Title	こんにちは世界.adoc	/^==== Level 3 Section 1.1.1.1 Title$/;"	t	subsection:Chapter 1 (Level 0).Section 1.1.Subsection 1.1.1
Level 4 Section 1.1.1.1.1 Title	こんにちは世界.adoc	/^===== Level 4 Section 1.1.1.1.1 Title$/;"	T	subsubsection:Chapter 1 (Level 0).Section 1.1.Subsection 1.1.1.Level 3 Section 1.1.1.1 Title
Section 1.1	こんにちは世界.adoc	/^== Section 1.1$/;"	s	chapter:Chapter 1 (Level 0)
Subsection 1.1.1	こんにちは世界.adoc	/^=== Subsection 1.1.1$/;"	S	section:Chapter 1 (Level 0).Section 1.1

What operating system are you running on, and what version ctags?

OS: Windows 10 (Version 10.0.17134 Build 17134)
ctags version:

>ctags --version
Universal Ctags 0.0.0(2258b24b), Copyright (C) 2015 Universal Ctags Team
Universal Ctags is derived from Exuberant Ctags.
Exuberant Ctags 5.8, Copyright (C) 1996-2009 Darren Hiebert
  Compiled: Aug 18 2018, 00:09:59
  URL: https://ctags.io/
  Optional compiled features: +win32, +wildcards, +regex, +internal-sort, +iconv, +option-directory, +xpath, +json, +interactive, +yaml, +case-insensitive-filenames

This is a long-standing issue on Windows.
We currently use ANSI APIs, but we need to use Unicode APIs to handle Unicode file names.
This is a hard work, though.

Do you want to support this or not?

I want to support it but I don't know how to do it.

Basically, we need to modify everywhere we handle filenames.
For example, we need to use _wmain() instead of main() to get UTF-16 command line, and need to use _wfopen() instead of fopen() to open a file with UTF-16 filename.
It might be better to create a wrapper layer for converting between UTF-16 and UTF-8, and always use UTF-8 in the core part of u-ctags.
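
The wrapper-layer idea above could look roughly like the following. This is only a minimal sketch, not code from u-ctags: the names `utf8_to_utf16()` and `utf8_fopen()` are hypothetical, and it assumes the core always passes UTF-8 file names. On Windows it converts with `MultiByteToWideChar()` and opens the file with `_wfopen()`; elsewhere it simply calls `fopen()`.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>
#include <wchar.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string into a newly
 * allocated UTF-16 string. Returns NULL on failure; the caller frees it. */
static wchar_t *utf8_to_utf16(const char *s)
{
    /* With a length of -1 the returned count includes the terminating NUL. */
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0);
    wchar_t *w;

    if (n <= 0)
        return NULL;
    w = malloc((size_t)n * sizeof *w);
    if (w != NULL &&
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, w, n) <= 0) {
        free(w);
        w = NULL;
    }
    return w;
}
#endif

/* Hypothetical wrapper: open a file whose name is given in UTF-8. */
FILE *utf8_fopen(const char *path, const char *mode)
{
    assert(path != NULL && mode != NULL);
#ifdef _WIN32
    {
        wchar_t *wpath = utf8_to_utf16(path);
        wchar_t *wmode = utf8_to_utf16(mode);
        FILE *fp = (wpath != NULL && wmode != NULL) ? _wfopen(wpath, wmode) : NULL;

        free(wpath);
        free(wmode);
        return fp;
    }
#else
    /* POSIX systems pass the file name bytes through unchanged. */
    return fopen(path, mode);
#endif
}
```

Every other place that touches file names (directory scanning, `stat`, the command line obtained via `_wmain()`) would need a similar wrapper, which is why the change is so invasive.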

How long will you take to fix this issue?

@k-takata, thank you. Now I understand the meaning of "a hard work".

@Lennon925, I'm sorry but I have no plan to fix this.
We have to find a volunteer for fixing this issue.

As the first step, we have to add a test case to Tmain.
Unlike Units, Tmain has no way to record a test case for a known bug.
The Tmain test driver must be extended first.

@k-takata, I tried a file with Japanese characters in its name as input for ctags on MSYS2.
Unexpectedly, it works well. I think I'm doing something wrong. Could you give me more hints?

ctags-jp-filename

On Japanese Windows, we can use Japanese characters. However, characters that cannot be represented in Shift_JIS (e.g. letters with diacritical marks, simplified Chinese characters, ...) cannot be used there. Similarly, Japanese characters cannot be used on English Windows.

A workaround is to use the Cygwin (or MSYS2) version of u-ctags instead of the Win32 version. It handles file names in UTF-8.

Hi k-takata,
Is this issue fixed? If not, when will it be finished?

Regards,
Lennon

As I already said, it's very difficult to fix, and I don't have a plan to fix it yet.
If you really need it, please use the Cygwin version of u-ctags for now.

Starting with Windows 10 version 1903, an application can use the UTF-8 code page by declaring it in its application manifest file.
https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page
This might be able to solve the problem.

@k-takata, I think this is a kind of FAQ. What do you think?

Ah, maybe.

If #2360 is merged, would it read like this?

Q. Does Universal Ctags support Unicode file names?
A. Yes, Unicode file names are supported on unix-like platforms (Linux, macOS, Cygwin, etc.).
However, on Windows, you need to use Windows 10 version 1903 or later to use Unicode file names. (This is an experimental feature, though.)
On older versions of Windows, Universal Ctags only supports file names that can be represented in the current code page.
If you still want to use Unicode file names on them, use the Cygwin or MSYS2 version of Universal Ctags as a workaround.

YES! THANK YOU VERY MUCH.

Your comment made me realize how ctags-faq.7.rst should be organized.

About C/C++ parser
===================================================
...

About ctags running on Windows
========================================
Q. Does Universal Ctags support Unicode file names?
A. Yes, Unicode file names are supported on unix-like platforms (Linux, macOS, Cygwin, etc.).
However, on Windows, you need to use Windows 10 version 1903 or later to use Unicode file names. (This is an experimental feature, though.)
On older versions of Windows, Universal Ctags only supports file names that can be represented in the current code page.
If you still want to use Unicode file names on them, use the Cygwin or MSYS2 version of Universal Ctags as a workaround.

About ctags running on Windows

If the section is for Windows, the first and second sentences of the answer need to be adjusted.

E.g.

A. Partly yes. If you use Windows 10 version 1903 or later, Universal Ctags can use Unicode file names. (This is an experimental feature, though.)

This should be fixed by #2360 (on Windows 10 1903 or later).

@k-takata, you wrote:

This is a hard work, though.

However, it seems that you wrote the code fixing this issue in a few days :-)
Maybe a more accurate sentence is:

This is a hard work for you, though (but not for me).

Actually, Microsoft did the job, not me. ;-)
That's why this fix works only on Win10 1903 or later.

BTW, this fix has a restriction.
If we use Unicode APIs (as I suggested before), we can use 255 UTF-16 characters for file names.
However, with this fix, the maximum length of a file name is limited to 255 bytes. (E.g. a typical Japanese character takes 3 bytes in UTF-8, so the limit is only 85 such characters.)
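
To make the byte-versus-character arithmetic concrete, here is a small sketch that counts the code points and bytes in a UTF-8 file name and checks it against the 255-byte limit mentioned above. The names `utf8_code_points()`, `fits_byte_limit()` and `NAME_MAX_BYTES` are hypothetical and not part of u-ctags, and the counting assumes well-formed UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed limit, taken from the comment above: 255 bytes per file name
 * when the UTF-8 code page is used instead of the Unicode APIs. */
#define NAME_MAX_BYTES 255

/* Count code points in a well-formed UTF-8 string by skipping
 * continuation bytes, which all have the bit pattern 10xxxxxx. */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Return nonzero if the UTF-8 encoded name fits within the byte limit. */
int fits_byte_limit(const char *name)
{
    assert(name != NULL);
    return strlen(name) <= NAME_MAX_BYTES;
}
```

For a name made only of 3-byte characters, 255 / 3 = 85 characters fit, and the 86th pushes the name to 258 bytes. Names that mix in ASCII characters (1 byte each) can be correspondingly longer.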