Broken collation order for Latin characters with diacritics
9p6 opened this issue · comments
Bug report
Describe the bug
Here is a clear and concise description of what the problem is:
The sort order for movie titles that contain Latin characters with diacritics appears to be wrong. I describe the problem as I found it with Polish, but it very likely applies to other Latin-based scripts.
Expected Behavior
Here is a clear and concise description of what was expected to happen:
Expected character order for Polish language:
a ą b c ć d e ę f g h i j k l ł m n ń o ó p q r s ś t u v w x y z ź ż
Actual Behavior
Actual order in movie title list (similar results in file manager)
ą a b ć c d e ę f g h i j k l m n ń ó o p q r ś s t u v w x y ż ź z ł
Most letters with diacritics are inverted relative to their base letter (the accented letter sorts first), and in addition the letter ł is shifted to the end.
Possible Fix
Since locale-aware sorting for Unicode is a long-solved problem, I assume Kodi uses an in-house solution for its own reasons, and that solution is what needs fixing.
To Reproduce
Steps to reproduce the behavior:
To generate a test case I used the code:
for x in a ą b c ć d e ę f g h i j k l ł m n ń o ó p q r s ś t u v w x y z ź ż; do touch "$x.mkv"; echo "<movie><title>$x</title></movie>" > "$x.nfo"; done
Scan the above files with the local NFO scraper, then check the order in the movie title list.
Debuglog
The debuglog can be found here:
https://paste.kodi.tv/welogokofi.kodi
Screenshots
Here are some links or screenshots to help explain the problem:
Additional context or screenshots (if appropriate)
Here is some additional context or explanation that might help:
Your Environment
Used Operating system:
- Android
- iOS
- tvOS
- Linux
- macOS
- Windows
- Windows UWP
- Operating system version/name: Ubuntu 22.04, LibreELEC 12
- Kodi version: 20.2, 21
- Locale: C.utf8, en_US.utf8, pl_PL.utf8 (same results with different locales)
Thank you for using Kodi and our issue tracker. This is your friendly Kodi GitHub bot 😉
It seems that you have not followed the template we provide and require for all bug reports (or have opened a roadmap item by accident). Please understand that following the template is mandatory and required for the team to be able to handle the volume of open issues efficiently.
Please edit your issue message to follow our template and make sure to fill in all fields appropriately. The issue will be closed after one week has passed without satisfactory follow-up from your side.
This is an automatically generated message. If you believe it was sent in error, please say so and a team member will remove the "Ignored rules" label.
A log captured with debug logging enabled is required for this.
I've attached the log from library scan process
From your log
info <general>: CLangInfo: loading resource.language.en_gb language information...
debug <general>: trying to set locale to en_DE.UTF-8
info <general>: global locale set to C
You should have the default "c" collation.
The default C collation would have all diacritics shifted to the end
a b c d e f g h i j k l m n o p q r s t u v w x y z ó ą ć ę ł ń ś ź ż
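For reference, the C order quoted above can be reproduced outside Kodi. This is only a minimal sketch, assuming a POSIX shell with the standard `sort` and `paste` utilities; the variable names `letters` and `c_order` are made up for illustration. It sorts the same letters used in the test case byte-wise under the C locale, which for UTF-8 is the same as code point order:

```shell
# Letters from the reproduction step (Polish alphabet, lowercase).
letters="a ą b c ć d e ę f g h i j k l ł m n ń o ó p q r s ś t u v w x y z ź ż"

# LC_ALL=C makes sort compare raw bytes; in UTF-8 that equals code point order,
# so every non-ASCII letter lands after z. paste joins the lines with spaces.
c_order=$(printf '%s\n' $letters | LC_ALL=C sort | paste -sd' ' -)

echo "$c_order"
```

Since this differs from the order seen in the movie title list, plain C collation does not explain the reported behavior on its own.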
so it is definitely something else. I don't know what to make of en_DE.UTF-8 in the log; it seems random.
I forgot to mention that I did try starting Kodi with different locales (en_US.utf8, pl_PL.utf8), yet it makes no difference to the reported issue:
LANG=en_US.utf8 kodi --debug