Setting up an option (switch) to always show single characters as preferred candidates before phrases.

Question

mike-fabian opened this issue 2 years ago

See first comment of #103:

In addition, members of Fedora Chinese Users Group have wishes that:

Setting up an option (switch) to always show single characters as preferred candidates before phrases.

Mike FABIAN commented 2 years ago

Ping 🏓

Mike FABIAN · Answer 1 · Fri Feb 11 2022 16:08:52 GMT+0800 (China Standard Time)

At the moment, ibus-table has an option in the setup tool to choose

Compose: [Phrase|Single Char]

If this is set to “Single Char”, only single characters and no phrases are shown.

But if it is set to “Phrase”, then a mixture of single characters and phrases may be shown.

For example like this:

(Here I have increased the "debug level" in the ibus-table setup tool to show the priorities from the system database and the user database, which are shown in green. The first column is the priority from the system database; the second is the priority from the user database, i.e. the number of times the user has typed this phrase/character.)

Mike FABIAN · Answer 2 · Fri Feb 11 2022 16:11:24 GMT+0800 (China Standard Time)

The sorting of this candidate list is currently done like this:

https://github.com/mike-fabian/ibus-table/blob/master/engine/tabsqlitedb.py#L1018

        return sorted(candidates,
                      key=lambda x: (
                          - int(
                              typed_tabkeys == x[0]
                          ), # exact matches first!
                          pinyin_exact_match_function(x[0]),
                          -1*x[3],   # user_freq descending
                          -1*x[2],   # freq descending
                          len(x[0]), # len(tabkeys) ascending
                          x[0],      # tabkeys alphabetical
                          code_point_function(x[1][0]),
                          # Unicode codepoint of first character of phrase:
                          ord(x[1][0])
                      ))[:maximum_number_of_candidates]

Mike FABIAN · Answer 3 · Fri Feb 11 2022 16:36:34 GMT+0800 (China Standard Time)

So exact matches come first, then the list is sorted by user frequency, then by system frequency, then by the length of the input string needed to type these characters, then alphabetically by the input string, and finally by the code points as fallbacks.

I.e. in this example:

1 人 1002212710 0

comes first because it is an exact match for the typed input w.
This is the only exact match for w in the wubi-jidian table:

$ grep ^w\\s wubi-jidian86.txt 
w       人      1002212710

Now 你好

2 你好 qvb 70500000 4

has a lower system frequency (70500000) than 你 (1490000000), but 你好 has a higher user frequency (4) than 你 (3), so 你好 comes above 你 (when the dynamic adjust option is used, which is the default for wubi-jidian).

3 你 qiy 1490000000 3
4 全国 glg 487000000 3

你 and 全国 both have the user frequency 3, but the system frequency of 你 (1490000000) is higher, so 你 comes before 全国.

And then 倆 has a user frequency of 1, so it comes after the candidates
with higher user frequency (and also after 人, even though that has the user frequency 0, because 人 is an exact match of w):

5 倆 gmy 7840000 1

And finally 𥝈 with system frequency 0 and user frequency 0:

6 𥝈 ftc 0 1
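
The ordering walked through above can be reproduced with a small standalone sketch. It is a simplification, not the actual ibus-table code: `pinyin_exact_match_function` and `code_point_function` are left out, and the full tabkeys are assumed to be the typed `w` followed by the remaining keys shown in the listing. The frequencies are copied from the listing.

```python
# Candidate tuples: (tabkeys, phrase, freq, user_freq), as indexed by
# x[0]..x[3] in the sort key quoted earlier.
# ASSUMPTION: the full tabkeys are "w" plus the remaining keys in the listing.
CANDIDATES = [
    ("wftc", "𥝈", 0, 1),
    ("wgmy", "倆", 7840000, 1),
    ("wglg", "全国", 487000000, 3),
    ("wqiy", "你", 1490000000, 3),
    ("wqvb", "你好", 70500000, 4),
    ("w", "人", 1002212710, 0),
]


def sort_candidates(candidates, typed_tabkeys):
    """Simplified version of the quoted sort (pinyin and code point helpers omitted)."""
    return sorted(
        candidates,
        key=lambda x: (
            -int(typed_tabkeys == x[0]),  # exact matches first
            -x[3],                        # user_freq descending
            -x[2],                        # freq descending
            len(x[0]),                    # len(tabkeys) ascending
            x[0],                         # tabkeys alphabetical
            ord(x[1][0]),                 # code point of first character
        ),
    )


ORDER = [x[1] for x in sort_candidates(CANDIDATES, "w")]
print(ORDER)  # ['人', '你好', '你', '全国', '倆', '𥝈']
```

人 wins only through the exact-match key; every other position is decided by user frequency first and system frequency second.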

Mike FABIAN · Answer 4 · Fri Feb 11 2022 18:24:47 GMT+0800 (China Standard Time)

Now I could make a change like this for example:

diff --git a/engine/tabsqlitedb.py b/engine/tabsqlitedb.py
index 3986075..1ee1a8f 100644
--- a/engine/tabsqlitedb.py
+++ b/engine/tabsqlitedb.py
@@ -1002,6 +1002,7 @@ class TabSqliteDb:
                               ), # exact matches first!
                               pinyin_exact_match_function(x[0]),
                               -1*x[3],   # user_freq descending
+                              len(x[1]),
                               # Prefer characters used in the
                               # desired Chinese variant:
                               -(bitmask
@@ -1021,6 +1022,7 @@ class TabSqliteDb:
                           ), # exact matches first!
                           pinyin_exact_match_function(x[0]),
                           -1*x[3],   # user_freq descending
+                          len(x[1]),
                           -1*x[2],   # freq descending
                           len(x[0]), # len(tabkeys) ascending
                           x[0],      # tabkeys alphabetical

x[1] contains the phrase, so inserting a sort key for the length of the phrase after the user frequency would prefer shorter phrases (single chars preferred over phrases with two chars, phrases with two chars preferred over phrases with 3 chars, and so on).

I checked what happens when inserting this len(x[1]) sort key before the user frequency: This seems quite useless because then you could just as well use the already existing option Compose: Single Char; it has basically the same effect. For example, when typing w, no phrases with more than one character will be in the first hundred matches, so only single characters will show up, no matter how often the user has typed a certain phrase with more than one character before.

But inserting this len(x[1]) sort key after the user frequency but before the system frequency (and before the Chinese variant bitmask) might be useful.

This is what I did in the above experimental patch.

It would not change the order in the above screenshot though because the two character phrases 你好 and 全国 would still get the same position because of their higher user frequency, i.e. one would still get:

1 人 1002212710 0
2 你好 qvb 70500000 4
3 你 qiy 1490000000 3
4 全国 glg 487000000 3
5 倆 gmy 7840000 1
6 𥝈 ftc 0 1
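
To compare the two placements of the `len(x[1])` key discussed above, here is a rough sketch. It ignores the Chinese variant bitmask and the pinyin helper, assumes the exact-match key stays first in both cases, and assumes the full tabkeys are `w` plus the remaining keys from the listing. The `length_position` parameter is hypothetical and exists only for this comparison.

```python
# Candidate tuples: (tabkeys, phrase, freq, user_freq), values from the listing.
# ASSUMPTION: the full tabkeys are "w" plus the remaining keys in the listing.
CANDIDATES = [
    ("w", "人", 1002212710, 0),
    ("wqvb", "你好", 70500000, 4),
    ("wqiy", "你", 1490000000, 3),
    ("wglg", "全国", 487000000, 3),
    ("wgmy", "倆", 7840000, 1),
    ("wftc", "𥝈", 0, 1),
]


def sort_key(x, typed_tabkeys, length_position):
    """Simplified key; length_position is a hypothetical switch for this sketch."""
    exact = -int(typed_tabkeys == x[0])
    if length_position == "before_user_freq":
        head = (exact, len(x[1]), -x[3])
    else:  # "after_user_freq", as in the experimental patch
        head = (exact, -x[3], len(x[1]))
    return head + (-x[2], len(x[0]), x[0], ord(x[1][0]))


def ranked(length_position):
    ordered = sorted(CANDIDATES, key=lambda x: sort_key(x, "w", length_position))
    return [x[1] for x in ordered]


print(ranked("before_user_freq"))  # ['人', '你', '倆', '𥝈', '你好', '全国']
print(ranked("after_user_freq"))   # ['人', '你好', '你', '全国', '倆', '𥝈']
```

With the length key before the user frequency, the often-typed 你好 drops behind every single character. With the length key after it, the order is the same as without the patch.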

But let's say the user frequencies are all the same (or the dynamic adjust option is off, which means user frequencies are not used). Then my experimental patch above gives this order when using the “Simplified Chinese first” option:

1 人 1002212710 0
2 你 qiy 1490000000 0
3 𥝈 ftc 0 0
4 倆 gmy 7840000 0
5 全国 glg 487000000 0
6 你好 qvb 70500000 0

倆 comes after 𥝈 in spite of its higher system frequency because 倆 is marked as a character used only in Traditional Chinese and 𥝈 is not marked at all, so it is assumed to be valid for all variants of Chinese:

mfabian@taka:/local/mfabian/src/ibus-table (release-candidate-1.16.8 *$)
$ grep 倆 engine/chinese_variants.py 
    u'倆': 2,
mfabian@taka:/local/mfabian/src/ibus-table (release-candidate-1.16.8 *$)
$ grep 𥝈 engine/chinese_variants.py 
mfabian@taka:/local/mfabian/src/ibus-table (release-candidate-1.16.8 *$)
$

When using “Traditional Chinese first” one would get:

1 人 1002212710 0
2 你 qiy 1490000000 0
3 倆 gmy 7840000 0
4 𥝈 ftc 0 0
5 你好 qvb 70500000 0
6 全国 glg 487000000 0

Now 你好 would come before 全国 in spite of its lower system frequency because 国 is marked as a character used only in Simplified Chinese:

mfabian@taka:/local/mfabian/src/ibus-table (release-candidate-1.16.8 *$)
$ grep 国 engine/chinese_variants.py 
    u'国': 1,
mfabian@taka:/local/mfabian/src/ibus-table (release-candidate-1.16.8 *$)
$
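
Both orderings can be approximated with the sketch below. The variant handling is a simplified stand-in for the real bitmask logic in tabsqlitedb.py, not a copy of it. It assumes 1 means "Simplified only" and 2 means "Traditional only" (as the grep output suggests), that every character not in the small `VARIANTS` table is valid for all variants, and that a phrase's category is the bitwise AND of its characters' categories. All user frequencies are set to 0.

```python
SIMPLIFIED, TRADITIONAL, ALL = 1, 2, 3

# Only the two entries confirmed by the grep output above.
# ASSUMPTION: every other character is treated as valid for all variants.
VARIANTS = {"倆": TRADITIONAL, "国": SIMPLIFIED}

# Candidate tuples: (tabkeys, phrase, freq, user_freq), user_freq all 0.
# ASSUMPTION: the full tabkeys are "w" plus the remaining keys in the listing.
CANDIDATES = [
    ("w", "人", 1002212710, 0),
    ("wqvb", "你好", 70500000, 0),
    ("wqiy", "你", 1490000000, 0),
    ("wglg", "全国", 487000000, 0),
    ("wgmy", "倆", 7840000, 0),
    ("wftc", "𥝈", 0, 0),
]


def phrase_category(phrase):
    """Hypothetical helper: AND together the variant flags of all characters."""
    category = ALL
    for char in phrase:
        category &= VARIANTS.get(char, ALL)
    return category


def ranked(preferred_variant, typed_tabkeys="w"):
    ordered = sorted(
        CANDIDATES,
        key=lambda x: (
            -int(typed_tabkeys == x[0]),  # exact matches first
            -x[3],                        # user_freq descending
            len(x[1]),                    # shorter phrases first (the patch)
            -int(bool(phrase_category(x[1]) & preferred_variant)),
            -x[2],                        # freq descending
            len(x[0]),                    # len(tabkeys) ascending
            x[0],                         # tabkeys alphabetical
            ord(x[1][0]),                 # code point of first character
        ),
    )
    return [x[1] for x in ordered]


print(ranked(SIMPLIFIED))   # ['人', '你', '𥝈', '倆', '全国', '你好']
print(ranked(TRADITIONAL))  # ['人', '你', '倆', '𥝈', '你好', '全国']
```

Because the length key sits before the variant key, the variant preference only reorders candidates of the same length, which is why 倆 and 𥝈 swap places among the single characters and 全国 and 你好 swap among the two-character phrases.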

Mike FABIAN · Answer 5 · Fri Feb 11 2022 18:28:33 GMT+0800 (China Standard Time)

In my experimental patch above, I inserted the len(x[1]) sort key unconditionally just to see what happens.

If this behaviour is useful, I could add an option like

🗹  Prefer shorter phrases

which would enable this behaviour when checked.

Is that what you want? Would that be useful?
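
One way such an option could be wired into the sort key is sketched below. This is an illustration only: the name `prefer_shorter_phrases` and the `make_sort_key` helper are hypothetical, and the variant bitmask and pinyin helper are again left out.

```python
def make_sort_key(typed_tabkeys, prefer_shorter_phrases):
    """Build a sort key; both names here are hypothetical, not ibus-table API."""
    def key(x):
        # x is a (tabkeys, phrase, freq, user_freq) tuple.
        return (
            -int(typed_tabkeys == x[0]),                 # exact matches first
            -x[3],                                       # user_freq descending
            len(x[1]) if prefer_shorter_phrases else 0,  # optional length key
            -x[2],                                       # freq descending
            len(x[0]),                                   # len(tabkeys) ascending
            x[0],                                        # tabkeys alphabetical
            ord(x[1][0]),                                # code point of first character
        )
    return key


# Two candidates with equal user frequency.
# ASSUMPTION: the full tabkeys are "w" plus the remaining keys in the listing.
PAIR = [("wglg", "全国", 487000000, 0), ("wgmy", "倆", 7840000, 0)]

OPTION_OFF = [x[1] for x in sorted(PAIR, key=make_sort_key("w", False))]
OPTION_ON = [x[1] for x in sorted(PAIR, key=make_sort_key("w", True))]
print(OPTION_OFF)  # ['全国', '倆']
print(OPTION_ON)   # ['倆', '全国']
```

With the option off, the constant 0 leaves the existing order untouched, so users who do not check the box would see no change.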