error loading link-parser for Thai

Question

Bancherd-DeLong opened this issue 2 years ago · comments

I received this error while trying to run "link-parser th"

link-grammar: Info: Dictionary found at ../data/th/4.0.dict
link-grammar: Error: Failed to compile regex: "THAI-NUMBERS: ^[๐-๙,.]*[๐-๙]$": Invalid collation character (code 3)
)Fatal error: Unable to open dictionary.

A few weeks ago, I was able to run "link-parser th" without problems on a different setup/machine.

Thank you for your assistance.

Amir Plivatsky · Answer 1 · Wed May 25 2022 09:53:12 GMT+0800 (China Standard Time)

This is a problem in the regex implementation of libc.
If you have built the library by yourself, try to install the pcre2 package and then build it again.

Alternatively, do the following changes in data/th/4.0.regex:

Replace all occurrences of ๐-๙ with ๐๑๒๓๔๕๖๗๘๙ (using [:digit:] instead doesn't work).
Replace all occurrences of ก-ฮ with [:alpha:] (this fixes a similar problem in THAI-PART-NUMBER).

(To fix this in the distribution, the patch should list all the Thai alphabetic characters instead of using [:alpha:], since C++ regex is one of the configuration options (the default on Windows) and it doesn't support Unicode in [:alpha:]. A better option may be to add support for defining or redefining character classes in 4.0.regex; this looks like a simple addition.)
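
The two replacements suggested above can also be applied programmatically. The following is only a minimal sketch in Python: the helper name `apply_workaround` is hypothetical (it is not part of link-grammar), and the sample line is the pattern quoted in the error message, not the actual contents of 4.0.regex.

```python
# Hypothetical helper, not part of link-grammar. The replacement strings
# are the ones given in the answer above.
THAI_DIGITS = "๐๑๒๓๔๕๖๗๘๙"


def apply_workaround(regex_text: str) -> str:
    """Rewrite the Thai ranges that the libc regex implementation rejects."""
    # Digits: spell the range out explicitly.
    text = regex_text.replace("๐-๙", THAI_DIGITS)
    # Consonants: substitute the POSIX class suggested in the answer.
    return text.replace("ก-ฮ", "[:alpha:]")


if __name__ == "__main__":
    # Pattern as quoted in the error message of the original report.
    line = "THAI-NUMBERS: ^[๐-๙,.]*[๐-๙]$"
    print(apply_workaround(line))
```

To use it on a real file, one would read data/th/4.0.regex as UTF-8, pass the text through the function, and write it back, keeping a backup of the original first.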

Amir Plivatsky · Answer 2 · Thu Jun 02 2022 03:18:06 GMT+0800 (China Standard Time)

@linas,
In order that all the regex libraries would work fine with Unicode ranges, it is possible to add to 4.0.regex definitions for character classes, e.g.:
[:thai-alpha:] : /[ก-ฮ]/
[:thai-digits:] : /[๐-๙]/
etc.
(I can implement that if desired.)

Linas Vepštas · Answer 3 · Thu Jun 02 2022 05:51:08 GMT+0800 (China Standard Time)

In order that all the regex libraries would work fine with Unicode ranges,

I don't mind. The right person to talk to is @wannaphong and @kaamanita -- Note also, that those guys seem to still be using their own branch of link-grammar, and that they have not realized that I've integrated all of their work into mainline. So ... hi @wannaphong and @kaamanita -- please try mainline, and please comment on this bug report!

Bancherd · Answer 4 · Thu Jun 02 2022 08:31:24 GMT+0800 (China Standard Time)

I know both of them personally. Will forward this note to them. Meanwhile, I might make the changes to my local copy. Thank you.


Wannaphong Phatthiyaphaibun · Answer 5 · Thu Jun 02 2022 15:10:10 GMT+0800 (China Standard Time)

Hello. I am now using mainline, and I don't see the error.

/link-grammar# echo "ฉัน กิน ข้าว ๔ มื้อ" | link-parser th
link-grammar: Info: Dictionary found at ./data/th/4.0.dict
link-grammar: Info: Dictionary version 5.10.4, locale th_TH.UTF-8
link-grammar: Info: Library version link-grammar-5.10.4. Enter "!help" for help.
Found 12 linkages (12 had no P.P. violations)
        Linkage 1, cost vector = (UNUSED=0 DIS= 1.00 LEN=4)

    +-----LWs-----+
    |       +<-S<-+->O>+>NUnr>+->CLn->+
    |       |     |    |      |       |
LEFT-WALL ฉัน.pr กิน.v ข้าว.n ๔[!].nu มื้อ.cln

Bye.

Dockerfile

FROM ubuntu:focal
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Etc/UTC
RUN apt-get update && apt-get install git build-essential python3-dev libpcre2-dev python-is-python3 libtre-dev wget automake locales libtool flex m4 autoconf-archive autoconf pkg-config swig libthai-dev help2man -y && rm -rf /var/lib/apt/lists/*
RUN (echo "en_US.UTF-8 UTF-8" > /etc/locale.gen && \
     echo "ru_RU.UTF-8 UTF-8" >> /etc/locale.gen && \
     echo "he_IL.UTF-8 UTF-8" >> /etc/locale.gen && \
     echo "de_DE.UTF-8 UTF-8" >> /etc/locale.gen && \
     echo "lt_LT.UTF-8 UTF-8" >> /etc/locale.gen && \
     echo "fa_IR.UTF-8 UTF-8" >> /etc/locale.gen && \
     echo "ar_AE.UTF-8 UTF-8" >> /etc/locale.gen && \
     echo "kk_KZ.UTF-8 UTF-8" >> /etc/locale.gen && \
     echo "tr_TR.UTF-8 UTF-8" >> /etc/locale.gen && \
     echo "th_TH.UTF-8 UTF-8" >> /etc/locale.gen)
RUN locale-gen en_US.UTF-8 ru_RU.UTF-8 he_IL.UTF-8 de_DE.UTF-8 lt_LT.UTF-8 fa_IR.UTF-8 ar_AE.UTF-8 kk_KZ.UTF-8 tr_TR.UTF-8 th_TH.UTF-8
ENV LANG=th_TH.UTF-8
ENV LANGUAGE=th_TH:th
ENV LC_ALL=th_TH.UTF-8
ENV PYTHONPATH=/usr/local/lib/python3.8/site-packages
RUN git clone https://github.com/opencog/link-grammar.git
WORKDIR link-grammar
RUN ./autogen.sh --disable-java-bindings --enable-python-bindings
RUN ./configure --disable-java-bindings --enable-python-bindings
RUN make
RUN make install
RUN ldconfig

Wannaphong Phatthiyaphaibun · Answer 6 · Thu Jun 02 2022 15:16:24 GMT+0800 (China Standard Time)

I think you may have been missing some packages during installation.

Prachya Boonkwan · Answer 7 · Thu Jun 02 2022 15:19:58 GMT+0800 (China Standard Time)

Hi all,

It took me quite a while to finish investigating the issue. It seems that some Linux distributions come with incompatible regex libraries. I encountered the same issue when I tried installing it on my Ubuntu virtual machine. I then tried it on several other Linux distributions and got different results. I resolved the problem by upgrading PCRE2 to the latest version on all of these VMs. I agree with Bancherd and Wannaphong that the regex library might be the issue here.

I'm also preparing a Dockerfile that resolves the regex library problem for the Thai Link parser in general. It will also come with basic Thai NLP tools (word segmentation, POS tagging, NER, and sentence segmentation based on the Link parser).

In the meantime, we have prepared a demo system for the parser at https://language-semantic.org/parser-demo/ . Note that the information extraction modules (Pred and SVO) are under development.

Best regards,
Prachya

Linas Vepštas · Answer 8 · Thu Jun 02 2022 23:45:29 GMT+0800 (China Standard Time)

Hi all,

Please re-read @ampli's note. He points out that the problem can be solved by adding
[:thai-alpha:] : /[ก-ฮ]/
[:thai-digits:] : /[๐-๙]/
etc...
to 4.0.regex. That way, it will work with any regex library. This seems like the best long-term solution to me: it avoids future issues where people have forgotten that they need to install a different regex library.

So I'd like to ask ampli to make these changes, unless one of you vetoes that? @kaamanita @wannaphong

Amir Plivatsky · Answer 9 · Fri Jun 03 2022 00:16:11 GMT+0800 (China Standard Time)

[:thai-alpha:] : /[ก-ฮ]/
[:thai-digits:] : /[๐-๙]/

Please note that, for now, the needed changes are the ones I pointed out in my reply to @Bancherd-DeLong, i.e. to use an explicit list of the needed Unicode characters instead of ranges:

Replace all occurrences of ๐-๙ with ๐๑๒๓๔๕๖๗๘๙
Replace all occurrences of ก-ฮ with กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรฤลฦวศษสหฬอฮ

This is cumbersome but should work...

The quoted change is what I propose to use once support for it has been added to read-regex.c!
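
The explicit lists above do not have to be typed by hand, because both ranges are contiguous runs of Unicode code points. Here is a minimal sketch in Python that generates them; the function name `expand_range` is only illustrative and is not part of link-grammar.

```python
def expand_range(first: str, last: str) -> str:
    """Return every character from first to last, inclusive, by code point."""
    return "".join(chr(cp) for cp in range(ord(first), ord(last) + 1))


# Thai digits, U+0E50 through U+0E59.
THAI_DIGITS = expand_range("๐", "๙")
# Thai consonants (including ฤ and ฦ), U+0E01 through U+0E2E.
THAI_CONSONANTS = expand_range("ก", "ฮ")

if __name__ == "__main__":
    print(THAI_DIGITS)
    print(THAI_CONSONANTS)
```

The two resulting strings can then be pasted into data/th/4.0.regex in place of ๐-๙ and ก-ฮ.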

Amir Plivatsky · Answer 10 · Fri Jun 03 2022 00:50:37 GMT+0800 (China Standard Time)

I will send a PR so that the problem can be solved just by applying it.

Linas Vepštas · Answer 11 · Wed Jun 22 2022 01:40:57 GMT+0800 (China Standard Time)

I think this is now fixed (in pull req #1297) and is now released as part of link-grammar-5.10.5 on the website. So .. can this bug be closed?

Bancherd · Answer 12 · Wed Jun 22 2022 08:13:45 GMT+0800 (China Standard Time)

Thank you very much!