opencog / link-grammar

The CMU Link Grammar natural language parser

error loading link-parser for Thai

Bancherd-DeLong opened this issue

I received this error while trying to run "link-parser th":

link-grammar: Info: Dictionary found at ../data/th/4.0.dict
link-grammar: Error: Failed to compile regex: "THAI-NUMBERS: ^[๐-๙,.]*[๐-๙]$": Invalid collation character (code 3)
)Fatal error: Unable to open dictionary.

A few weeks ago, I was able to run "link-parser th" without problems on a different setup/machine.

Thank you for your assistance.

This is a problem in the regex implementation of libc.
If you built the library yourself, try installing the pcre2 package and then building it again.

Alternatively, make the following changes in data/th/4.0.regex:

  1. Replace all occurrences of ๐-๙ with ๐๑๒๓๔๕๖๗๘๙ (using [:digit:] instead doesn't work).
  2. Replace all occurrences of ก-ฮ with [:alpha:] (this fixes a similar problem in THAI-PART-NUMBER).

(To fix this in the distribution, the patch should list all the Thai alphabetic characters instead of using [:alpha:], because C++ regex is one of the configuration options (the default on Windows) and it doesn't support Unicode in [:alpha:]. A better option may be to add support for defining or redefining character classes in 4.0.regex; this looks like a simple addition.)
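For anyone applying the workaround by hand, the explicit character lists can be generated rather than typed. The following is only a sketch in Python, not part of link-grammar: the function name expand_ranges is made up for illustration, and it blindly expands every range whose two endpoints are non-ASCII, without checking that the range sits inside a bracket expression.

```python
import re


def expand_ranges(text: str) -> str:
    """Replace each non-ASCII 'X-Y' range with an explicit list of characters.

    Hypothetical helper for illustration; it does not parse the regex, so it
    assumes every non-ASCII 'X-Y' sequence in the text really is a range.
    """
    def expand(match: re.Match) -> str:
        first, last = ord(match.group(1)), ord(match.group(2))
        if first > last:
            return match.group(0)  # not an ascending range; leave it untouched
        return "".join(chr(cp) for cp in range(first, last + 1))

    # Only ranges with two non-ASCII endpoints are touched, so a-z and 0-9 stay.
    return re.sub(r"([^\x00-\x7f])-([^\x00-\x7f])", expand, text)


if __name__ == "__main__":
    # The pattern quoted in the error message above.
    print(expand_ranges("^[๐-๙,.]*[๐-๙]$"))
    print(expand_ranges("[ก-ฮ]"))
```

Running it on the contents of data/th/4.0.regex and reviewing the diff before saving would be the cautious way to use it.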

@linas,
To make all the regex libraries work with Unicode ranges, character class definitions could be added to 4.0.regex, e.g.:
[:thai-alpha:] : /[ก-ฮ]/
[:thai-digits:] : /[๐-๙]/
etc.
(I can implement that if desired.)
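To make the idea concrete, here is a rough Python sketch of one way such definitions could be applied: read the definitions, then substitute them textually into the patterns before they are compiled. This is an assumption for illustration only. The thread does not say how read-regex.c would handle them, the function names are invented, and the choice to store only the inside of the brackets (so a class can be used within a larger bracket expression) is mine.

```python
import re

# Matches a definition line such as: [:thai-digits:] : /[๐-๙]/
DEFINITION = re.compile(r"^(\[:[\w-]+:\])\s*:\s*/\[(.*)\]/\s*$")


def parse_class_definitions(lines):
    """Collect '[:name:] : /[...]/' lines into a {name: bracket contents} dict."""
    classes = {}
    for line in lines:
        match = DEFINITION.match(line.strip())
        if match:
            # Keep only what is inside the brackets (assumed design choice).
            classes[match.group(1)] = match.group(2)
    return classes


def expand_classes(pattern, classes):
    """Substitute every defined class name in a pattern with its contents."""
    for name, body in classes.items():
        pattern = pattern.replace(name, body)
    return pattern


if __name__ == "__main__":
    definitions = [
        "[:thai-alpha:] : /[ก-ฮ]/",
        "[:thai-digits:] : /[๐-๙]/",
    ]
    classes = parse_class_definitions(definitions)
    print(expand_classes("^[[:thai-digits:],.]*[[:thai-digits:]]$", classes))
```

With this approach the range is written once, so if a given regex library cannot handle it, only the definition line has to change.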

In order that all the regex libraries would work fine with Unicode ranges,

I don't mind. The right people to talk to are @wannaphong and @kaamanita. Note also that they still seem to be using their own branch of link-grammar and have not realized that I've integrated all of their work into mainline. So, hi @wannaphong and @kaamanita: please try mainline, and please comment on this bug report!

Hello. I am now using mainline, and I don't see the error.

/link-grammar# echo "ฉัน กิน ข้าว ๔ มื้อ" | link-parser th
link-grammar: Info: Dictionary found at ./data/th/4.0.dict
link-grammar: Info: Dictionary version 5.10.4, locale th_TH.UTF-8
link-grammar: Info: Library version link-grammar-5.10.4. Enter "!help" for help.
Found 12 linkages (12 had no P.P. violations)
        Linkage 1, cost vector = (UNUSED=0 DIS= 1.00 LEN=4)

    +-----LWs-----+
    |       +<-S<-+->O>+>NUnr>+->CLn->+
    |       |     |    |      |       |
LEFT-WALL ฉัน.pr กิน.v ข้าว.n ๔[!].nu มื้อ.cln

Bye.

Dockerfile

FROM ubuntu:focal
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Etc/UTC
RUN apt-get update && apt-get install git build-essential python3-dev libpcre2-dev python-is-python3 libtre-dev wget automake locales libtool flex m4 autoconf-archive autoconf pkg-config swig libthai-dev help2man -y && rm -rf /var/lib/apt/lists/*
RUN (echo "en_US.UTF-8 UTF-8" > /etc/locale.gen && \
     echo "ru_RU.UTF-8 UTF-8" >> /etc/locale.gen && \
     echo "he_IL.UTF-8 UTF-8" >> /etc/locale.gen && \
     echo "de_DE.UTF-8 UTF-8" >> /etc/locale.gen && \
     echo "lt_LT.UTF-8 UTF-8" >> /etc/locale.gen && \
     echo "fa_IR.UTF-8 UTF-8" >> /etc/locale.gen && \
     echo "ar_AE.UTF-8 UTF-8" >> /etc/locale.gen && \
     echo "kk_KZ.UTF-8 UTF-8" >> /etc/locale.gen && \
     echo "tr_TR.UTF-8 UTF-8" >> /etc/locale.gen && \
     echo "th_TH.UTF-8 UTF-8" >> /etc/locale.gen)
RUN locale-gen en_US.UTF-8 ru_RU.UTF-8 he_IL.UTF-8 de_DE.UTF-8 lt_LT.UTF-8 fa_IR ar_AE.UTF-8 kk_KZ.UTF-8 tr_TR.UTF-8 th_TH.UTF-8
ENV LANG=th_TH.UTF-8
ENV LANGUAGE=th_TH:th
ENV LC_ALL=th_TH.UTF-8
ENV PYTHONPATH=/usr/local/lib/python3.8/site-packages
RUN git clone https://github.com/opencog/link-grammar.git
WORKDIR link-grammar
RUN ./autogen.sh --disable-java-bindings --enable-python-bindings
RUN ./configure --disable-java-bindings --enable-python-bindings
RUN make
RUN make install
RUN ldconfig

I think you may be missing a package in your installation.

Hi all,

Please re-read @ampli's note. He points out that the problem can be solved by adding
[:thai-alpha:] : /[ก-ฮ]/
[:thai-digits:] : /[๐-๙]/
etc...
to 4.0.regex. That way, it will work with any regex library. This seems like the best long-term solution to me: it avoids future issues where people have forgotten that they need to install a different regex library.

So I'd like to ask ampli to make these changes, unless one of you vetoes that? @kaamanita @wannaphong

[:thai-alpha:] : /[ก-ฮ]/
[:thai-digits:] : /[๐-๙]/

Please note that, for now, the needed changes are the ones I pointed out in my reply to @Bancherd-DeLong, i.e. to list the needed Unicode characters explicitly instead of using ranges:

  1. Replace all occurrences of ๐-๙ with ๐๑๒๓๔๕๖๗๘๙
  2. Replace all occurrences of ก-ฮ with กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรฤลฦวศษสหฬอฮ

This is cumbersome but should work...
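As a quick sanity check that the explicit lists accept the same input as the ranges, the two forms of the THAI-NUMBERS pattern can be compared on a few samples. The sketch below uses Python's re module, which is not the libc regex implementation, so it only shows that the two patterns are equivalent on these inputs and says nothing about whether libc will compile them. The sample strings are made up for illustration.

```python
import re

# THAI-NUMBERS as quoted in the error message, and the rewritten form.
RANGE_PATTERN = re.compile(r"^[๐-๙,.]*[๐-๙]$")
EXPLICIT_PATTERN = re.compile(r"^[๐๑๒๓๔๕๖๗๘๙,.]*[๐๑๒๓๔๕๖๗๘๙]$")

# Illustrative samples: Thai numbers, a trailing comma, an ASCII digit,
# an empty string, and a Thai letter.
SAMPLES = ["๔", "๑,๒๓๔", "๑.๕", "๑,", "4", "", "ก"]


def patterns_agree(samples):
    """Return True if both patterns accept or reject every sample identically."""
    return all(
        bool(RANGE_PATTERN.match(s)) == bool(EXPLICIT_PATTERN.match(s))
        for s in samples
    )


if __name__ == "__main__":
    for sample in SAMPLES:
        print(repr(sample), bool(EXPLICIT_PATTERN.match(sample)))
    print("patterns agree:", patterns_agree(SAMPLES))
```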

The quoted change is what I propose once supporting code for it has been added to read-regex.c!

I will send a PR so that applying it solves the problem.

I think this is now fixed (in pull request #1297) and has been released as part of link-grammar-5.10.5 on the website. So, can this bug be closed?

Thank you very much!