opencog / link-grammar

The CMU Link Grammar natural language parser

Strippable affix class regexes

ampli opened this issue · comments

I finished implementing and testing it, and here are the examples I used:

% TODO: this list should be expanded with other "typical"(?) junk
% that is commonly (?) in broken texts.
-- ‒ – — ― "(" ")" "[" "]" ... ";" ±: MPUNC+;
% Split on comma's, but be careful with numbers:
% "The enzyme has a weight of 125,000 to 130,000"
% Also split on colons, but be careful not to mess up time
% expressions: "The train arrives at 13:42"
"/(?<!\d)[,:]|[,:](?!\d)/": MPUNC+;

In corpus-fixes.batch:

% Test tokenization by affix regexes.
% Sentence that should not be affected.
The enzyme has a weight of 125,000 to 130,000
The train arrives at 13:42
% Sentences that use punctuation without a trailing whitespace.
We used the same colors (red,blue,yellow).
The price of this item:$100
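
As a rough illustration of what the MPUNC regex above is meant to do with these four sentences, here is a minimal Python sketch. It is not the link-grammar tokenizer: `split_mpunc` is a hypothetical helper, and Python's `re` module merely stands in for PCRE2 because it also supports lookbehind and lookahead.

```python
import re

# The lookaround pattern from the affix entry, without the surrounding slashes.
MPUNC_RE = re.compile(r"(?<!\d)[,:]|[,:](?!\d)")

def split_mpunc(word):
    """Split a token at every comma or colon that is not between two digits.

    Hypothetical helper for illustration only.
    """
    parts = []
    last = 0
    for m in MPUNC_RE.finditer(word):
        if m.start() > last:
            parts.append(word[last:m.start()])
        parts.append(m.group())
        last = m.end()
    if last < len(word):
        parts.append(word[last:])
    return parts

for token in ["125,000", "13:42", "(red,blue,yellow).", "item:$100"]:
    print(token, "->", split_mpunc(token))
```

The number and the time expression come back as single tokens, while the comma and colon in the last two tokens are split off.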

LPUNC and RPUNC also support regexes, and I tested them in amy with /^[[:punct:]]/ and /[[:punct:]]$/ (respectively).

However, there is a problem: the lookaround regex is supported only when the library is configured with PCRE2. When configured with the C++ regex library, compilation of the lookbehind regex fails (lookbehind is not supported by C++). POSIX regexes (C library and TRE) also fail. (This is not really a problem for amy etc., since we don't need to support other regex libraries there.)

Possible solutions:

  1. Distribute it with commented-out affix regexes and that's all.
  2. Use autoconf to enable PCRE2 regexes if configured with PCRE2.
  3. Add configuration file support for '#if SOMETHING' when SOMETHING is HAVE_PCRE2_H.
  4. Only support PCRE2 on POSIX systems. (BTW, it is now easy for me to add PCRE2 support on MS-Windows too.)
  5. Add support for regex library specification (easy to implement):
     "/(?<!\d)[,:]|[,:](?!\d)/PCRE2" (or even flag "e" for "extended").

I am for (5) and otherwise for (2) or (1).
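
To make the library-specification idea in option (5) concrete, here is a small Python sketch of how an entry with a trailing library tag could be taken apart. The function name and the error handling are assumptions made for illustration; nothing here reflects the actual affix-file parser.

```python
def parse_affix_regex(entry):
    """Return (pattern, library_tag) for an entry of the form /PATTERN/TAG.

    Hypothetical helper: the tag is whatever follows the last slash,
    and it is an empty string when no tag is given.
    """
    if not entry.startswith("/"):
        raise ValueError("not a regex entry: " + entry)
    close = entry.rfind("/")
    if close == 0:
        raise ValueError("missing closing slash: " + entry)
    return entry[1:close], entry[close + 1:]

pattern, tag = parse_affix_regex(r"/(?<!\d)[,:]|[,:](?!\d)/PCRE2")
print(pattern, tag)
```

Searching for the last slash rather than the first lets the pattern itself contain slashes, at the cost of not allowing a slash inside the tag.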

EDIT: Fix the POSIX regex.

I found a better solution that all the regex libraries support:
Instead of lookahead/lookbehind, use a capture group for the matching part.
E.g., instead of:
"/(?<!\d)[,:]|[,:](?!\d)/"
use a POSIX regex:
"/\d([,:]|[,:])\d/"

I will change the code to support this too.

EDIT yet again:
"/\D([,:]|[,:])\D/"

EDIT:
\D didn't work for me, but [^^d] did.
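
The capture-group approach can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code that went into the library: the pattern `CTX_RE` and the helper `split_by_group` are hypothetical, and the pattern is one possible lookaround-free equivalent rather than any of the exact regexes quoted in this thread. It uses `[^0-9]` instead of `\D` to stay within POSIX extended syntax, and it treats the start and end of the word as non-digit context.

```python
import re

# No lookaround and no \d: the neighbouring character is matched as context,
# and the comma or colon itself is captured in group 2 or group 3.
CTX_RE = re.compile(r"(^|[^0-9])([,:])|([,:])([^0-9]|$)")

def split_by_group(word):
    """Split at the captured punctuation, then recurse on both sides.

    Recursing (rather than scanning on from the end of the match) avoids
    missing a split point whose context character was consumed by an
    earlier match.
    """
    m = CTX_RE.search(word)
    if m is None:
        return [word] if word else []
    group = 2 if m.group(2) is not None else 3
    start, end = m.span(group)
    return split_by_group(word[:start]) + [word[start:end]] + split_by_group(word[end:])

for token in ["125,000", "13:42", "(red,blue,yellow).", "item:$100"]:
    print(token, "->", split_by_group(token))
```

Because the context character is part of the match, the caller has to split at the span of the capture group, not at the span of the whole match.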

@linas,
To solve the split problem you pointed out in your comment on MPUNC, I implemented an MPUNC regex mechanism that uses lookahead/lookbehind (directly or indirectly) in an attempt not to split numbers with commas or times with colons. It works.

However, it seems there is a simpler solution that doesn't use a regex affix: use : and , in MPUNC, and just don't MPUNC-split words that match a regex (in contrast to morpheme-split, which is done before trying a regex).
I will try to implement that, and for now leave the MPUNC regex in use for the sake of any/ady/amy (as a simple split on [[:punct:]]).

Another thing:
The corpus test sentences I used are not good enough: even if they are not split as intended, they still parse fine, because a word with an internal colon or comma is looked up as UNKNOWN-WORD. It is hard to find sentences that fail to parse in that case. This is a general problem: sentences with junk in them get parsed just fine.
Does this need a solution?
If so, should we just have a regex category JUNK with no possible linkage, for words with junk in them?

I said above:

just don't MPUNC-split words that match a regex [...]
I will try to implement that, [...]

If a word contains two kinds of punctuation, one that has to be separated and one that should not be, this approach wouldn't work, since the word as a whole either matches a regex or does not. So I will send the PR that splits by affix regexes.

EDIT:

just don't MPUNC-split words that match a regex [...]

This was a bad idea since in general such matches have nothing to do with word splits.