Strippable affix class regexes

ampli opened this issue 2 years ago · comments

I finished implementing and testing it, and here are the examples I used:

% TODO: this list should be expanded with other "typical"(?) junk
% that is commonly (?) in broken texts.
-- ‒ – — ― "(" ")" "[" "]" ... ";" ±: MPUNC+;
% Split on comma's, but be careful with numbers:
% "The enzyme has a weight of 125,000 to 130,000"
% Also split on colons, but be careful not to mess up time
% expressions: "The train arrives at 13:42"
"/(?<!\d)[,:]|[,:](?!\d)/": MPUNC+;

In corpus-fixes.batch:

% Test tokenization by affix regexes.
% Sentence that should not be affected.
The enzyme has a weight of 125,000 to 130,000
The train arrives at 13:42
% Sentences that use punctuation without a trailing whitespace.
We used the same colors (red,blue,yellow).
The price of this item:$100
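
To make the intended behavior concrete, here is a minimal Python sketch (not the link-grammar tokenizer itself) that applies the same lookbehind/lookahead regex to the words from the test sentences. The helper name `mpunc_split` is hypothetical, and the sketch ignores LPUNC/RPUNC stripping, so parentheses and the final period stay attached.

```python
import re

# The MPUNC regex from the affix file, without the "/.../" delimiters.
# The outer capture group makes re.split() keep the punctuation as a token.
MPUNC_RE = re.compile(r"((?<!\d)[,:]|[,:](?!\d))")


def mpunc_split(word):
    """Split a word at every regex match, keeping each match as its own token."""
    return [part for part in MPUNC_RE.split(word) if part]


if __name__ == "__main__":
    for word in ["125,000", "13:42", "(red,blue,yellow).", "item:$100"]:
        print(word, "->", mpunc_split(word))
```

Numbers and time expressions come back as a single token, while `(red,blue,yellow).` and `item:$100` are split at the comma and the colon.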

LPUNC and RPUNC also support regexes, and I tested them in amy with /^[[:punct:]]/ and /[[:punct:]]$/ (respectively).

However, there is a problem: this works only when the library is configured with PCRE2. When it is configured to use the C++ regex library, compiling the lookbehind regex fails, because C++ regexes do not support lookbehind. POSIX regexes (C library and TRE) also fail. (This is not really a problem for amy etc., since we don't need to support other regex libraries there.)

Possible solutions:

1. Distribute it with commented-out affix regexes and that's all.
2. Use autoconf to enable PCRE2 regexes if configured with PCRE2.
3. Add configuration file support for '#if SOMETHING' when SOMETHING is HAVE_PCRE2_H.
4. Only support PCRE2 on POSIX systems. (BTW, it is now easy for me to add PCRE2 support on MS-Windows too.)
5. Add support for regex library specification (easy to implement):
"/(?<!\d)[,:]|[,:](?!\d)/PCRE2" (or even flag "e" for "extended").

I am for (5) and otherwise for (2) or (1).
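
As an illustration of option (5) only, the following Python sketch separates a trailing library name from a "/pattern/SUFFIX" token. The function name `parse_affix_regex` and the exact token syntax are assumptions made for this example; it is not the link-grammar implementation, and it assumes the surrounding double quotes have already been removed by the dictionary reader.

```python
def parse_affix_regex(token):
    """Split '/pattern/SUFFIX' into (pattern, suffix).

    The suffix is '' when nothing follows the closing '/'.
    Hypothetical helper: the syntax is taken from the proposal in the thread.
    """
    if not token.startswith("/"):
        raise ValueError("affix regex must start with '/'")
    end = token.rfind("/")  # the last '/' closes the pattern
    if end == 0:
        raise ValueError("affix regex is missing its closing '/'")
    return token[1:end], token[end + 1:]


if __name__ == "__main__":
    print(parse_affix_regex(r"/(?<!\d)[,:]|[,:](?!\d)/PCRE2"))
    print(parse_affix_regex(r"/[[:punct:]]$/"))
```

A caller could then skip (or report) any regex whose suffix names a library that was not compiled in.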

Amir Plivatsky · Answer 1 · Sun Jul 31 2022 22:40:18 GMT+0800 (China Standard Time)

I am for (5) and otherwise for (2) or (1).

EDIT: Fix the POSIX regex.

I found a better solution, that all the regex libraries support:
Instead of lookahead/lookbehind, use a capture group for the matching part.
e.g., instead of:
"/(?<!\d)[,:]|[,:](?!\d)/"
use a POSIX regex:
"/\d([,:]|[,:])\d/"

I will change the code to support this too.

EDIT yet again:
"/\D([,:]|[,:])\D/"

EDIT:
\D didn't work for me, but [^^d] did.
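
The capture-group idea can be sketched in Python as follows. The two lookaround-free patterns here are my own assumption, chosen for the illustration, and are not necessarily the regexes used in the eventual change. Because the context character is consumed rather than merely inspected, the sketch restarts the search one character after each match start so that overlapping candidates are not missed.

```python
import re

# Lookaround form from the thread, with \d written as [0-9].
LOOKAROUND = re.compile(r"(?<![0-9])[,:]|[,:](?![0-9])")

# Assumed lookaround-free forms: the context is matched (consumed) and the
# punctuation itself sits in a capture group.
BEFORE = re.compile(r"(^|[^0-9])([,:])")  # punctuation is group 2
AFTER = re.compile(r"([,:])([^0-9]|$)")   # punctuation is group 1


def capture_positions(word):
    """Indices of the [,:] characters to split off, found via capture groups."""
    found = set()
    for pattern, group in ((BEFORE, 2), (AFTER, 1)):
        pos = 0
        while pos <= len(word):
            m = pattern.search(word, pos)
            if m is None:
                break
            found.add(m.start(group))
            # Restart just after the match start to catch overlapping matches.
            pos = m.start() + 1
    return sorted(found)


def lookaround_positions(word):
    """Indices matched by the lookbehind/lookahead regex, for comparison."""
    return [m.start() for m in LOOKAROUND.finditer(word)]


if __name__ == "__main__":
    for word in ["125,000", "13:42", "(red,blue,yellow).", "item:$100", "1,,2"]:
        print(word, capture_positions(word), lookaround_positions(word))
```

For the words shown, both functions report the same split positions.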

Amir Plivatsky · Answer 2 · Sun Jul 31 2022 22:47:46 GMT+0800 (China Standard Time)

@linas,
To solve the split problem you pointed out in your comment on MPUNC, I implemented an MPUNC regex mechanism that uses lookahead/lookbehind (directly or indirectly) to avoid splitting numbers with commas or times with colons. It works.

However, it seems there is a simpler solution that doesn't use a regex affix: use : and , in MPUNC, and just don't MPUNC-split words that match a regex (in contrast to morpheme-split, which is done before trying a regex).
I will try to implement that, and for now, leave the use of MPUNC-regex for the sake of any/ady/amy (as a simple split on [[:punct:]]).

Another thing:
The corpus test sentences I used are not good enough: if they are not split as intended, they still parse fine, because the word with an internal colon or comma is looked up as UNKNOWN-WORD. It is hard to find sentences that fail to parse in that case. This is a general problem: sentences with junk in them get parsed just fine.
Does this need a solution?
If so, should we just have a regex category JUNK with no possible linkage, for words with junk in them?

Amir Plivatsky · Answer 3 · Tue Aug 02 2022 08:02:29 GMT+0800 (China Standard Time)

I said above:

just don't MPUNC-split words that match a regex [...]
I will try to implement that, [...]

If a word contains two kinds of punctuation, one that has to be separated and one that should not be, this approach wouldn't work, since the word as a whole can only either match or not match a regex. So I will send the PR that splits by affix regexes.

EDIT:

just don't MPUNC-split words that match a regex [...]

This was a bad idea since in general such matches have nothing to do with word splits.