delph-in / pydelphin

Python libraries for DELPH-IN

Home Page: https://pydelphin.readthedocs.io/

Implement REPP masking

goodmami opened this issue

The latest ERG makes use of the new 'mask' operator (=) for REPP, as described in the email thread starting here:

http://lists.delph-in.net/archives/developers/2020/003107.html

Essentially, substrings matching a mask pattern are protected from further modification. For example, the following rule masks email addresses so that later punctuation-splitting rules do not break them up:

=<?[\p{L}\p{N}._-]+@[\p{L}\p{N}_-]+(?:\.[\p{L}\p{N}_-]+)*\.[\p{L}\p{N}]+>?
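To make the intended effect concrete, here is a minimal sketch in plain Python. It is not pydelphin's API. Since the stdlib `re` module has no `\p{L}`/`\p{N}`, the pattern is approximated with `\w` and `[^\W_]`. `SPLIT_PUNCT` and both function names are hypothetical, standing in for a later punctuation-splitting rule.

```python
import re

# Stdlib approximation of the mask pattern above (assumption): \w stands in
# for [\p{L}\p{N}_] and [^\W_] for [\p{L}\p{N}].
EMAIL_MASK = re.compile(r"<?[\w.-]+@[\w-]+(?:\.[\w-]+)*\.[^\W_]+>?")

# Hypothetical later rule: insert a space before punctuation characters.
SPLIT_PUNCT = re.compile(r"([.,!?])")


def mask_spans(text):
    """Return the (start, end) spans matched by the mask pattern."""
    return [m.span() for m in EMAIL_MASK.finditer(text)]


def split_punct_outside_masks(text):
    """Apply SPLIT_PUNCT only to the stretches of text that are not masked."""
    out = []
    pos = 0
    for start, end in mask_spans(text):
        # Unmasked text before this mask may be rewritten.
        out.append(SPLIT_PUNCT.sub(r" \1", text[pos:start]))
        # Masked text is copied through unchanged.
        out.append(text[start:end])
        pos = end
    out.append(SPLIT_PUNCT.sub(r" \1", text[pos:]))
    return "".join(out)
```

With this sketch, `split_punct_outside_masks("Mail me at kim.lee@example.com, thanks.")` gives `"Mail me at kim.lee@example.com , thanks ."`, whereas applying `SPLIT_PUNCT` to the whole string would also split the dots inside the address.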

Masked sections can be tracked with a BIO sequential-tagging scheme so adjacent masks work even when content is inserted between them.
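One way to picture the BIO idea is a per-character tag list: `B` marks the first character of a masked section, `I` a continuation, and `O` anything unmasked. The sketch below is only an illustration under that assumption, not the pydelphin implementation; `bio_tags` and `masked_spans` are hypothetical names.

```python
import re


def bio_tags(text, mask):
    """Tag each character: B = mask start, I = inside mask, O = outside."""
    tags = ["O"] * len(text)
    for m in mask.finditer(text):
        start, end = m.span()
        if start == end:
            continue  # ignore empty matches
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def masked_spans(tags):
    """Recover (start, end) spans of masked sections from the tags."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        # Any tag other than I closes the section that is currently open.
        if tag != "I" and start is not None:
            spans.append((start, i))
            start = None
        if tag == "B":
            start = i
    if start is not None:
        spans.append((start, len(tags)))
    return spans
```

For example, with a toy mask `re.compile(r"\d\d")` the text `"1234"` is tagged `B I B I`, so the two adjacent masks stay distinct as `(0, 2)` and `(2, 4)`. If a rule then inserts a character between them and tags it `O`, the tags become `B I O B I` and the spans shift to `(0, 2)` and `(3, 5)` without merging or losing either mask.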