Implement REPP masking
goodmami opened this issue · comments
Michael Wayne Goodman commented
The latest ERG makes use of the new 'mask' operator (=) for REPP, as described in the email thread starting here:
http://lists.delph-in.net/archives/developers/2020/003107.html
Essentially, substrings matching a mask pattern are protected from further modification. For example, the following rule masks email addresses so that later punctuation-splitting rules do not break them up:
=<?[\p{L}\p{N}._-]+@[\p{L}\p{N}_-]+(?:\.[\p{L}\p{N}_-]+)*\.[\p{L}\p{N}]+>?
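To make the intended effect concrete, here is a minimal Python sketch, not the actual REPP machinery. It makes two assumptions: the standard `re` module has no `\p{L}` or `\p{N}`, so `\w` stands in for them, and `split_punct` is a hypothetical stand-in for a later punctuation-splitting rule. The helper names are made up for illustration.

```python
import re

# Approximation of the mask pattern above: \w (letters, digits, underscore)
# replaces [\p{L}\p{N}_], and [^\W_] replaces [\p{L}\p{N}].
EMAIL_MASK = re.compile(r"<?[\w.-]+@[\w-]+(?:\.[\w-]+)*\.[^\W_]+>?")

# Hypothetical later rule: put spaces around punctuation characters.
PUNCT = re.compile(r"([.,!?@<>])")


def split_punct(text):
    """Separate punctuation from neighbouring text with spaces."""
    return PUNCT.sub(r" \1 ", text)


def apply_with_mask(text):
    """Apply split_punct only to the text outside masked spans."""
    out = []
    pos = 0
    for m in EMAIL_MASK.finditer(text):
        out.append(split_punct(text[pos:m.start()]))  # unmasked: rewritten
        out.append(m.group())                         # masked: copied as-is
        pos = m.end()
    out.append(split_punct(text[pos:]))
    return "".join(out)


if __name__ == "__main__":
    sample = "Mail me at bob@example.com, please."
    print(split_punct(sample).split())      # address is broken up
    print(apply_with_mask(sample).split())  # address survives as one token
```

Without the mask, the splitting rule tears the address apart at `@` and `.`; with it, only the surrounding comma and full stop are separated.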
Masked sections can be tracked with a BIO sequential-tagging scheme so adjacent masks work even when content is inserted between them.
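One possible reading of the BIO idea, sketched in Python under my own assumptions (per-character tags, and hypothetical helper names `bio_tags` and `mask_spans`): each character is tagged `B` (begins a mask), `I` (inside a mask), or `O` (outside). A plain masked/unmasked flag would merge two masks that touch, whereas the `B` tag keeps the boundary between them.

```python
import re


def bio_tags(text, mask):
    """Tag each character: 'B' starts a mask, 'I' continues it, 'O' is unmasked."""
    tags = ["O"] * len(text)
    for m in mask.finditer(text):
        if m.end() > m.start():  # ignore empty matches
            tags[m.start()] = "B"
            for i in range(m.start() + 1, m.end()):
                tags[i] = "I"
    return tags


def mask_spans(tags):
    """Recover (start, end) spans from a BIO tag sequence."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag != "I" and start is not None:
            spans.append((start, i))  # current mask ends before this position
            start = None
        if tag == "B":
            start = i
    if start is not None:
        spans.append((start, len(tags)))
    return spans


if __name__ == "__main__":
    mask = re.compile(r"#\w+")    # toy mask pattern, for illustration only
    tags = bio_tags("#a#b c", mask)
    print("".join(tags))
    print(mask_spans(tags))
    tags.insert(2, "O")           # simulate inserting a character between the masks
    print(mask_spans(tags))
```

Because the tags travel with the characters, inserting unmasked material between two adjacent masks shifts the second span but leaves both intact.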