mojaloop / mojaloop-specification

This repo contains the specification document set of the Open API for FSP Interoperability

Home Page: https://docs.mojaloop.io/api

Support for accented characters in data type "Name"

mjbrichards opened this issue · comments

Describe the bug
In Section 7.2.4.1 of the API specification, the definition of the regular expression to parse a variable of the Name type states: "all Unicode32 characters are allowed". In fact, accented and non-Roman characters are rejected by the example regular expression given in Listing 14.

To Reproduce
Steps to reproduce the behavior:

  1. Go to https://regex101.com/
  2. Paste the regular expression in Listing 14 into the REGULAR EXPRESSION box
  3. Paste any text with accented characters into the TEST STRING box (e.g. "Côte d'Ivoire")
  4. MATCH INFORMATION box displays "Your regular expression does not match the subject string."

Expected behavior
MATCH INFORMATION box displays:

Match 1

Full match | 0-13 | Côte d'Ivoire

Desktop (please complete the following information):

  • Windows 10
  • Chrome
  • Version 80.0.3987.122 (Official Build) (64-bit)

Additional information:

Replace the regular expression in Listing 14 with:
^(?!\s*$)[(\p{L}|\p{Nd}) .,'-]{1,128}$
This uses the Unicode character groups for "any letter" and "any numeric digit"
Now the accented characters (and non-Roman scripts) are parsed correctly.
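
For readers who want to reproduce this outside regex101, here is a minimal Java sketch. The class and method names are hypothetical, and the first pattern is the Listing 14 expression as it is quoted later in this thread. It is compiled with Java's default flags, so it shows one engine's behaviour only, not a language-independent result.

```java
import java.util.regex.Pattern;

// Hypothetical class name; both patterns are copied from this thread.
final class NameRegexRepro {
    // Listing 14, compiled with Java's default flags (ASCII-only word class).
    static final Pattern LISTING_14 =
            Pattern.compile("^(?!\\s*$)[\\w .,'-]{1,128}$");

    // The replacement proposed in this report, copied verbatim.
    static final Pattern PROPOSED =
            Pattern.compile("^(?!\\s*$)[(\\p{L}|\\p{Nd}) .,'-]{1,128}$");

    static boolean accepts(Pattern pattern, String name) {
        return pattern.matcher(name).find();
    }

    public static void main(String[] args) {
        // The accented o is written as an escape so that source-file
        // encoding cannot affect the result.
        String name = "C\u00f4te d'Ivoire";
        System.out.println(accepts(LISTING_14, name));   // false
        System.out.println(accepts(PROPOSED, name));     // true
        // Inside square brackets the parentheses and the pipe are literals,
        // so the proposed expression also admits them as name characters.
        System.out.println(accepts(PROPOSED, "a|(b)"));  // true
    }
}
```

The last line illustrates a side effect of the proposed expression: grouping and alternation have no meaning inside a character class, so `(`, `|` and `)` become permitted characters.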

Why did you not include the note which is meant to deal with this? From Section 7.2.4.1:

Note: In some programming languages, Unicode support must be specifically enabled. For example, if Java is used the flag UNICODE_CHARACTER_CLASS must be enabled to allow Unicode characters.

There are other ways to enable support for Unicode characters in other programming languages; for example, in Perl (the API Definition document uses Perl's notation for regular expressions) you should use /u as a modifier. I'm assuming the problem comes from the open source switch, which is implemented in JavaScript? Please see, for example, https://stackoverflow.com/questions/280712/javascript-unicode-regexes for more information regarding JavaScript.

Please also note that https://regex101.com/ is not an official definition of how regular expressions work.

Because it doesn't deal with it. As the bug report says, it's independent of the implementation, and of the programming languages that might be used. As you rightly say, regex101 isn't an official definition of how regular expressions work. But it is an implementation-independent way of representing the problem. Are you saying that this isn't a problem? Because Mowali definitely think it is, and that it is unaffected by the flag you propose as a solution.

The flag mentioned in the note only deals with Java as an example, not other programming languages. The Switch (which I assume is what you mean by Mowali) is implemented in JavaScript. Have you looked at the link that I provided?

OK, let me put this another way. Did you test with accented characters, and are you therefore confident that it's an implementation-dependent issue?

Yes, I have tested the regular expression in Java with accented characters using the flag mentioned in the note.

You can try the following in Java which should work fine (at least in Java 7 and 8, which are the only versions that I currently have installed on this machine):

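    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    public class NameRegexCheck {
        public static void main(final String[] args) {
            // UNICODE_CHARACTER_CLASS makes \w match Unicode letters and digits, not only ASCII
            final Pattern p = Pattern.compile("^(?!\\s*$)[\\w .,'-]{1,128}$", Pattern.UNICODE_CHARACTER_CLASS);
            final Matcher matcher = p.matcher("Côte d'Ivoire");
            System.out.println(matcher.find());
        }
    }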

I looked at the link you provided. As far as I can see, it contains various suggestions for work-arounds in the content of the regex expression. These are equivalent, as far as I can see, to the form of solution that I proposed; but perhaps I have mistaken your intention in making the reference. Can you confirm?

As a general principle, I think that our aim should be that the content of the regex expression should not require language-specific work-arounds, if only because there is, at least in principle, a very large number of language instances for which we might need to provide specific suggestions, and I would very much prefer that the API should remain independent of the languages which might be used to implement it. I would rather see a regex which we are confident will work in any language context - which is both the reason why I proposed the solution I did and the thinking behind using a language-independent way of verifying the regex. Would you agree, or not?

@MichaelJBRichards, the problem is that the suggested regex will not work in every language context either. As an example for JavaScript, support for Unicode property escapes was first added in ECMAScript 2018. I can happily agree that it would be great if we could find some regex that would work in every language out of the box, without needing the note that says Unicode support should be enabled, but to my knowledge that is not possible.

@millerabel, your comments regarding the errors in the regex are correct, but not allowing a leading number would mean a breaking change for existing implementations (that have correctly understood the note in 7.2.4.1 that says that you need to enable support for Unicode), as leading numbers are currently allowed. As such, ^(?!\s*$)[\p{L}\p{Nd} .,'-]{1,128}$ would be the replacement if we should use Unicode property escapes instead of the existing note. See also my comment above to Michael regarding support for Unicode property escapes (or the lack of it in JavaScript before ES2018).

Hi @millerabel,

First let me just say that I'm very positive regarding your suggestions for making the regex stricter, I just don't want a stricter regex in the current major version (1) as it would introduce a change in what is allowed to be sent in the Name element. Some of our operators can sometimes be very creative when setting names and similar, meaning we would need to create scripts to find any matches against the new stricter regex to avoid any issues. Can we please open a new change request to improve the regex, which can then be incorporated in the next major version, as this specific issue is regarding a bug in the current version?

This is why I proposed ^(?!\s*$)[\p{L}\p{Nd} .,'-]{1,128}$, as that should be equivalent to the existing regex ^(?!\s*$)[\w .,'-]{1,128}$. The former requires that Unicode property escapes are supported, the latter that you have programmatically enabled support for Unicode (for example by using the modifier /u from ES6 for JavaScript, or the flag UNICODE_CHARACTER_CLASS from Java 7). By the way, using the /u modifier also makes the existing regex work in https://regex101.com/ for accented characters.
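
A small Java sketch of how close the two expressions are in practice. The class and method names are hypothetical. Under Java's UNICODE_CHARACTER_CLASS flag the two agree on ordinary names, but \w is a slightly wider class than \p{L}\p{Nd}: among other things it still includes the underscore, so the equivalence is approximate rather than exact in this engine.

```java
import java.util.regex.Pattern;

// Hypothetical class name; both patterns are copied from this thread.
final class NameRegexEquivalence {
    // Existing regex, with Unicode support enabled as the note requires.
    static final Pattern WORD_CLASS = Pattern.compile(
            "^(?!\\s*$)[\\w .,'-]{1,128}$", Pattern.UNICODE_CHARACTER_CLASS);

    // Property-escape variant; no flag is needed for \p{...} in Java.
    static final Pattern PROPERTY_CLASS = Pattern.compile(
            "^(?!\\s*$)[\\p{L}\\p{Nd} .,'-]{1,128}$");

    static boolean accepts(Pattern pattern, String name) {
        return pattern.matcher(name).find();
    }

    public static void main(String[] args) {
        String[] samples = { "C\u00f4te d'Ivoire", "1st Bank", "first_last" };
        for (String sample : samples) {
            System.out.println(sample
                    + " word=" + accepts(WORD_CLASS, sample)
                    + " property=" + accepts(PROPERTY_CLASS, sample));
        }
        // Only "first_last" differs: the underscore is a word character
        // but is neither a letter nor a decimal digit.
    }
}
```

Whether the underscore should count as a breaking difference between the two forms is a question for the change request, not something this sketch settles.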

As using the modifier /u seems to be more backwards-compatible than Unicode property escapes, I'm actually leaning towards keeping the existing regex for now, but possibly also mentioning that in the existing note.

The API spec has many ambiguities that must be iteratively driven out.

I can't agree more. The more the API is used, the more ambiguities we will find that need to be corrected.

The behavior allowed by the previous RE specification was never correct to begin with!

It was correct if you followed the note saying that you need to enable support for Unicode. This can be achieved in different ways depending on the programming language (and version), see above. The regex itself can be improved to make it stricter, but that is for another major version.

I’m noting another error in both my and your RE: They both permit trailing blanks which should not be allowed.

Trailing blanks are also allowed in the current regex. Not permitting trailing blanks would mean a breaking change, requiring a new major version.
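
To make the current behaviour concrete, here is a hedged Java sketch (hypothetical class and method names) that runs the existing expression, with Unicode support enabled, against names with leading and trailing blanks. The only whitespace-related input the lookahead rejects is a string that is empty or consists entirely of whitespace.

```java
import java.util.regex.Pattern;

// Hypothetical class name; the pattern is the existing regex from this thread.
final class NameRegexBlanks {
    static final Pattern CURRENT = Pattern.compile(
            "^(?!\\s*$)[\\w .,'-]{1,128}$", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean accepts(String name) {
        return CURRENT.matcher(name).find();
    }

    public static void main(String[] args) {
        System.out.println(accepts("Alice "));  // true: trailing blank allowed
        System.out.println(accepts(" Alice"));  // true: leading blank allowed
        System.out.println(accepts("   "));     // false: whitespace only
        System.out.println(accepts(""));        // false: empty
    }
}
```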

By “UNICODE32” did you mean Unicode 3.2, a deprecated prior version of the specification, or did you mean UTF-32, which is an encoding format for Unicode (I hope you don’t mean this!)?

If you read the original spec you will find that the 32 is a footnote. It is just a copy-paste error from the PDF-file to markdown which makes it look like it says Unicode32.

As is mentioned here in the RFC, just saying "all Unicode characters are allowed" is worse than saying "use ASCII." And in this case, is just wrong. All Unicode characters are not allowed.

You are entirely correct.

We should specify the specific minimum version of Unicode that implementers are expected to support. Support for Burmese (the Myanmar script) entered the Unicode specification at version 5.1.0

Sounds good!

We need a much tighter RE specification for what we mean by Name and a tighter English specification. Something like this:

Your suggestion sounds good on a quick read. The "Then remove any leading and trailing spaces.." part can then be removed when we introduce the stricter regex.

I hope I managed to answer most of your comments. Thank you for the detailed review of the type!

I agree with Miller. Since it seems clear that the use of specific regex expressions will lead us into potentially unwise assumptions about the technology used to implement the specification, would it not be simpler to give a clear English-language statement of the rules to be followed and the minimum standards (e.g. Unicode) that implementers must meet?

So I think I'd like to suggest that there should be a reference somewhere in the API specification (preferably in a table of such versioning references) to the Unicode release level. Within that, we should use references to the Unicode General Categories that we allow or prohibit in a field of a specific type. So we might rewrite Miller's proposal to say:

"Letters, both accented and unaccented, being chosen from all code points belonging to the Letter and Decimal_Number general categories as defined in the reference version of the Unicode specification (with link to reference.) In addition, the period (.), apostrophe ('), dash (-), comma (,) and space character are permitted. Interior spaces are allowed, but no leading or trailing spaces. For the avoidance of doubt, Names may include leading digits."

We can then allow implementers to decide how best to meet these requirements.
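
As an illustration of what "deciding how best to meet these requirements" could look like without any regular expression, here is a minimal Java sketch of the rule as worded above. All names are hypothetical. The 128 limit is taken from the {1,128} quantifier discussed in this thread, and counting it in code points rather than UTF-16 units is an assumption. The categories come from whatever Unicode version the running JDK ships, not from a pinned reference version.

```java
// Hypothetical class name; a sketch of the proposed English-language rule.
final class NameRule {
    // Assumption: the limit is counted in Unicode code points.
    static final int MAX_LENGTH = 128;

    // Letter or Decimal_Number general category, or one of the five
    // explicitly permitted characters: period, space, comma, apostrophe, dash.
    static boolean isAllowed(int codePoint) {
        return Character.isLetter(codePoint)
                || Character.getType(codePoint) == Character.DECIMAL_DIGIT_NUMBER
                || ". ,'-".indexOf(codePoint) >= 0;
    }

    static boolean isValidName(String name) {
        int length = name.codePointCount(0, name.length());
        if (length < 1 || length > MAX_LENGTH) {
            return false;
        }
        // Interior spaces are allowed, leading and trailing spaces are not.
        if (name.startsWith(" ") || name.endsWith(" ")) {
            return false;
        }
        return name.codePoints().allMatch(NameRule::isAllowed);
    }

    public static void main(String[] args) {
        System.out.println(isValidName("C\u00f4te d'Ivoire")); // true
        System.out.println(isValidName("1st Bank"));           // true
        System.out.println(isValidName("Alice "));             // false
    }
}
```

This version is stricter than the current regex, because it rejects leading and trailing spaces and the underscore, so it models the proposed wording rather than existing behaviour.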

With regard to Miller's suggestion on the canonical form of a name, I wonder whether it might not be better to have a section which explains in general how to obtain the canonical form of any data item, and how to compare two instances of canonical names for equality. I don't have any quarrel with his statement of the method, but I suspect that we may find ourselves duplicating information if we make this part of the definition of the individual type.
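
Miller's method is not reproduced in this thread, so the following Java sketch does not represent it. It only assumes that a canonical form would involve Unicode normalization (NFC is chosen here as an assumption) and shows why that matters for accented names. The class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical class name; NFC as the canonical form is an assumption.
final class NameCanonicalForm {
    static final Pattern PROPERTY_CLASS = Pattern.compile(
            "^(?!\\s*$)[\\p{L}\\p{Nd} .,'-]{1,128}$");

    static String canonical(String name) {
        return Normalizer.normalize(name, Normalizer.Form.NFC);
    }

    static boolean sameName(String a, String b) {
        return canonical(a).equals(canonical(b));
    }

    public static void main(String[] args) {
        String composed = "C\u00f4te";     // single precomposed code point
        String decomposed = "Co\u0302te";  // o followed by a combining circumflex

        System.out.println(composed.equals(decomposed));    // false
        System.out.println(sameName(composed, decomposed)); // true

        // The combining mark is not a letter, so the decomposed spelling
        // fails the property-escape regex until it is normalized.
        System.out.println(PROPERTY_CLASS.matcher(decomposed).find());            // false
        System.out.println(PROPERTY_CLASS.matcher(canonical(decomposed)).find()); // true
    }
}
```

If a general section on canonical forms is written, it would probably need to say whether validation happens before or after normalization, since the two orders give different results for decomposed input.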

I agree with Miller. Since it seems clear that the use of specific regex expressions will lead us into potentially unwise assumptions about the technology used to implement the specification, would it not be simpler to give a clear English-language statement of the rules to be followed and the minimum standards (e.g. Unicode) that implementers must meet?

Are you suggesting that we exclude the regular expression completely? I would very much like to have a good English-language statement of the rules, but I would also like to keep at least one of the regular expression versions that have been discussed in this issue in the specification as an example to follow the rules. Otherwise I think we risk ending up in different versions of the regular expressions that are not necessarily entirely compatible.

"Letters, both accented and unaccented, being chosen from all code points belonging to the Letter and Decimal_Number general categories as defined in the reference version of the Unicode specification (with link to reference.) In addition, the period (.), apostrophe ('), dash (-), comma (,) and space character are permitted. Interior spaces are allowed, but no leading or trailing spaces. For the avoidance of doubt, Names may include leading digits."

Some comments:

  • There is no "Decimal_Number general categor[y]" as far as I'm aware?
  • ", but no leading or trailing spaces", as I said in my earlier comment, this cannot be true for the current major version (1.x). We should not change to a stricter regex regarding what is allowed to be sent in a minor version. This can be added in the next major version.
  • A general comment (not really related to this text) is that there are other sections as well that should probably contain more detail. For example 7.2.11 and 7.2.12.

/^(?!\s*$)[\p{L}\p{Nd} .,'-]{1,128}$/u

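Please note that in Perl and Java you do not need the modifier /u (or an equivalent flag) when you are using \p{L}\p{Nd}. There it is only required to enable Unicode when using \w, making the existing regular expression work as intended. JavaScript is the exception: it only recognizes \p{...} when the u flag is set.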

Spaces can’t be a prefix or suffix and will be stripped by any reasonable implementation. We should require that in the specification to remove the ambiguity that this RE otherwise implies. “Don’t send; tolerate but ignore on receive.” That’s the language of the resilient Internet.

Sounds good to me.
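
A minimal sketch of the "tolerate but ignore on receive" half of that rule, in Java with hypothetical names. It assumes that "spaces" means the plain space character (U+0020) only, and that the stripped value is then checked against the existing regex with Unicode support enabled. Neither assumption is stated in the specification.

```java
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical class name; a sketch of receive-side tolerance only.
final class NameReceiver {
    static final Pattern CURRENT = Pattern.compile(
            "^(?!\\s*$)[\\w .,'-]{1,128}$", Pattern.UNICODE_CHARACTER_CLASS);

    // Strips leading and trailing U+0020 spaces, then validates what is left.
    // Returns the cleaned name, or empty if nothing valid remains.
    static Optional<String> receive(String raw) {
        String stripped = raw.replaceAll("^ +| +$", "");
        return CURRENT.matcher(stripped).find()
                ? Optional.of(stripped)
                : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(receive("  Alice "));  // Optional[Alice]
        System.out.println(receive("   "));       // Optional.empty
    }
}
```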

Please never use [0-9] when a Unicode human readable text is being parsed ...

Is this comment directed at some earlier comment in this issue, at the specification, or just meant in general? As far as I know there are no instances of [0-9] in the specification meant for Unicode human readable text? Otherwise please point to where in the specification you think this needs to be updated.

An important consideration is that when parsing machine-readable text that is assumed to be ASCII (such as in communication protocols, e.g. an IP address), ...

Same question as above.