Incorrect normalization of character sequence "%EF%BD%9E"
jaimeiniesta opened this issue · comments
Hello, we've found an issue in MetaInspector when normalizing a Japanese URL. Here's how to reproduce it on Ruby 2.1.2 and Addressable 2.3.6:
require 'open-uri'
require 'addressable/uri'
url = 'http://ja.wikipedia.org/wiki/Template:%EF%BD%9E'
normalized_url = Addressable::URI.parse(url).normalize.to_s
puts open(url).status #=> 200
puts open(normalized_url).status #=> 404
This URL is being normalized to "http://ja.wikipedia.org/wiki/Template:~", which looks like the original but is not the same URL:
This example screenshot is what you get when opening the URL in a browser (Chrome in my case):
I'm not sure what it should be normalized to, but it looks like this character sequence should remain untouched, instead of being converted to "~".
Related:
I believe that Wikipedia is the one that's wrong here. URI normalization requires Unicode normalization form NFKC, which tries to eliminate look-alike characters like this one as an explicit goal. I'm sure you're aware of the danger of phishing attacks; that's the primary concern behind the choice of normalization form.
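For anyone following along, here is a minimal sketch of what NFKC does to this particular character. It uses only the Ruby standard library and is not Addressable's internal code; the helper name `nfkc_of_percent_encoded` is made up for illustration. Note that `String#unicode_normalize` needs Ruby 2.2 or later, so it will not run on the 2.1.2 mentioned in the report.

```ruby
require 'uri'

# Hypothetical helper (not part of Addressable): percent-decode a string
# as UTF-8, then apply Unicode normalization form NFKC.
def nfkc_of_percent_encoded(encoded)
  decoded = URI.decode_www_form_component(encoded) # UTF-8 by default
  decoded.unicode_normalize(:nfkc)                 # requires Ruby >= 2.2
end

original   = URI.decode_www_form_component('%EF%BD%9E')
normalized = nfkc_of_percent_encoded('%EF%BD%9E')

puts format('U+%04X', original.ord)   # U+FF5E (FULLWIDTH TILDE)
puts format('U+%04X', normalized.ord) # U+007E (plain ASCII tilde)
puts original == normalized           # false: two different characters
```

The two characters render almost identically, which is why the normalized URL looks right but points at a different page.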
When this gets normalized to '~', what downstream effect is that having for you and what are you trying to achieve with the normalization call? Sometimes path normalization is not appropriate, in which case you can use Addressable to normalize on a component-by-component basis.
Thanks for the clarification @sporkmonger -- maybe you can have a look at this @hokaccha
Thanks for the clarification!