sporkmonger / addressable

Addressable is an alternative implementation to the URI implementation that is part of Ruby's standard library. It is flexible, offers heuristic parsing, and additionally provides extensive support for IRIs and URI templates.

Incorrect normalization of character sequence "%EF%BD%9E"

jaimeiniesta opened this issue

Hello, we've found an issue in MetaInspector when trying to normalize a Japanese URL. Here's how to reproduce it on Ruby 2.1.2 and Addressable 2.3.6:

require 'open-uri'
require 'addressable/uri'

url            = 'http://ja.wikipedia.org/wiki/Template:%EF%BD%9E'
normalized_url = Addressable::URI.parse(url).normalize.to_s

puts open(url).status            #=> 200
puts open(normalized_url).status #=> 404

This URL is being normalized to "http://ja.wikipedia.org/wiki/Template:~", which looks like what it should be, but is not:

[Screenshot 2014-12-23 at 22 24 26]

This example screenshot is what you get when opening the URL in a browser (Chrome in my case):

Template:~ - Wikipedia

I'm not sure what it should be normalized to, but it looks like this character sequence should remain untouched, instead of being converted to "~".

Related:

#160

I believe that Wikipedia is the one that's wrong here. URI normalization requires Unicode normalization form NFKC, which tries to eliminate look-alike characters like this one as an explicit goal. I'm sure you're aware of the danger of phishing attacks and that's the primary concern involved in the choice of which normalization form to use.
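To make the NFKC point concrete, here is a minimal sketch using only Ruby's standard library, not Addressable's internals. It assumes Ruby 2.2 or later, since `String#unicode_normalize` is not available on the 2.1.2 mentioned in the report. It percent-decodes the sequence from the issue title and shows that compatibility normalization folds the fullwidth tilde into the ASCII one:

```ruby
require 'uri'

# Percent-decode the three escaped bytes into a UTF-8 string.
# The result is a single character: U+FF5E FULLWIDTH TILDE.
decoded = URI.decode_www_form_component('%EF%BD%9E')

# NFKC applies compatibility mappings, so the fullwidth form
# is replaced by its ASCII counterpart, U+007E TILDE.
normalized = decoded.unicode_normalize(:nfkc)

puts decoded.codepoints.map { |cp| format('U+%04X', cp) }.join(' ')    # U+FF5E
puts normalized.codepoints.map { |cp| format('U+%04X', cp) }.join(' ') # U+007E
```

The two characters look alike but are different code points, which is why the normalized URL points at a different Wikipedia page than the original.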

When this gets normalized to '~', what downstream effect is that having for you and what are you trying to achieve with the normalization call? Sometimes path normalization is not appropriate, in which case you can use Addressable to normalize on a component-by-component basis.
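As a rough illustration of the component-by-component idea, the sketch below normalizes only the scheme and host and leaves the path exactly as given. It uses Ruby's standard `uri` library as a stand-in rather than Addressable itself, and `normalize_except_path` is a hypothetical helper name, not part of either library:

```ruby
require 'uri'

# Hypothetical helper: lowercase the scheme and host, but leave the
# path (including its percent-encoded sequences) untouched.
def normalize_except_path(url)
  uri = URI.parse(url)
  uri.scheme = uri.scheme.downcase if uri.scheme
  uri.host   = uri.host.downcase   if uri.host
  uri.to_s
end

puts normalize_except_path('HTTP://JA.Wikipedia.ORG/wiki/Template:%EF%BD%9E')
```

With this approach the "%EF%BD%9E" sequence survives, so the request still reaches the page the original URL pointed at.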

Thanks for the clarification @sporkmonger -- maybe you can have a look at this @hokaccha

Thanks for the clarification!