sporkmonger / addressable

Addressable is an alternative implementation to the URI implementation that is part of Ruby's standard library. It is flexible, offers heuristic parsing, and additionally provides extensive support for IRIs and URI templates.

Incorrect normalization behaviour on character sequence '%e2%80%b3'

gh2k opened this issue

Specifically, this produces an incorrect result:

1.9.3-p392 :019 > u = Addressable::URI.parse('http://example.org/%e2%80%b3')
 => #<Addressable::URI:0xd005e8 URI:http://example.org/%e2%80%b3> 
1.9.3-p392 :020 > u.normalize!
 => #<Addressable::URI:0xd005e8 URI:http://example.org/%E2%80%B2%E2%80%B2> 

Note that the normalized URL no longer matches.

I think this is related to Addressable::IDNA.unicode_normalize_kc

Specifically:

1.9.3-p392 :013 > s = Addressable::URI.unencode('%e2%80%b3')
 => "″" 
1.9.3-p392 :014 > Addressable::IDNA.unicode_normalize_kc(s)
 => "′′" 

The output is now two characters (each U+2032, PRIME), where previously it was one (U+2033, DOUBLE PRIME).

This is not a bug. URIs, and particularly IRIs, use Unicode normalization form KC (NFKC) to eliminate visual ambiguities that could be exploited in phishing attacks. NFKC replaces U+2033 (DOUBLE PRIME) with its compatibility decomposition, two U+2032 (PRIME) characters, which is exactly what Addressable is giving you. If this behavior is undesirable for your use case, you can instead normalize on a component-by-component basis.
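To see the decomposition independently of Addressable, here is a minimal sketch using only Ruby's standard library. It assumes Ruby 2.2 or later for `String#unicode_normalize` (so it will not run on the 1.9.3 interpreter shown above), and the helper names `renormalize` and `codepoints_of` are made up for illustration. It does not reproduce Addressable's own normalization pipeline; it only shows what NFKC does to this one code point, and that NFC would leave it alone.

```ruby
require 'uri'

# Hypothetical helper: percent-decode a segment, apply the given Unicode
# normalization form, and percent-encode the result again.
def renormalize(escaped, form)
  decoded = URI.decode_www_form_component(escaped) # UTF-8 by default
  URI.encode_www_form_component(decoded.unicode_normalize(form))
end

# Hypothetical helper: render a string as a list of U+XXXX code points.
def codepoints_of(str)
  str.codepoints.map { |c| format('U+%04X', c) }.join(' ')
end

double_prime = URI.decode_www_form_component('%e2%80%b3')

puts codepoints_of(double_prime)                          # U+2033
puts codepoints_of(double_prime.unicode_normalize(:nfkc)) # U+2032 U+2032
puts renormalize('%e2%80%b3', :nfkc)                      # %E2%80%B2%E2%80%B2
puts renormalize('%e2%80%b3', :nfc)                       # %E2%80%B3
```

The NFKC result matches the path in the normalized URI from the report, while NFC (canonical composition only) keeps the single DOUBLE PRIME character.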