Support for localized area names

martinvonwittich opened this issue 5 years ago

Currently, when using Number::Phone to determine the area name of a German phone number, it will return the English name instead of the localized name of the area. E.g. +49 89 would be "München" in German, but Number::Phone will return "Munich" instead, regardless of the current locale:

host ~ # export LC_ALL=de_DE.UTF-8
host ~ # perl -MNumber::Phone -e 'my $n = Number::Phone->new("+49891234567"); print $n->areaname, "\n"'
Munich

This is due to the fact that Number::Phone's build process always reads the area name out of libphonenumber/resources/geocoding/en/<areacode>.txt (e.g. libphonenumber/resources/geocoding/en/49.txt) and ignores the different locales (e.g. libphonenumber/resources/geocoding/de/49.txt):

host ~/src/perl-modules-Number-Phone (master) # grep '^4989|' libphonenumber/resources/geocoding/en/49.txt 
4989|Munich
host ~/src/perl-modules-Number-Phone (master) # grep '^4989|' libphonenumber/resources/geocoding/de/49.txt
4989|München
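
For illustration, a lookup against files in this `prefix|name` format can be sketched as a longest-prefix match. This is only a sketch under assumptions: the helper names parse_geocoding and lookup_area are made up here, the longest-prefix strategy and the skipping of '#' comment lines are guesses about how the data is meant to be used, and none of it is taken from Number::Phone's build script. The sample data is just the two lines shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # this source contains a literal "ü"

# Parse "prefix|name" lines into a hash reference (hypothetical helper).
sub parse_geocoding {
    my ($text) = @_;
    my %names;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:#|$)/;    # assumption: '#' comments, blank lines
        my ($prefix, $name) = split /\|/, $line, 2;
        $names{$prefix} = $name if defined $name;
    }
    return \%names;
}

# Return the name stored for the longest prefix of $digits, or nothing
# if no prefix matches (hypothetical helper).
sub lookup_area {
    my ($names, $digits) = @_;
    for (my $len = length($digits); $len > 0; $len--) {
        my $prefix = substr($digits, 0, $len);
        return $names->{$prefix} if exists $names->{$prefix};
    }
    return;
}

my $en = parse_geocoding("4989|Munich\n");
my $de = parse_geocoding("4989|München\n");

binmode(STDOUT, ':encoding(UTF-8)');
print lookup_area($en, '49891234567'), "\n";
print lookup_area($de, '49891234567'), "\n";
```

The same number resolves to a different name depending only on which language's file was parsed, which is the whole of the problem described here.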

As we use Number::Phone in an Asterisk AGI script to put the area name into our (German) agents' phone displays, we wanted to have the German string instead. Back in 2016 when we first encountered this problem, we just dirtily hacked it directly into build-data.stubs:

diff --git a/build-data.stubs b/build-data.stubs
index e25cb8d..81a5cac 100755
--- a/build-data.stubs
+++ b/build-data.stubs
@@ -76,7 +76,12 @@ TERRITORY: foreach my $territory (@territories) {
       }
   }
   print $module_fh 'my '.Data::Dumper->new([$validators], [qw(validators)])->Dump();
-  my $codesfile = "libphonenumber/resources/geocoding/en/$IDD_country_code.txt";
+  my $codesfile;
+  if (-e "libphonenumber/resources/geocoding/de/$IDD_country_code.txt") {
+         $codesfile = "libphonenumber/resources/geocoding/de/$IDD_country_code.txt";
+  } else {
+         $codesfile = "libphonenumber/resources/geocoding/en/$IDD_country_code.txt";
+  }
   if($IDD_country_code == 1) {
     print $module_fh "use Number::Phone::NANP::Data;sub areaname { Number::Phone::NANP::Data::areaname('1'.shift()->{number}); }\n";
   } elsif(-e $codesfile) {

We then copied the resulting Number/Phone/StubCountry/DE.pm to /etc/perl, and so far this hack works pretty well. Unfortunately we failed to report this issue back then, so I'm doing that now :)

David Cantrell · Answer 1 · Sat Jun 15 2019 00:08:14 GMT+0800 (China Standard Time)

Damn, someone caught my Anglo-centric laziness :-)

This is a very good point you raise, although it's a bit more complicated than just picking the right language for each country - Belgium, for example, has datasets available in Dutch, French, and German, and for country code 7 (Russia and Kazakhstan) data in Russian appears to only be available for Kazakh numbers(!).

My instinct is to default to English, but to use another language if the user asks for it and data is available. I'll ask on the mailing list for people to comment here.

What's the best way of telling the library which language to use? Just go with the various locale environment variables? IIRC the rule is something like:

if $LANG or $LC_ALL == C - use English; otherwise ...
use the first of the colon-separated list of languages in $LANGUAGE that you can; otherwise ...
use the first two characters in $LC_ALL; otherwise ...
use English
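
A minimal sketch of the rule as recalled above, for illustration only. The function names preferred_languages and pick_language, the $has_data callback, and the way a language code is cut out of each $LANGUAGE entry are all assumptions made here; this follows the four steps as written and does not claim to match gettext or any particular library.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Candidate languages in order of preference. $env stands in for \%ENV
# so that the function can be exercised without touching the real
# environment (hypothetical helper).
sub preferred_languages {
    my ($env) = @_;
    my $lang   = $env->{LANG}   // '';
    my $lc_all = $env->{LC_ALL} // '';

    # Step 1: the C locale means English.
    return ('en') if $lang eq 'C' || $lc_all eq 'C';

    my @candidates;

    # Step 2: entries of the colon-separated $LANGUAGE, e.g. "sv:de:en".
    # Assumption: keep only the leading 2-3 letter language code.
    for my $entry (split /:/, ($env->{LANGUAGE} // '')) {
        push @candidates, lc($1) if $entry =~ /^([A-Za-z]{2,3})/;
    }

    # Step 3: the first two characters of $LC_ALL, e.g. "de" from "de_DE.UTF-8".
    push @candidates, lc(substr($lc_all, 0, 2)) if length($lc_all) >= 2;

    # Step 4: English as the last resort.
    push @candidates, 'en';
    return @candidates;
}

# Pick the first candidate for which data exists. $has_data is a
# callback that returns true if the language has data (hypothetical).
sub pick_language {
    my ($env, $has_data) = @_;
    for my $lang (preferred_languages($env)) {
        return $lang if $has_data->($lang);
    }
    return 'en';
}

# Hypothetical set of languages with data for some country code.
my %available = map { $_ => 1 } qw(de en);
print pick_language({ LANGUAGE => 'sv:de:en' }, sub { $available{ $_[0] } }), "\n";
```

With the made-up %available set, this prints de: sv is skipped because no data is listed for it. Passing \%ENV instead of a literal hash would use the real environment.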

martinvonwittich · Answer 2 · Sat Jun 15 2019 02:41:20 GMT+0800 (China Standard Time)

> My instinct is to default to English, but to use another language if the user asks for it and data is available. I'll ask on the mailing list for people to comment here.

Maybe the cleanest solution would be to introduce a new method, e.g. areaname_localized, and to keep areaname as is? I'm afraid that existing code might rely on areaname to return the unlocalized name, and I don't want to break anything :/

> What's the best way of telling the library which language to use? Just go with the various locale environment variables? IIRC the rule is something like:

I18N::LangTags::Detect seems to be a simple solution for this, and it's even a core module :)

I've written a short proof of concept:

martin@dogmeat ~ % cat test2.pl 
#!/usr/bin/perl -CSDAL
use warnings;
use strict;
use I18N::LangTags;
use I18N::LangTags::Detect;

sub get_language($)
{
  my $cc = shift;

  my @languages =
      I18N::LangTags::implicate_supers(
        I18N::LangTags::Detect::detect());

  for my $language (@languages)
  {
    print "looking at language $language\n";

    if ($language =~ /^(\w{2,3})$/)
    {
      my $lang = lc $1;

      if (-f "libphonenumber/resources/geocoding/$lang/$cc.txt")
      {
        return $lang;
      }
    }
  }

  return "en";
}

print get_language "49", "\n";

Seems to do the right thing:

martin@dogmeat ~ % ./test2.pl
looking at language de-de
looking at language de
de
martin@dogmeat ~ % LC_ALL=C ./test2.pl
looking at language de-de
looking at language de
de
martin@dogmeat ~ % LANGUAGE= LC_ALL=C ./test2.pl
en
martin@dogmeat ~ % LANGUAGE= LANG=en_US.UTF-8 ./test2.pl
looking at language en-us
looking at language en
en
martin@dogmeat ~ % LANGUAGE=sv:de:en ./test2.pl
looking at language sv
sv

(It might be a bit surprising that LC_ALL=C ./test2.pl doesn't return en, but that seems to be intentional - "GNU gettext gives preference to LANGUAGE over LC_ALL and LANG for the purpose of message handling".)

Matching against /^(\w{2,3})$/ is most probably not even necessary; I believe that detect() will only return valid language tags, but inserting that into a path without checking it somehow makes me paranoid.

David Cantrell · Answer 3 · Sat Jun 15 2019 05:10:37 GMT+0800 (China Standard Time)

Ooh, that looks useful, thanks!

David Cantrell · Answer 4 · Thu Jul 11 2019 07:52:40 GMT+0800 (China Standard Time)

@martinvonwittich can you take a look at that pull request above? Most importantly, does the documentation in lib/Number/Phone.pm make sense and can you think of any more edge-cases I need to write tests for? And if you're able to, can you test that branch on your systems and see if it does what you need?

martinvonwittich · Answer 5 · Thu Jul 11 2019 23:55:05 GMT+0800 (China Standard Time)

Looks great, and yes, it does exactly what we need! Thank you for your work on this!

The only improvement idea I can come up with is to maybe add a note to the documentation that localized areanames may contain Unicode codepoints, and that it's necessary to enable proper Unicode support (e.g. with perl -CSDAL). Forgetting that may lead to confusion because Perl of course won't automatically convert Unicode codepoints to UTF-8 when Unicode support isn't enabled:

martin ~ # perl -MNumber::Phone -e 'my $n = Number::Phone->new("+49891234567"); print $n->areaname, "\n"'
Mnchen
martin ~ # perl -CSDAL -MNumber::Phone -e 'my $n = Number::Phone->new("+49891234567"); print $n->areaname, "\n"'
München

I bet that's a common trap for many people because Unicode mostly seems to work fine in Perl as long as all your input handles happen to give you valid UTF-8 bytes and all your output handles happen to expect UTF-8 bytes :)
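
As an alternative to the -CSDAL switch, the encoding can also be declared inside the script. The following is a small sketch using only core Perl (binmode and the Encode module); the string literal merely stands in for a localized area name and does not come from Number::Phone.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                    # this source file itself is UTF-8
use Encode qw(encode);

# Stand-in for a localized area name: a string of characters, not bytes.
my $areaname = "München";

# Characters versus bytes: "ü" is one character but two bytes in UTF-8.
my $bytes = encode('UTF-8', $areaname);
printf "%d characters, %d bytes\n", length($areaname), length($bytes);

# Declare the encoding on the output handle, then print characters directly.
binmode(STDOUT, ':encoding(UTF-8)');
print $areaname, "\n";
```

Without the binmode call (or -CS on the command line), Perl writes the single byte 0xFC for "ü", which a UTF-8 terminal cannot display.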

David Cantrell · Answer 6 · Fri Jul 12 2019 17:57:22 GMT+0800 (China Standard Time)

Thanks. Users already have to deal with the UTF-8 shenanigans, as even in English some of the data includes non-ASCII characters, e.g. https://metacpan.org/source/DCANTRELL/Number-Phone-3.5001/lib/Number/Phone/StubCountry/DE.pm#L721

David Cantrell · Answer 7 · Sat Jul 13 2019 04:47:09 GMT+0800 (China Standard Time)

Merged, and will be in the next quarterly release in September.

David Cantrell · Answer 8 · Fri Sep 13 2019 18:37:48 GMT+0800 (China Standard Time)

FYI version 3.6000 is now on the CPAN: https://metacpan.org/pod/Number::Phone#areaname