Kyubyong / name2nat

name2nat: a Python package for nationality prediction from a name

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

image image image

name2nat: a Python package for nationality prediction from a name

name2nat is a Python package that predicts the nationality of any name written in Roman letters. For example, it returns the correct output Korean for my name `Kyubyong Park'. Needless to say, it is not possible to guess somebody's nationality 100% right from their name. After all, nationality can change, you know. However, it is also true that there is a tendency between names and nationality. So it turns out statistical classifiers for this task works to some extent. Details are explained below.

Disclaimer

I am aware that this topic may be viewed from a political perspective. That is absolutely AGAINST my motivation.

NaNa Dataset

Construction

I constructed a new dataset for this project because I failed to find any available dataset that is big and comprehensive enough.

  • STEP 1. Downloaded and extracted the 20200601 English wiki dump (enwiki-20200601-pages-articles.xml).
  • STEP 2. Iterated all pages and collected the title and the nationality. I regarded the title as a person if the Category section at the bottom of each page included ... births (green rectangle), and identified their nationality from the most frequent nationality word in the section (red rectangles).

* STEP 3. Randomly split the data into train/dev/test in the ratio of 8:1:1 within each nationality group.

Stats

Nationality # Samples Train Dev Test
Total 1,112,902 890,248 111,286 111,368
Afghan 973 778 97 98
Albanian 2,742 2,193 274 275
Algerian 1,991 1,592 199 200
American 302,215 241,772 30,221 30,222
Andorran 236 188 24 24
Angolan 630 504 63 63
Argentine 11,158 8,926 1,116 1,116
Armenian 2,001 1,600 200 201
Aruban 117 93 12 12
Australian 50,670 40,536 5,067 5,067
Austrian 11,490 9,192 1,149 1,149
Azerbaijani 1,664 1,331 166 167
Bahamian 292 233 29 30
Bahraini 297 237 30 30
Bangladeshi 2,045 1,636 204 205
Barbadian 466 372 47 47
Basque 1,202 961 120 121
Belarusian 2,923 2,338 292 293
Belgian 9,884 7,907 988 989
Belizean 186 148 19 19
Beninese 249 199 25 25
Bermudian 338 270 34 34
Bhutanese 180 144 18 18
Bolivian 822 657 82 83
Bosniak 102 81 10 11
Botswana 315 252 31 32
Brazilian 14,043 11,234 1,404 1,405
Breton 148 118 15 15
British 57,403 45,922 5,740 5,741
Bruneian 144 115 14 15
Bulgarian 4,908 3,926 491 491
Burkinabé 362 289 36 37
Burmese 1,180 944 118 118
Burundian 175 140 17 18
Cambodian 451 360 45 46
Cameroonian 1,286 1,028 129 129
Canadian 42,691 34,152 4,269 4,270
Catalan 2,147 1,717 215 215
Chadian 174 139 17 18
Chilean 3,548 2,838 355 355
Chinese 11,868 9,494 1,187 1,187
Colombian 3,276 2,620 328 328
Comorian 68 54 7 7
Congolese 44 35 4 5
Cuban 2,423 1,938 242 243
Cypriot 1,271 1,016 127 128
Czech 9,056 7,244 906 906
Dane 41 32 4 5
Djiboutian 68 54 7 7
Dominican 1,976 1,580 198 198
Dutch 18,645 14,916 1,864 1,865
Ecuadorian 1,093 874 109 110
Egyptian 3,471 2,776 347 348
Emirati 777 621 78 78
English 96,449 77,159 9,645 9,645
Equatoguinean 242 193 24 25
Eritrean 167 133 17 17
Estonian 2,536 2,028 254 254
Ethiopian 917 733 92 92
Faroese 355 284 35 36
Filipino 4,910 3,928 491 491
Finn 85 68 8 9
French 51,052 40,841 5,105 5,106
Gabonese 226 180 23 23
Gambian 276 220 28 28
Georgian 328 262 33 33
German 52,986 42,388 5,299 5,299
Ghanaian 2,546 2,036 255 255
Gibraltarian 123 98 12 13
Greek 7,469 5,975 747 747
Grenadian 174 139 17 18
Guatemalan 704 563 70 71
Guinean 731 584 73 74
Guyanese 448 358 45 45
Haitian 702 561 70 71
Honduran 626 500 63 63
Hungarian 9,026 7,220 903 903
I-Kiribati 51 40 5 6
Indian 28,365 22,692 2,836 2,837
Indonesian 3,525 2,820 352 353
Iranian 6,263 5,010 626 627
Iraqi 1,566 1,252 157 157
Irish 14,806 11,844 1,481 1,481
Israeli 6,437 5,149 644 644
Italian 36,671 29,336 3,667 3,668
Jamaican 1,778 1,422 178 178
Japanese 26,520 21,216 2,652 2,652
Jordanian 613 490 61 62
Kazakh 31 24 3 4
Kenyan 2,012 1,609 201 202
Korean 9,871 7,896 987 988
Kuwaiti 496 396 50 50
Kyrgyz 20 16 2 2
Lao 33 26 3 4
Latvian 2,117 1,693 212 212
Lebanese 1,558 1,246 156 156
Liberian 368 294 37 37
Libyan 339 271 34 34
Lithuanian 2,474 1,979 247 248
Macedonian 1,374 1,099 137 138
Malagasy 290 232 29 29
Malawian 274 219 27 28
Malaysian 3,228 2,582 323 323
Maldivian 191 152 19 20
Malian 482 385 48 49
Maltese 829 663 83 83
Manx 188 150 19 19
Marshallese 40 32 4 4
Mauritanian 120 96 12 12
Mauritian 329 263 33 33
Mexican 10,810 8,648 1,081 1,081
Moldovan 1,250 1,000 125 125
Mongolian 631 504 63 64
Montenegrin 1,194 955 119 120
Moroccan 1,822 1,457 182 183
Mozambican 263 210 26 27
Namibian 736 588 74 74
Nauruan 40 32 4 4
Nepalese 967 773 97 97
Nicaraguan 357 285 36 36
Nigerian 5,075 4,060 507 508
Nigerien 179 143 18 18
Norwegian 16,891 13,512 1,689 1,690
Omani 247 197 25 25
Pakistani 4,703 3,762 470 471
Palauan 44 35 4 5
Palestinian 660 528 66 66
Panamanian 593 474 59 60
Paraguayan 1,266 1,012 127 127
Peruvian 1,902 1,521 190 191
Portuguese 5,918 4,734 592 592
Qatari 685 548 68 69
Romanian 8,189 6,551 819 819
Russian 26,593 21,274 2,659 2,660
Rwandan 337 269 34 34
Salvadoran 634 507 63 64
Sammarinese 248 198 25 25
Samoan 746 596 75 75
Saudi 1,871 1,496 187 188
Senegalese 1,029 823 103 103
Serb 56 44 6 6
Singaporean 1,646 1,316 165 165
Slovak 3,584 2,867 358 359
Slovene 111 88 11 12
Somali 145 116 14 15
Sotho 62 49 6 7
Sudanese 436 348 44 44
Surinamese 250 200 25 25
Swazi 143 114 14 15
Syriac 98 78 10 10
Syrian 1,309 1,047 131 131
Taiwanese 2,433 1,946 243 244
Tajik 77 61 8 8
Tamil 1,749 1,399 175 175
Tanzanian 784 627 78 79
Thai 3,434 2,747 343 344
Tibetan 332 265 33 34
Togolese 264 211 26 27
Tongan 570 456 57 57
Tunisian 1,340 1,072 134 134
Turk 99 79 10 10
Tuvaluan 83 66 8 9
Ugandan 1,316 1,052 132 132
Ukrainian 7,748 6,198 775 775
Uruguayan 2,834 2,267 283 284
Uzbek 78 62 8 8
Vanuatuan 146 116 15 15
Venezuelan 2,422 1,937 242 243
Vietnamese 1,572 1,257 157 158
Vincentian 10 8 1 1
Welsh 6,588 5,270 659 659
Yemeni 403 322 40 41
Zambian 638 510 64 64

Downloadable Link

  • You can download the dataset here.

name2nat

Installation

pip install name2nat

Usage

>>> from name2nat import Name2nat

>>> my_nanat = Name2nat()

>>> names = ["Donald Trump", # American
         "Moon Jae-in", # Korean
         "Shinzo Abe", # Japanese
         "Xi Jinping", # Chinese
         "Joko Widodo", # Indonesian
         "Angela Merkel", # German
         "Emmanuel Macron", # French
         "Kyubyong Park", # Korean
         "Yamamoto Yu", # Japanese
         "Jing Xu"] # Chinese
>>> result = my_nanat(names, top_n=3)
>>> print(result)
# (name, [(nationality, prob), ...])
# Note that prob of 1.0 indicates the name exists
# in Wikipedia.
[
('Donald Trump', [('American', 1.0)])
('Moon Jae-in', [('Korean', 1.0)])
('Shinzo Abe', [('Japanese', 1.0)])
('Xi Jinping', [('Chinese', 1.0)])
('Joko Widodo', [('Indonesian', 1.0)])
('Angela Merkel', [('German', 1.0)])
('Emmanuel Macron', [('French', 1.0)])
('Kyubyong Park', [('Korean', 0.9985014200210571), ('American', 0.000289416522718966), ('Bhutanese', 0.00025851925602182746)])
('Yamamoto Yu', [('Japanese', 0.7050493359565735), ('Taiwanese', 0.12779785692691803), ('Chinese', 0.04263153299689293)])
('Jing Xu', [('Chinese', 0.8626819252967834), ('Taiwanese', 0.09901007264852524), ('American', 0.022995812818408012)])
]

Training

I use a powerful NLP library Flair to train a text classifier model. A bidirectional GRU layer is employed.

python train.py

Evaluation

python predict.py;
python eval.py --gt nana/test.tgt --pred test.pred

Results

K Precision@K
1 61310/111368=55.1
2 77480/111368=69.6
3 86703/111368=77.9
4 92491/111368=83.0
5 96697/111368=86.8

Applications

Let's predict the nationalities of the first authors of the recent machine learning conferences.

  • Check conferences.py and conferences/lrec2020.md
  • Contributions (PRs) are welcome!

References

If you use this code for research, please cite:

@misc{park2018name2nat,
  author = {Park, Kyubyong},
  title = {name2nat: a Python package for nationality prediction from a name},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Kyubyong/name2nat}}
}

About

name2nat: a Python package for nationality prediction from a name

License:Apache License 2.0


Languages

Language:Python 100.0%