astickbanerjee / WikiTextCorpusDownloader

A Language Independent Wikipedia Text Corpus Downloader

Cloning

git clone https://github.com/Rajan-sust/WikiTextCorpusDownloader

Requirements

  • Python 3
  • Gensim
  • TensorFlow 2.0.0

Install the packages with pip:

pip install gensim tensorflow==2.0.0
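
To confirm that the requirements above are in place before running the downloader, a small check along these lines may help. It is only a sketch and not part of the repository: the helper names are hypothetical, and it assumes each package exposes a `__version__` attribute, which gensim and TensorFlow both do.

```python
import importlib

# Package names and the pinned TensorFlow version come from the Requirements list.
# None means "any installed version is accepted".
REQUIRED = {"gensim": None, "tensorflow": "2.0.0"}


def installed_version(package):
    """Return the package's version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(package)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")


def check_requirements(required=None):
    """Return a list of readable problems; an empty list means nothing is wrong."""
    required = REQUIRED if required is None else required
    problems = []
    for package, wanted in required.items():
        found = installed_version(package)
        if found is None:
            problems.append(f"{package} is not installed")
        elif wanted is not None and found != wanted:
            problems.append(f"{package} is {found}, expected {wanted}")
    return problems


if __name__ == "__main__":
    for line in check_requirements() or ["all requirements satisfied"]:
        print(line)
```

Run it inside the environment in which you plan to run main.py, so that it reports on the same set of installed packages.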

Installation with virtual environment

If virtualenv is not installed, install it with this command.

pip install virtualenv

Change directory into the cloned folder.

cd WikiTextCorpusDownloader

Create a virtual environment inside WikiTextCorpusDownloader, activate it, and install the requirements.

virtualenv --python python3 venv
source venv/bin/activate
pip install gensim tensorflow==2.0.0

Run

Run main.py with the --language flag.

python3 main.py --language=bn

For convenience, I've set the language flag to bn (Bengali). You can choose your own language code from the Wiki Code column of the table in the Language Code section below.
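
For readers curious how a flag like this is typically handled, here is a minimal argparse sketch. It is not taken from main.py, whose internals are not shown here: the function name `parse_args` is hypothetical, and the default of bn is an assumption that simply mirrors the example command.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; only the --language flag comes from this README."""
    parser = argparse.ArgumentParser(
        description="Download a Wikipedia text corpus for one language."
    )
    parser.add_argument(
        "--language",
        default="bn",  # assumption: mirrors the example command, not a confirmed default
        help="Wiki code from the Language Code table, e.g. bn, en, zh-min-nan",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"Selected language code: {args.language}")
```

With this sketch, both `--language=bn` and `--language bn` are accepted, since argparse treats the two spellings the same way.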

Output

The script creates a folder named <language>_corpus (for example, bn_corpus for --language=bn) and stores all the .txt files in it.

WikiTextCorpusDownloader/
├── bn_corpus
├── main.py
└── README.md
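
The naming rule above can be expressed in a few lines if you want to locate the output from another script. This is a sketch under assumptions: the helper names are hypothetical, and it assumes the .txt files sit directly inside the corpus folder rather than in subfolders.

```python
from pathlib import Path


def corpus_dir(language, root="."):
    """Return the expected output folder, e.g. ./bn_corpus for language 'bn'."""
    return Path(root) / f"{language}_corpus"


def list_corpus_files(language, root="."):
    """Return the sorted .txt files in the corpus folder (empty if it is missing)."""
    folder = corpus_dir(language, root)
    if not folder.is_dir():
        return []
    # Assumption: text files are stored directly in the folder, not nested.
    return sorted(folder.glob("*.txt"))


if __name__ == "__main__":
    files = list_corpus_files("bn", root="WikiTextCorpusDownloader")
    print(f"Found {len(files)} text file(s) in {corpus_dir('bn', 'WikiTextCorpusDownloader')}")
```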

Language Code

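Use the value in the Wiki Code column as the --language argument. The columns are, in order: Language, Language (local), Wiki Code, Articles, Total, Edits, Admins, Users, Active Users, Images, Depth.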
English English en 5,957,940 48,851,073 918,220,332 1,154 37,454,878 133,972 883,255 974.22
Cebuano Cebuano ceb 5,378,851 9,743,883 29,548,951 6 62,083 155 0 2
Swedish svenska sv 3,747,946 7,760,405 46,384,811 58 695,764 2,643 0 6.85
German Deutsch de 2,357,400 6,589,124 192,070,538 189 3,303,857 18,693 131,132 93.93
French français fr 2,149,085 10,392,280 163,357,571 163 3,587,639 18,924 59,117 231.27
Dutch Nederlands nl 1,982,009 4,129,436 54,682,204 38 1,027,038 3,968 20 15.54
Russian русский ru 1,575,424 6,022,540 102,626,944 89 2,617,129 11,036 220,720 135.78
Italian italiano it 1,560,426 6,410,902 108,157,717 108 1,886,789 8,344 140,406 163.01
Spanish español es 1,552,670 6,825,081 120,285,131 69 5,596,433 17,608 0 203.22
Polish polski pl 1,366,455 3,108,214 57,596,259 102 988,726 4,150 273 30.11
Waray Winaray war 1,263,782 2,877,774 6,204,632 3 42,855 65 42 3.52
Vietnamese Tiếng Việt vi 1,237,354 15,377,620 55,922,731 21 700,988 1,881 31,882 474.93
Japanese 日本語 ja 1,174,111 3,479,762 74,550,080 39 1,547,059 13,910 85,744 82.62
Chinese 中文 zh 1,078,756 5,951,763 56,304,809 84 2,830,664 8,185 52,288 193.04
Portuguese português pt 1,014,859 4,913,949 56,373,530 75 2,316,501 5,815 53,699 169.34
Arabic العربية ar 967,340 5,765,464 39,110,875 25 1,737,253 4,929 33,236 166.9
Ukrainian українська uk 943,639 2,879,394 26,284,559 45 475,906 2,725 101,925 38.41
Persian فارسی fa 696,669 4,462,246 27,395,768 31 864,161 4,998 59,214 179.37
Catalan català ca 626,244 1,542,343 22,119,752 21 336,650 1,475 11,791 30.69
Serbian српски / srpski sr 625,620 3,819,423 22,221,350 18 251,916 827 32,372 151.62
Norwegian (Bokmål) norsk no 521,417 1,421,715 19,827,300 45 475,362 1,415 4 41.58
Indonesian Bahasa Indonesia id 506,623 2,654,166 16,058,756 36 1,100,928 3,058 60,379 108.72
Korean 한국어 ko 473,032 2,169,511 25,033,560 24 589,501 2,089 15,073 148.41
Finnish suomi fi 471,135 1,261,220 18,438,308 35 431,385 1,720 57,852 41.11
Hungarian magyar hu 458,848 1,325,315 21,793,115 31 427,015 1,831 14,383 58.64
Serbo-Croatian srpskohrvatski / српскохрватски sh 450,177 4,619,978 40,902,823 11 140,404 182 10,051 759.59
Czech čeština cs 437,419 1,217,337 17,720,831 32 470,778 2,268 1 46.28
Romanian română ro 401,675 2,157,968 13,076,247 18 494,974 943 37,793 115.85
Basque euskara eu 341,653 687,367 7,114,515 12 104,864 340 0 10.6
Turkish Türkçe tr 335,060 1,673,038 21,023,548 23 1,046,927 687 32,739 200.38
Malay Bahasa Melayu ms 330,732 912,992 4,645,251 13 244,660 477 21,628 15.77
Esperanto Esperanto eo 269,886 596,903 6,746,374 19 163,925 335 6,222 16.59
Armenian հայերեն hy 259,379 860,902 6,582,511 11 90,446 713 7,881 41.12
Bulgarian български bg 256,834 570,130 9,627,555 27 259,132 697 590 25.13
Chechen нохчийн ce 254,134 517,141 3,766,403 3 21,088 46 324 7.8
Danish dansk da 254,048 839,870 10,073,437 24 365,020 882 0 63.78
Hebrew עברית he 252,321 1,068,468 26,505,066 37 568,066 2,683 60,060 259.54
Slovak slovenčina sk 231,418 511,373 6,905,258 9 179,326 543 0 19.76
Min Nan Bân-lâm-gú zh-min-nan 228,395 742,394 1,933,304 5 40,162 69 362 13.19
Kazakh қазақша kk 225,444 541,106 2,716,729 18 91,008 224 9,686 9.84
Minangkabau Minangkabau min 223,245 324,498 1,956,620 5 9,908 156 111 1.24
Croatian hrvatski hr 210,897 525,261 5,332,860 16 226,144 515 17,752 22.56
Estonian eesti et 202,340 487,832 5,466,187 35 132,266 609 1,126 22.31
Lithuanian lietuvių lt 196,979 467,646 5,644,565 13 136,856 337 22,643 22.79
Belarusian беларуская be 178,981 557,198 3,463,186 10 89,914 249 2,949 27.75
Greek Ελληνικά el 169,514 506,536 7,830,330 21 286,086 985 17,444 61.1
Slovenian slovenščina sl 165,756 397,615 5,200,172 22 183,901 347 8,441 25.59
Galician galego gl 159,406 427,419 5,273,688 9 102,054 258 9,311 34.88
Southern Azerbaijani تۆرکجه azb 152,184 443,991 1,144,865 4 20,224 94 0 9.48
Azerbaijani azərbaycanca az 150,724 392,645 4,894,365 16 182,739 622 22,727 32.11
Simple English Simple English simple 150,633 521,223 6,676,120 17 896,149 793 36 77.53
Norwegian (Nynorsk) norsk nynorsk nn 150,452 347,461 3,164,351 20 99,889 170 16 15.62
Urdu اردو ur 149,403 834,250 3,977,948 11 102,089 279 10,066 100.19
Hindi हिन्दी hi 133,789 928,717 4,350,046 9 497,185 1,540 4,134 165.36
Thai ไทย th 133,694 814,819 8,537,626 14 367,411 1,239 34,798 271.96
Uzbek oʻzbekcha/ўзбекча uz 132,476 628,756 2,060,739 10 47,634 167 1,095 46
Georgian ქართული ka 132,358 384,683 3,735,156 4 114,786 237 14,270 35.29
Latin Latina la 131,382 258,556 3,473,631 18 124,476 142 0 12.59
Tamil தமிழ் ta 123,911 380,432 2,816,895 41 162,577 411 7,695 31.73
Volapük Volapük vo 123,498 253,831 3,204,335 3 26,888 35 0 14.06
Welsh Cymraeg cy 106,299 233,311 9,112,539 13 57,619 130 17,632 55.76
Macedonian македонски mk 102,616 460,583 3,877,303 17 84,335 245 7,891 102.44
Asturian asturianu ast 99,688 155,147 2,550,012 9 61,797 138 0 5.09
Tajik тоҷикӣ tg 98,750 184,947 1,044,678 6 26,940 54 419 4.3
Latvian latviešu lv 98,593 399,578 3,118,951 12 86,632 241 23,819 72.75
Malagasy Malagasy mg 92,136 236,048 975,070 3 21,353 32 3 10.08
Tatar татарча/tatarça tt 87,531 208,943 2,395,833 3 31,506 62 4,632 22.06
Occitan occitan oc 86,441 145,139 2,133,954 5 39,327 88 900 6.78
Afrikaans Afrikaans af 86,239 262,521 2,055,641 14 115,856 192 9,052 32.72
Bosnian bosanski bs 80,922 372,472 3,043,634 9 118,345 191 24,272 106.07
Kirghiz Кыргызча ky 79,736 105,425 351,575 3 23,821 58 2,679 0.35
Albanian shqip sq 77,895 229,643 2,043,751 14 117,441 241 11,389 33.78
Tagalog Tagalog tl 75,885 234,022 1,719,212 12 102,537 112 1,908 31.9
Bengali বাংলা bn 75,469 708,549 3,755,854 12 251,740 1,218 6,773 373.01
Cantonese 粵語 zh-yue 74,635 190,848 1,337,193 11 187,259 257 1,655 16.99
Newar नेपाल भाषा new 72,231 196,454 845,210 2 20,809 16 0 12.73
Telugu తెలుగు te 71,379 260,934 2,751,592 15 90,684 178 12,536 74.37
Belarusian (Taraškievica) беларуская (тарашкевіца)‎ be-tarask 68,177 188,834 2,087,604 4 61,316 118 1,460 34.63
Breton brezhoneg br 67,284 131,423 1,847,714 5 56,027 78 5,401 12.78
Malayalam മലയാളം ml 66,117 410,068 3,180,273 21 128,969 340 6,105 209.88
Piedmontese Piemontèis pms 64,526 99,074 844,567 10 20,952 28 2,078 2.44
Sundanese Sunda su 59,574 92,453 592,152 8 22,812 70 547 1.95
Low Saxon Plattdüütsch nds 57,997 118,503 892,275 4 37,423 50 0 8.2
Luxembourgish Lëtzebuergesch lb 57,318 123,398 2,213,050 5 45,103 80 2,596 23.84
Javanese Jawa jv 56,707 152,687 1,494,771 8 41,489 156 5,446 28.05
Haitian Kreyòl ayisyen ht 56,693 69,416 732,795 1 21,946 28 0 0.53
Marathi मराठी mr 55,254 227,017 1,707,660 9 112,754 234 19,186 72.69
Scots Scots sco 55,248 200,374 717,992 6 63,071 104 1,601 24.72
Swahili Kiswahili sw 54,308 112,747 1,086,727 11 38,610 111 2,226 11.16
Silesian ślůnski szl 51,653 63,231 330,705 3 16,576 38 0 0.26
Irish Gaeilge ga 51,651 84,098 942,895 8 40,786 115 1,169 4.42
Bashkir башҡортса ba 49,951 141,407 913,475 9 26,131 73 1,364 21.66
Western Punjabi پنجابی pnb 49,162 78,410 534,596 3 24,127 35 238 2.41
Icelandic íslenska is 48,474 128,780 1,648,612 28 69,863 149 3,060 35.14
Burmese မြန်မာဘာသာ my 45,053 111,955 488,746 5 69,745 358 2,857 9.63
West Frisian Frysk fy 42,910 130,172 975,047 9 34,735 71 6,918 30.98
Chuvash Чӑвашла cv 42,208 75,432 660,461 3 25,243 45 536 5.43
Lombard lumbaart lmo 39,059 100,944 990,933 7 27,443 34 4,437 24.64
Aragonese aragonés an 36,117 111,538 1,640,703 6 54,407 69 1,282 64.15
Nepali नेपाली ne 33,302 92,463 750,820 8 44,404 105 889 25.63
Eastern Punjabi ਪੰਜਾਬੀ pa 32,047 112,479 491,976 9 30,182 121 1,399 27.55
Yoruba Yorùbá yo 31,950 54,993 520,314 3 19,343 52 172 4.92
Bavarian Boarisch bar 30,211 108,848 728,671 7 49,141 73 1,342 45.36
Ido Ido io 28,866 43,320 951,170 5 27,385 39 1 5.51
Gujarati ગુજરાતી gu 28,648 95,171 678,601 4 52,900 101 0 38.45
Alemannic Alemannisch als 26,618 62,477 928,718 9 72,741 81 496 26.98
Kurdish (Kurmanji) kurdî ku 26,309 64,771 738,757 4 38,855 49 580 24.38
Sicilian sicilianu scn 26,060 55,242 726,144 8 33,518 33 1,408 16.48
Kannada ಕನ್ನಡ kn 25,162 109,500 944,679 5 58,742 214 3,459 96.92
Bishnupriya Manipuri বিষ্ণুপ্রিয়া মণিপুরী bpy 25,081 58,450 782,141 2 19,830 28 49 23.69
Kurdish (Sorani) کوردی ckb 24,830 130,597 643,744 7 37,043 89 955 89.44
Wu 吴语 wuu 22,215 34,270 255,930 4 61,781 37 238 2.2
Interlingua interlingua ia 21,906 34,892 615,006 7 35,707 38 4 6.19
Egyptian Arabic مصرى arz 21,756 185,491 904,598 6 116,951 109 1,457 276.22
Quechua Runa Simi qu 21,490 53,469 637,975 3 22,165 36 0 26.42
Mongolian монгол mn 18,932 74,632 593,077 5 60,907 99 1,450 68.79
Samogitian žemaitėška bat-smg 16,804 28,335 348,719 5 20,246 29 109 5.8
Sinhalese සිංහල si 15,409 71,115 441,801 3 42,687 79 4,557 81.19
Walloon walon wa 15,374 39,561 350,019 2 17,652 16 2,143 21.9
Min Dong Mìng-dĕ̤ng-ngṳ̄ cdo 15,289 30,378 84,384 4 15,186 16 4 2.71
Odia ଓଡ଼ିଆ or 15,159 64,809 375,691 6 23,022 72 126 62.19
Yiddish ייִדיש yi 14,881 41,430 545,603 3 35,299 45 1,062 41.92
Scottish Gaelic Gàidhlig gd 14,834 30,795 550,033 4 21,011 32 347 20.68
Amharic አማርኛ am 14,786 45,275 357,401 3 29,987 30 1,745 33.56
Neapolitan Napulitano nap 14,554 23,321 659,672 3 22,095 26 286 10.26
Buginese ᨅᨔ ᨕᨘᨁᨗ bug 14,125 18,697 192,349 1 10,052 11 0 1.08
Ilocano Ilokano ilo 13,575 51,934 343,096 2 13,381 18 0 52.75
Maithili मैथिली mai 13,458 33,981 199,912 5 7,896 35 104 13.68
Upper Sorbian hornjoserbsce hsb 13,437 34,272 366,147 4 19,107 27 138 25.69
Banyumasan Basa Banyumasan map-bms 13,327 28,928 208,316 1 11,979 19 483 9.87
Faroese føroyskt fo 13,210 39,021 356,283 4 21,770 33 0 34.86
Mingrelian მარგალური xmf 13,191 28,877 139,056 3 12,924 22 0 6.81
Mazandarani مازِرونی mzn 13,122 28,727 149,464 4 21,153 27 266 7.36
Limburgish Limburgs li 12,628 62,342 437,516 7 21,026 37 624 108.77
Venetian vèneto vec 12,435 36,766 604,318 3 24,499 37 723 62.93
Sindhi سنڌي sd 12,284 35,964 156,075 3 11,011 36 66 16.13
Emilian-Romagnol emiliàn e rumagnòl eml 12,152 31,995 132,489 3 17,508 21 2,225 11.04
Sakha саха тыла sah 12,059 41,823 347,686 4 17,509 31 1,766 50.64
Zazaki Zazaki diq 11,767 30,559 397,162 6 18,829 67 215 33.15
Ossetian Ирон os 11,715 44,946 474,974 3 18,801 22 179 85.03
Sanskrit संस्कृतम् sa 11,418 59,339 446,804 5 27,265 35 437 132.63
Pashto پښتو ps 10,840 39,529 244,127 3 20,174 45 1,500 43.26
Hill Mari кырык мары mrj 10,268 16,587 95,426 1 8,236 18 0 2.18
Meadow Mari олык марий mhr 10,055 24,325 177,206 1 10,679 23 0 14.67
Classical Chinese 文言 zh-classical 9,985 80,949 340,565 6 79,910 59 0 212.5
Fiji Hindi Fiji Hindi hif 9,773 34,454 251,774 2 23,057 37 193 46.61
Navajo Diné bizaad nv 9,620 21,112 224,126 4 11,966 9 541 15.15
Central Bicolano Bikol Central bcl 9,257 17,089 185,018 2 14,627 25 880 7.75
Tarantino tarandíne roa-tara 9,248 17,380 135,744 2 9,023 10 290 6.04
North Frisian Nordfriisk frr 9,237 26,264 172,262 5 13,207 26 974 22.29
Acehnese Acèh ace 9,237 16,477 115,310 2 18,913 30 0 4.3
Hakka 客家語/Hak-kâ-ngî hak 9,232 17,730 116,743 1 25,131 21 0 5.58
Kapampangan Kapampangan pam 8,633 18,632 281,758 2 16,247 21 412 20.29
Northern Sotho Sesotho sa Leboa nso 8,173 10,073 39,222 1 4,332 13 0 0.21
Khmer ភាសាខ្មែរ km 7,741 29,011 236,532 5 28,028 59 1,196 61.56
Northern Sami davvisámegiella se 7,554 18,956 291,618 5 20,932 36 0 35.05
Rusyn русиньскый rue 7,281 14,807 114,758 1 18,493 16 0 8.28
Maori Māori mi 7,153 12,803 148,593 2 11,251 14 0 7.24
West Flemish West-Vlams vls 6,978 19,693 302,127 5 19,278 25 500 50.94
Nahuatl Nāhuatl nah 6,968 18,370 447,773 3 17,394 16 175 65.27
Bhojpuri भोजपुरी bh 6,933 54,845 661,665 2 19,738 25 56 576.17
Dutch Low Saxon Nedersaksies nds-nl 6,845 17,110 303,094 6 19,867 20 611 39.84
Crimean Tatar qırımtatarca crh 6,719 22,337 142,464 2 13,449 37 0 34.46
Gan 贛語 gan 6,426 33,323 392,146 4 32,873 21 146 206.17
Vepsian vepsän kel’ vep 6,196 23,005 119,770 1 11,227 21 0 38.32
Sardinian sardu sc 6,073 13,578 160,830 4 16,302 30 132 18.09
Assamese অসমীয়া as 6,014 46,660 207,798 5 24,104 89 1,169 203.43
Abkhazian Аҧсшәа ab 5,976 17,185 85,161 2 13,885 26 20 17.43
Gilaki گیلکی glk 5,912 12,265 53,377 3 11,535 19 807 5.03
Tibetan བོད་ཡིག bo 5,860 16,958 136,753 1 20,459 24 0 28.92
Erzya эрзянь myv 5,772 16,654 115,410 3 9,320 24 0 24.63
Corsican corsu co 5,689 12,932 363,967 2 15,654 28 0 45.62
Somali Soomaaliga so 5,663 20,164 196,107 1 23,944 85 0 63.77
Turkmen Türkmençe tk 5,563 13,174 209,910 3 18,164 27 308 29.83
Võro Võro fiu-vro 5,515 10,427 169,070 3 10,610 16 211 12.86
Northern Luri لۊری شومالی lrc 5,452 9,401 119,925 2 4,084 15 0 6.69
Komi коми kv 5,322 14,404 133,142 1 10,717 15 0 26.92
Kashubian kaszëbsczi csb 5,321 8,509 182,458 3 12,578 15 0 7.7
Manx Gaelg gv 4,985 16,882 298,086 3 15,058 15 181 100.57
Shona chiShona sn 4,829 11,720 69,079 1 10,331 21 0 12
Udmurt удмурт udm 4,716 14,335 113,139 5 11,092 19 9 32.83
Zeelandic Zeêuws zea 4,680 8,842 109,709 5 9,528 18 1 9.81
Aymara Aymar aru ay 4,641 8,210 90,698 1 12,620 21 0 6.53
Interlingue Interlingue ie 4,610 7,460 122,543 1 12,922 22 0 6.28
Picard Picard pcd 4,577 9,426 63,782 2 11,329 25 52 7.59
Norman Nouormand nrm 4,303 9,502 211,447 1 10,242 15 0 32.49
Kabyle Taqbaylit kab 4,267 11,737 92,178 2 9,115 21 0 24.07
Uyghur ئۇيغۇرچە / Uyghurche ug 4,161 12,210 145,845 1 15,395 16 291 44.7
Lezgian лезги lez 4,010 10,913 79,018 6 7,457 25 10 21.46
Saterland Frisian Seeltersk stq 4,009 10,421 118,206 4 10,368 13 442 29.02
Hausa Hausa ha 3,940 8,920 52,964 2 9,613 34 0 9.49
Cornish kernowek kw 3,900 8,034 174,862 1 10,878 15 0 24.46
Mirandese Mirandés mwl 3,767 9,940 95,635 3 10,007 20 0 25.84
Konkani गोंयची कोंकणी / Gõychi Konknni gom 3,719 8,294 181,208 3 5,378 11 0 33.06
Guarani Avañe'ẽ gn 3,714 8,850 109,182 2 12,921 22 0 23.59
Hawaiian Hawaiʻi haw 3,708 6,478 89,205 1 11,515 16 0 7.68
Romansh rumantsch rm 3,630 8,723 156,830 3 14,808 27 50 35.39
Ligurian Ligure lij 3,606 15,325 164,020 3 10,646 24 0 113.04
Lingua Franca Nova Lingua Franca Nova lfn 3,589 5,608 31,421 2 3,808 21 0 ——
Ladino Ladino lad 3,535 12,702 204,792 5 16,077 22 23 108.42
Lao ລາວ lo 3,502 10,788 83,005 1 11,775 24 0 33.3
Komi-Permyak Перем Коми koi 3,452 8,910 55,772 1 6,431 9 0 15.65
Maltese Malti mt 3,425 15,441 253,223 4 16,146 28 1,149 201.85
Franco-Provençal arpetan frp 3,381 8,730 189,906 2 11,550 22 0 54.45
Friulian furlan fur 3,336 7,817 166,472 2 10,828 18 318 38.42
Lower Sorbian dolnoserbski dsb 3,245 10,874 137,198 1 14,390 21 0 69.74
Doteli डोटेली dty 3,218 15,410 197,425 3 3,359 27 3 ——
Extremaduran estremeñu ext 3,152 7,155 110,612 1 13,213 19 0 24.93
Anglo-Saxon Ænglisc ang 3,143 14,904 196,234 2 102,208 36 300 184.36
Livvi-Karelian Livvinkarjala olo 3,128 7,612 25,988 2 4,010 20 0 ——
Lingala lingála ln 3,127 8,208 119,329 3 9,353 11 31 38.38
Chavacano Chavacano de Zamboanga cbk-zam 3,007 5,517 99,903 2 11,331 13 0 12.62
Divehi ދިވެހިބަސް dv 3,001 10,587 121,244 2 19,846 16 933 73.18
Banjar Banjar bjn 2,867 16,014 64,274 2 9,581 14 1 84.4
Ripuarian Ripoarisch ksh 2,858 10,347 1,600,668 3 17,999 18 0 1062.21
Gagauz Gagauz gag 2,722 6,258 63,687 1 9,250 12 0 17.17
Palatinate German Pälzisch pfl 2,547 6,603 84,052 4 8,071 16 0 32.28
Pali पालि pi 2,540 4,468 96,392 1 5,239 5 0 12.43
Pangasinan Pangasinan pag 2,528 7,693 66,358 1 5,959 9 0 36.01
Avar авар av 2,348 9,865 69,701 1 10,595 18 0 72.42
Buryat буряад bxr 2,162 7,888 54,210 2 10,922 19 0 48.21
Gorontalo Bahasa Hulontalo gor 2,150 4,386 22,959 3 1,131 11 0 ——
Kalmyk хальмг xal 2,082 10,864 80,817 1 7,668 12 0 132.35
Karachay-Balkar къарачай-малкъар krc 2,031 14,104 104,984 2 7,770 10 0 263.02
Zhuang Vahcuengh za 1,932 4,067 38,563 1 7,769 7 0 11.58
Papiamentu Papiamentu pap 1,916 4,658 72,031 2 9,570 15 0 31.67
Karakalpak Qaraqalpaqsha kaa 1,877 4,785 42,829 2 8,241 12 0 21.48
Pennsylvania German Deitsch pdc 1,871 5,655 102,992 1 25,072 14 0 74.49
Tuvan тыва дыл tyv 1,830 7,289 29,939 1 5,751 16 0 36.55
Kinyarwanda Kinyarwanda rw 1,824 5,164 71,681 1 8,303 11 0 46.54
Tongan lea faka-Tonga to 1,702 4,986 39,504 2 6,526 10 11 29.5
Greenlandic kalaallisut kl 1,671 3,992 72,052 2 10,028 15 0 34.82
Novial Novial nov 1,666 4,518 174,741 2 8,623 8 0 113.34
Jamaican Patois Patois jam 1,643 2,889 19,706 1 4,672 19 0 3.92
Aramaic ܐܪܡܝܐ arc 1,637 5,984 93,084 2 15,512 17 0 109.69
Kabiye Kabɩyɛ kbp 1,599 3,113 14,409 1 2,220 8 0 ——
Kabardian Адыгэбзэ kbd 1,583 6,639 42,206 1 7,311 5 0 64.85
Santali ᱥᱟᱱᱛᱟᱲᱤ sat 1,507 5,730 30,588 3 1,760 34 0 ——
Tok Pisin Tok Pisin tpi 1,506 5,701 84,622 1 9,578 15 0 115.17
Tetum tetun tet 1,473 3,779 62,994 2 7,166 10 0 40.85
Igbo Igbo ig 1,430 5,950 64,661 2 9,837 39 0 108.58
Kikuyu Gĩkũyũ ki 1,362 2,885 19,535 1 5,528 11 0 8.47
Zulu isiZulu zu 1,351 5,469 54,216 1 12,457 69 0 92.1
Wolof Wolof wo 1,342 4,944 101,564 2 11,429 18 0 147.99
Nauruan Dorerin Naoero na 1,307 4,116 81,031 1 8,662 9 0 90.93
Lojban la .lojban. jbo 1,246 5,591 109,019 3 12,124 15 0 237.11
Aromanian armãneashti roa-rup 1,226 3,871 198,517 1 11,028 9 0 238.7
Bislama Bislama bi 1,220 2,934 37,979 1 8,584 14 0 ——
Lak лакку lbe 1,219 12,259 45,233 1 6,513 8 0 302.64
Tahitian reo tahiti ty 1,203 2,874 52,099 1 5,722 7 0 34.98
Moksha мокшень mdf 1,198 6,453 50,770 2 6,918 15 0 151.38
Kongo Kongo kg 1,198 2,801 42,726 2 7,731 12 0 27.31
Tulu ತುಳು tcy 1,180 4,799 68,352 2 2,987 40 0 ——
Luganda Luganda lg 1,178 4,410 25,046 1 5,262 15 0 42.75
Sranan Sranantongo srn 1,075 2,654 38,044 1 5,699 10 0 30.93
Ingush ГӀалгӀай inh 1,050 4,136 30,545 2 1,325 7 0 ——
Xhosa isiXhosa xh 1,029 3,590 31,083 1 8,616 32 0 53.63
Atikamekw Atikamekw atj 1,013 1,926 11,159 5 2,004 21 0 ——
Latgalian latgaļu ltg 921 3,090 33,878 1 5,216 11 0 60.81
Cherokee ᏣᎳᎩ chr 826 3,644 43,812 1 15,414 11 0 139.94
Samoan Gagana Samoa sm 812 2,905 39,275 1 6,840 10 0 89.82
Norfolk Norfuk / Pitkern pih 792 3,063 41,108 2 8,012 12 0 ——
Oromo Oromoo om 786 3,268 31,718 1 6,623 15 0 96.78
Akan Akan ak 729 2,625 21,313 1 8,608 23 0 ——
Tswana Setswana tn 699 2,814 22,623 1 6,807 7 0 73.6
Old Church Slavonic словѣньскъ / ⰔⰎⰑⰂⰡⰐⰠⰔⰍⰟ cu 689 4,916 76,826 2 18,534 18 0 588.2
Twi Twi tw 675 2,190 19,830 1 8,849 14 0 45.61
Tsonga Xitsonga ts 673 3,531 34,517 2 6,708 21 0 ——
Romani romani čhib rmy 666 2,562 47,767 2 13,536 11 0 151.1
Bambara bamanankan bm 654 2,534 38,161 1 7,967 11 0 ——
Sesotho Sesotho st 630 2,338 23,530 1 7,136 25 0 ——
Cheyenne Tsetsêhestâhese chy 618 2,181 23,346 1 8,338 17 0 68.47
Kirundi Kirundi rn 615 2,331 20,314 1 5,953 13 0 ——
Gothic 𐌲𐌿𐍄𐌹𐍃𐌺 got 600 3,252 39,746 3 13,990 17 0 ——
Tumbuka chiTumbuka tum 576 1,898 21,641 1 5,630 8 0 60.06
Chichewa Chi-Chewa ny 551 2,918 21,460 4 6,075 14 0 ——
Swati SiSwati ss 492 2,238 36,746 3 6,030 16 0 ——
Chamorro Chamoru ch 482 2,379 20,770 1 11,563 14 0 ——
Pontic Ποντιακά pnt 465 1,984 34,226 1 7,549 13 0 ——
Fijian Na Vosa Vakaviti fj 432 1,994 31,033 1 6,255 9 0 ——
Adyghe адыгабзэ ady 415 1,892 8,304 2 3,677 10 0 ——
Inuktitut ᐃᓄᒃᑎᑐᑦ/inuktitut iu 401 2,878 42,328 2 13,198 9 0 ——
Venda Tshivenda ve 369 1,760 17,093 1 4,978 8 0 ——
Ewe eʋegbe ee 355 2,664 48,137 2 10,586 15 0 ——
Kashmiri कॉशुर / کٲشُر ks 325 1,645 32,719 1 7,058 7 0 ——
Inupiak Iñupiak ik 278 2,301 35,306 1 6,123 9 0 ——
Sango Sängö sg 261 1,628 19,469 1 4,758 9 0 ——
Fula Fulfulde ff 229 1,822 21,379 1 6,049 13 0 ——
Dzongkha ཇོང་ཁ dz 218 2,031 27,836 1 7,135 11 0 ——
Tigrinya ትግርኛ ti 168 1,636 19,490 1 6,374 9 0 ——
Dinka Thuɔŋjäŋ din 106 717 4,854 1 4,137 10 0 ——
Cree Nēhiyawēwin / ᓀᐦᐃᔭᐍᐏᐣ cr 104 2,028 33,996 2 11,639 12 0 ——
Ndonga Oshiwambo ng 8 441 5,920 1 1,755 1 0 ——
Choctaw Choctaw cho 6 200 4,217 1 1,411 0 0 ——
Kuanyama Kwanyama kj 4 113 3,547 1 1,140 0 0 ——
Marshallese Ebon mh 4 205 4,211 1 1,743 0 0 ——
Hiri Motu Hiri Motu ho 3 128 3,785 1 1,278 0 0 ——
Nuosu ꆇꉙ ii 3 188 11,652 1 1,546 0 0 ——
Afar Qafár af aa 1 509 4,680 1 3,249 0 0 ——
Muscogee Mvskoke mus 1 114 3,600 1 1,630 0 0 ——
Herero Otsiherero hz 0 175 4,480 1 3,081 1 0 ——
Kanuri Kanuri kr 0 161 4,640 1 4,404 0 0 ——
Shan ၽႃႇသႃႇတႆး shn 0 0 0 0 0 0 0 ——
Western Armenian Արեւմտահայերէն hyw 0 0 0 0 0 0 0 ——
N'Ko ߒߞߏ nqo 0 0 0 0 0 0 0 ——
Balinese Bali ban 0 0 0 0 0 0 0 ——
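
Because language names in the table can span several words, the Wiki Code is easiest to find by counting from the right: it is the last field before the eight numeric columns. The sketch below illustrates that idea. The function names are hypothetical, and it assumes each row is one line of whitespace-separated fields, as in the table above.

```python
# Articles, Total, Edits, Admins, Users, Active Users, Images, Depth
NUMERIC_COLUMNS = 8


def split_row(row):
    """Split one table row into fields and check that it is long enough."""
    fields = row.split()
    # A valid row has at least a language, a local name, a code, and 8 numbers.
    if len(fields) < NUMERIC_COLUMNS + 3:
        raise ValueError(f"not a table row: {row!r}")
    return fields


def wiki_code(row):
    """Return the Wiki Code: the last field before the numeric columns."""
    return split_row(row)[-(NUMERIC_COLUMNS + 1)]


def article_count(row):
    """Return the Articles column as an integer (thousands separators removed)."""
    return int(split_row(row)[-NUMERIC_COLUMNS].replace(",", ""))


if __name__ == "__main__":
    row = "Bengali বাংলা bn 75,469 708,549 3,755,854 12 251,740 1,218 6,773 373.01"
    print(wiki_code(row), article_count(row))
```

Counting from the right avoids having to guess where a multi-word name such as "Norwegian (Bokmål)" ends.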
