This project prepares the DBpedia dataset for loading into Dgraph. This comprises four steps: download, extraction, pre-processing and bulk loading. The first two steps are done with the provided shell scripts, the third uses Apache Spark transformations, and the last uses the Dgraph Bulk Loader.
I was looking for a large real-world graph dataset to load into a Dgraph cluster to ultimately test the spark-dgraph-connector. Dgraph organizes the graph around predicates, so that dataset should contain predicates with these characteristics:
- a predicate that links a deep hierarchy of nodes
- a predicate that links a deep network of nodes
- a predicate that links strongly connected components
- a predicate with a lot of data, ideally a long string that exists for every node and with multiple languages
- a predicate with geo coordinates
- numerous predicates, to have a large schema
- a long-tail predicate frequency distribution: a few predicates have high frequency (and low selectivity), most predicates have low frequency (and high selectivity)
- predicates that, if they exist for a node:
  - have a single occurrence (single value)
  - have multiple occurrences (value list)
- real-world predicate names in multiple languages
- various data types and strings in multiple languages
A dataset that checks all these boxes can be found at the DBpedia project. They extract structured information from the Wikipedia project and provide it in RDF format. However, that RDF data requires some preparation before it can be loaded into Dgraph. The size of the datasets requires a scalable pre-processing step.
This project uses Apache Spark to bring real-world graph data into a Dgraph-compatible shape. Read the detailed tutorial on the pre-processing steps.
This tutorial has the following requirements:
- Unix command line shell bash
- Apache Maven installed
- Docker CLI installed
- A multi-core machine with SSD disk
- Disk space: 19 GB to download, 374 GB to extract, 42 GB for parquet, 21 GB for gz'ed RDF, 64 GB Dgraph alpha space and 75 GB temporary space
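The disk space figures above add up when all intermediate files are kept. The following is a minimal sketch that sums them and compares the total against the free space of the current directory. The function names total_required_gb and has_free_gb are hypothetical helpers, not part of this project, and the sketch assumes that nothing is deleted between the steps.

```shell
#!/usr/bin/env bash
# Disk space figures (GB) quoted in the requirements list above.
download_gb=19; extract_gb=374; parquet_gb=42
rdf_gz_gb=21; dgraph_gb=64; temp_gb=75

# Sum of all stages, assuming no intermediate files are deleted.
total_required_gb() {
  echo $(( download_gb + extract_gb + parquet_gb + rdf_gz_gb + dgraph_gb + temp_gb ))
}

# Succeeds if the file system holding directory $1 has at least $2 GB free.
has_free_gb() {
  local avail_kb
  # POSIX df output: the 4th column of the 2nd line is the available space in KB.
  avail_kb=$(df -Pk "$1" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

echo "total required: $(total_required_gb) GB"
if has_free_gb . "$(total_required_gb)"; then
  echo "enough free space in $(pwd)"
else
  echo "not enough free space in $(pwd)"
fi
```

If you remove the extracted ttl files once the parquet files exist, the peak requirement is considerably lower than this total.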
Run the following commands on an Ubuntu instance to set it up:
sudo apt update
sudo apt install -y git maven parallel docker.io zip
sudo usermod -aG docker ${USER} # you may need to run `sudo systemctl restart docker` and re-login for this to work
git clone https://github.com/G-Research/dgraph-dbpedia.git
cd dgraph-dbpedia
This tutorial uses the following datasets from DBpedia project:
dataset | filename | description |
---|---|---|
labels | labels_{lang}.ttl | Each article has a single title in the article's language. |
category | article_categories_{lang}.ttl | Some articles link to categories, multiple categories allowed. |
skos | skos_categories_{lang}.ttl | Categories link to broader categories. Forms a deep hierarchy. |
inter-language links | interlanguage_links_{lang}.ttl | Articles link to the same article in all other languages. Forms strongly connected components. |
page links | page_links_{lang}.ttl | Articles link to other articles or other resources. Forms a network of articles. |
infobox | infobox_properties_{lang}.ttl | Some articles have infoboxes. Provides structured information as key-value tables. |
geo coordinates | geo_coordinates_{lang}.ttl | Some articles have geo coordinates of type Point. |
en_uris | {dataset}_en_uris_{lang}.ttl | Non-English labels, infobox and category predicates for English articles. Provides multiple language strings and predicates for articles. |
The infobox dataset provides real-world user-generated multi-language predicates. The other datasets provide a fixed set of predicates each.
The {dataset}_en_uris_{lang}.ttl dataset is special. For the three datasets labels, infobox and category, it provides non-English data for English articles. All data from these datasets are stored in parquet and RDF under the en-{lang} languages. For instance, when you run DbpediaDgraphSparkApp with languages en and de, you will also get the {dataset}_en_uris_de.ttl dataset as language en-de. Without the en language, you will not get any of the {dataset}_en_uris_{lang}.ttl datasets.
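The naming rule above can be sketched as a small shell function that maps a ttl file name to the language under which its data ends up. The function name lang_partition is a hypothetical helper for illustration. The project itself implements this mapping in its Spark code, which may differ in detail.

```shell
# Map a DBpedia file name to the language its data is stored under:
# {dataset}_{lang}.ttl         -> {lang}
# {dataset}_en_uris_{lang}.ttl -> en-{lang}
lang_partition() {
  local name="${1##*/}"       # strip any leading directory
  name="${name%.ttl}"         # strip the .ttl extension
  local lang="${name##*_}"    # language is the last underscore-separated token
  case "$name" in
    *_en_uris_*) printf 'en-%s\n' "$lang" ;;
    *)           printf '%s\n' "$lang" ;;
  esac
}

lang_partition labels_de.ttl                      # de
lang_partition labels_en_uris_de.ttl              # en-de
lang_partition article_categories_en_uris_fr.ttl  # en-fr
```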
You can easily prepare any subset of this dataset. Download only those datasets and languages that you are interested in. Start with a small language to go through these steps once. Then download and prepare all languages that you want.
Even when you download and pre-process all datasets and all languages into RDF you will be able to pick datasets and languages when bulk-loading data into Dgraph. RDF files are split by languages and the schema is additionally split by dataset, so that you can easily load the schema only of the dataset that you are bulk loading.
Use the download.sh script to download the datasets and languages that you want to load into Dgraph:
./download.sh [path] [languages]
Both arguments path and languages are optional. Without them, the script downloads all languages into ./dbpedia. To download only selected languages, run
./download.sh dbpedia "en es fr zh ja"
You can find all available releases and datasets at http://downloads.dbpedia.org.
Stats for each release date are published in the statistics sub-directory, e.g. http://downloads.dbpedia.org/2016-10/statistics.
Downloading all datasets in all languages requires 19 GB of disk space.
DBpedia datasets are compressed and will be pre-processed using Spark. The compressed files cannot be processed efficiently, so they have to be extracted first.
Run the extract.sh script:
./extract.sh dbpedia/2016-10
Extracting all datasets in all languages requires 374 GB of disk space.
The provided Scala Spark code pre-processes the downloaded and extracted datasets and produces Dgraph-compatible RDF triples.
First we produce parquet files from all ttl files. All languages will be stored in one parquet directory per dataset, where languages can still be selected in later steps.
MAVEN_OPTS=-Xmx2g mvn compile exec:java -Dexec.cleanupDaemonThreads=false -Dexec.mainClass="uk.co.gresearch.dgraph.dbpedia.DbpediaToParquetSparkApp" -Dexec.args="dbpedia 2016-10"
Secondly, process these parquet files into RDF triple files:
MAVEN_OPTS=-Xmx8g mvn compile exec:java -Dexec.cleanupDaemonThreads=false -Dexec.mainClass="uk.co.gresearch.dgraph.dbpedia.DbpediaDgraphSparkApp" -Dexec.args="dbpedia 2016-10"
These commands can optionally be given a comma-separated list of language codes: -Dexec.args="dbpedia 2016-10 en,es,fr,de". Without those language codes, all languages will be processed.
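Note that download.sh takes a space-separated language list, while the Spark applications take a comma-separated one. As a minimal sketch, the value for -Dexec.args could be derived from the same list like this. The function name exec_args is a hypothetical helper, not part of this project.

```shell
# Build the value for -Dexec.args from a base path, a release and an
# optional space-separated language list (the format download.sh accepts).
exec_args() {
  local base="$1" release="$2" langs="${3:-}"
  if [ -z "$langs" ]; then
    # No languages given: the applications then process all languages.
    printf '%s %s\n' "$base" "$release"
  else
    # Turn runs of spaces into single commas.
    printf '%s %s %s\n' "$base" "$release" "$(printf '%s' "$langs" | tr -s ' ' ',')"
  fi
}

exec_args dbpedia 2016-10 "en es fr de"   # dbpedia 2016-10 en,es,fr,de
exec_args dbpedia 2016-10                 # dbpedia 2016-10
```

The result can then be passed on as -Dexec.args="$(exec_args dbpedia 2016-10 "en es fr de")".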
There are more options at the beginning of the main method in DbpediaDgraphSparkApp.scala:
val externaliseUris = false
val removeLanguageTags = false
val topInfoboxPropertiesPerLang = Some(100)
val printStats = true
With externaliseUris = true the application turns all URIs into blank nodes and produces an external_ids.rdf file, which provides the <xid> predicate for each blank node with the URI as a string value. See External IDs for more information.
Language tags can be removed from any value with removeLanguageTags = true. The @lang directives are then also removed from the schema files schema.dgraph and schema.indexed.dgraph.
With topInfoboxPropertiesPerLang = Some(100), only the 100 largest infobox properties per language are provided in the RDF files. This can be used to control the size of the schema while still including rich predicates. Use None to get all 1 million predicates from all datasets and languages.
With printStats = false you can turn off some stats, which reduces the processing time of the application.
The DbpediaDgraphSparkApp requires at least 1 GB per CPU core, ideally 2 GB. You can set the memory available to the application via the MAVEN_OPTS environment variable:
MAVEN_OPTS=-Xmx8g mvn compile exec:java …
On termination, the application prints some information like the following line:
memory spill: 51 GB disk spill: 4 GB peak mem per host: 874 MB
This indicates whether more memory should be given to the application. A large disk spill figure (in the high two-digit GB range) indicates a lack of memory per core.
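The memory rule of thumb and the summary line can be turned into a quick check. This is a minimal sketch: suggest_heap and disk_spill_gb are hypothetical helper names, and the parsing assumes that the disk spill figure is reported in GB exactly as in the example line above.

```shell
# Heap sizes following the rule of thumb: at least 1 GB per core, ideally 2 GB.
suggest_heap() {
  local cores="$1"
  printf 'min=-Xmx%dg ideal=-Xmx%dg\n' "$cores" $(( cores * 2 ))
}

# Extract the "disk spill" figure (in GB) from the application's summary line.
# Prints nothing if the line does not contain a figure in GB.
disk_spill_gb() {
  printf '%s\n' "$1" | sed -n 's/.*disk spill: \([0-9][0-9]*\) GB.*/\1/p'
}

suggest_heap 8
# min=-Xmx8g ideal=-Xmx16g
disk_spill_gb "memory spill: 51 GB disk spill: 4 GB peak mem per host: 874 MB"
# 4
```

On Linux, suggest_heap "$(nproc)" gives the figures for the current machine.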
The above example:
- downloads to dbpedia/2016-10/core-i18n/{lang}/{dataset}_{lang}.ttl.bz2
- extracts to dbpedia/2016-10/core-i18n/{lang}/{dataset}_{lang}.ttl
- loads into dbpedia/2016-10/core-i18n/{dataset}.parquet
- pre-processes to dbpedia/2016-10/core-i18n/{dataset}.rdf
- writes the schema to dbpedia/2016-10/core-i18n/schema.dgraph and dbpedia/2016-10/core-i18n/schema.indexed.dgraph
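To see where the files of one dataset and language end up at each stage, the path patterns listed above can be expanded with a small function. The name stage_paths is a hypothetical helper. It only prints paths and does not check whether they exist.

```shell
# Print the location of one dataset/language at each pipeline stage,
# following the path patterns of the example above.
stage_paths() {
  local base="$1" release="$2" dataset="$3" lang="$4"
  local dir="$base/$release/core-i18n"
  printf 'download: %s/%s/%s_%s.ttl.bz2\n' "$dir" "$lang" "$dataset" "$lang"
  printf 'extract:  %s/%s/%s_%s.ttl\n' "$dir" "$lang" "$dataset" "$lang"
  printf 'parquet:  %s/%s.parquet\n' "$dir" "$dataset"
  printf 'rdf:      %s/%s.rdf/lang=%s\n' "$dir" "$dataset" "$lang"
}

stage_paths dbpedia 2016-10 labels de
```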
Individual languages can be found in dbpedia/2016-10/core-i18n/{dataset}.rdf/lang={language}.
Besides the datasets article_categories.rdf, infobox_properties.rdf, interlanguage_links.rdf, labels.rdf, skos_categories_{lang}.rdf, page_links_{lang}.rdf and geo_coordinates_{lang}.rdf, you can find data from {dataset}_en_uris_{lang}.ttl in {dataset}.rdf/lang=en-{lang}.
External ids can be found (when externaliseUris = true) in external_ids.rdf.
The schema for all predicates is provided without indices in schema.dgraph and with indices in schema.indexed.dgraph.
Load all datasets and all languages:
./dgraph.bulk.sh $(pwd)/dbpedia/2016-10/core-i18n $(pwd)/dbpedia/2016-10/bulk "/data/schema.indexed.dgraph/dataset=*/lang=*/part-*.txt" "/data/*.rdf/lang=*/part-*.txt.gz"
Load a subset of datasets and languages, defined via langs and datasets:
export langs="en|en-es|en-fr|en-zh|en-ja"; export datasets="labels|infobox_properties"; ./dgraph.bulk.sh $(pwd)/dbpedia/2016-10/core-i18n $(pwd)/dbpedia/2016-10/bulk "/data/schema.dgraph/dataset=@($datasets)/lang=@($langs|any)/part-*.txt" "/data/@($datasets).rdf/lang=@($langs)/part-*.txt.gz"
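The two quoted patterns in the subset command use the @(a|b) alternation syntax known from bash extended globs. As a minimal sketch, they can be printed for a given selection before handing them to dgraph.bulk.sh. The function name bulk_globs is a hypothetical helper, and how dgraph.bulk.sh evaluates the patterns is not shown here.

```shell
# Print the schema pattern and the RDF pattern for a selection of
# datasets and languages, both given as |-separated lists.
bulk_globs() {
  local datasets="$1" langs="$2"
  printf '/data/schema.dgraph/dataset=@(%s)/lang=@(%s|any)/part-*.txt\n' "$datasets" "$langs"
  printf '/data/@(%s).rdf/lang=@(%s)/part-*.txt.gz\n' "$datasets" "$langs"
}

bulk_globs "labels|infobox_properties" "en|en-es|en-fr"
```

Keep the patterns quoted when passing them on, so that your own shell does not try to expand them.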
Either use schema.indexed.dgraph with the bulk loader to populate the indices during bulk loading, or bulk load with schema.dgraph and mutate the schema to schema.indexed.dgraph afterwards.
Bulk-loading the full dataset takes 3 hours and requires 120 GB RAM and 65 GB disk space (v20.11.0-g1003e71bd with schema.indexed.dgraph) and 2 1/2 hours with 85 GB RAM and 65 GB disk space (v21.03.0).
Start the Dgraph cluster on your bulk-loaded data:
./dgraph.serve.sh $(pwd)/dbpedia/2016-10/bulk
Serving the full dataset (with top-100 infobox properties) requires at least 4 GB RAM.
Then open up Ratel UI:
http://localhost:8000/?latest#
Connect to the cluster and then query in the Console.
Query for the Wikipedia article about Wikipedia (http://dbpedia.org/page/Wikipedia):
{
  query(func: eq(<xid>, "http://dbpedia.org/resource/Wikipedia")) {
    uid
    xid
    <http://www.w3.org/2000/01/rdf-schema#label>@*
    <http://www.w3.org/2002/07/owl#sameAs> {
      uid
      xid
      <http://www.w3.org/2000/01/rdf-schema#label>@*
    }
  }
}
Result:
{
"data": {
"query": [
{
"uid": "0x191439",
"xid": "http://dbpedia.org/resource/Wikipedia",
"http://www.w3.org/2000/01/rdf-schema#label@ne": "विकिपिडिया",
"http://www.w3.org/2000/01/rdf-schema#label@yo": "Wikipedia",
"http://www.w3.org/2000/01/rdf-schema#label@sco": "Wikipaedia",
"http://www.w3.org/2000/01/rdf-schema#label@gd": "Uicipeid",
"http://www.w3.org/2000/01/rdf-schema#label@si": "විකිපීඩියා",
"http://www.w3.org/2000/01/rdf-schema#label@pa": "ਵਿਕੀਪੀਡੀਆ",
"http://www.w3.org/2000/01/rdf-schema#label@pms": "Wikipedia",
"http://www.w3.org/2000/01/rdf-schema#label@ce": "Википеди",
"http://www.w3.org/2000/01/rdf-schema#label@az": "Vikipediya",
"http://www.w3.org/2000/01/rdf-schema#label@nah": "Huiquipedia",
"http://www.w3.org/2000/01/rdf-schema#label@zh": "维基百科",
"http://www.w3.org/2000/01/rdf-schema#label@su": "Wikipédia",
"http://www.w3.org/2000/01/rdf-schema#label@ml": "വിക്കിപീഡിയ",
"http://www.w3.org/2000/01/rdf-schema#label@vo": "Vükiped",
"http://www.w3.org/2000/01/rdf-schema#label@ka": "ვიკიპედია",
"http://www.w3.org/2000/01/rdf-schema#label@ckb": "ویکیپیدیا",
"http://www.w3.org/2000/01/rdf-schema#label@be": "Вікіпедыя",
"http://www.w3.org/2000/01/rdf-schema#label@sl": "Wikipedija",
"http://www.w3.org/2000/01/rdf-schema#label@bpy": "উইকিপিডিয়া",
"http://www.w3.org/2000/01/rdf-schema#label@am": "ውክፔዲያ",
"http://www.w3.org/2000/01/rdf-schema#label@my": "ဝီကီပီးဒီးယား",
"http://www.w3.org/2000/01/rdf-schema#label@ko": "위키백과",
"http://www.w3.org/2000/01/rdf-schema#label@te": "వికీపీడియా",
"http://www.w3.org/2000/01/rdf-schema#label@th": "วิกิพีเดีย",
"http://www.w3.org/2000/01/rdf-schema#label@sah": "Бикипиэдьийэ",
"http://www.w3.org/2000/01/rdf-schema#label@ur": "ویکیپیڈیا",
"http://www.w3.org/2002/07/owl#sameAs": [
{
"uid": "0x18cd82",
"xid": "http://tl.dbpedia.org/resource/Wikipedia",
"http://www.w3.org/2000/01/rdf-schema#label@tl": "Wikipedia"
},
{
"uid": "0x558f93",
"xid": "http://de.dbpedia.org/resource/Wikipedia",
"http://www.w3.org/2000/01/rdf-schema#label@de": "Wikipedia"
},
{
"uid": "0x61c424",
"xid": "http://ky.dbpedia.org/resource/Уикипедия",
"http://www.w3.org/2000/01/rdf-schema#label@ky": "Уикипедия"
},
{
"uid": "0x61c425",
"xid": "http://ml.dbpedia.org/resource/വിക്കിപീഡിയ",
"http://www.w3.org/2000/01/rdf-schema#label@ml": "വിക്കിപീഡിയ"
},
…
]
}
]
}
}
Query for things in Berlin, Germany:
{
  query(func: near(<http://www.georss.org/georss/point>, [13.41053, 52.52437], 1000), first: 1000) {
    uid
    name: <http://www.w3.org/2000/01/rdf-schema#label>@*
    location: <http://www.georss.org/georss/point>
  }
}
dataset | triples | nodes | predicates | .bz2 | .ttl | .parquet | .rdf | schema |
---|---|---|---|---|---|---|---|---|
labels | 94,410,463 | 76,478,687 | 1 | 1 GB | 12 GB | 2 GB | 1 GB | Article --rdfs:label-> lang string |
article_categories | 149,254,994 | 41,655,032 | 1 | 1 GB | 22 GB | 3 GB | 2 GB | Article --dcterms:subject-> Category |
skos_categories | 32,947,632 | 8,447,863 | 4 | 0.3 GB | 8 GB | 0.7 GB | 0.4 GB | Category --skos-core:broader-> Category |
interlanguage_links | 546,769,314 | 49,426,513 | 1 | 5 GB | 92 GB | 11 GB | 5 GB | Article --owl:sameAs-> Article |
page_links | 1,042,567,811 | 76,392,179 | 1 | 7 GB | 154 GB | 17 GB | 10 GB | Article --dbpedia:wikiPageWikiLink-> Article |
geo_coordinates | 1,825,817 | 1,825,817 | 1 | 0.05 GB | 1 GB | 0.1 GB | 0.03 GB | Article --georss:point-> geoJSON |
top-100 infobox_properties | 298,289,529 | 27,332,261 | 12,714 | | | | 3 GB | Article --property-> literal or uri |
all (with top-100 infobox) | 2,166,065,560 | 86,737,376 | 12,723 | | | | 21 GB | |
all infobox_properties | 596,338,417 | 29,753,821 | 1,050,875 | 4 GB | 87 GB | 9 GB | 6 GB | Article --property-> literal or uri |
all (with all infobox) | 2,396,517,559 | 86,737,376 | 1,050,884 | 19 GB | 372 GB | 42 GB | 24 GB |
The original dataset files are .bz2 compressed and 19 GB in size. They extract to .ttl files of 374 GB in size. Loaded into parquet, they consume 42 GB. Processed into TTL RDF files (with top 100 infobox properties), they occupy 21 GB .gz compressed and 317 GB uncompressed. The RDF loaded into Dgraph requires 36 GB for the out directory with indices and 27 GB without, 7 GB for xidmap, and 16 GB for zw.
Loading the entire dataset into parquet takes 2 hours on an 8-core machine with SSD disk and 2 GB JVM memory. This requires 10 to 30 GB of temporary disk space, depending on the dataset that you are loading. With a 32-core machine and 32 GB JVM memory, this takes 35 min and 32 GB temporary disk space.
Processing the dataset into RDF takes 1 1/2 hours on an 8-core machine with SSD disk and 12 GB JVM memory. With a 32-core machine and 96 GB JVM memory, this takes 30 min and 32 GB temporary disk space.
Loading the RDF into Dgraph takes 3 hours with indices and 2 1/2 hours without, and requires 120 GB RAM.
The following language codes are available for the 2016-10 datasets in core-i18n:
af als am an ar arz ast az azb ba bar be bg bn bpy br bs bug
ca ce ceb ckb cs cv cy da de el en eo es et eu fa fi fo fr fy
ga gd gl gu he hi hr hsb ht hu hy ia id io is it ja jv ka kk
kn ko ku ky la lb li lmo lt lv mg min mk ml mn mr mrj ms my
mzn nah nap nds ne new nl nn no oc or os pa pl pms pnb pt qu
ro ru sa sah scn sco sh si sk sl sq sr su sv sw ta te tg th
tl tr tt uk ur uz vec vi vo wa war yi yo zh
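Since only the codes listed above are available, it can help to check a language selection before downloading. For example, Japanese is ja, and jp is not in the list. The following is a minimal sketch: is_known_lang is a hypothetical helper, and KNOWN_LANGS is simply a copy of the list above.

```shell
# Language codes available for the 2016-10 datasets in core-i18n (copied from the list above).
KNOWN_LANGS="af als am an ar arz ast az azb ba bar be bg bn bpy br bs bug \
ca ce ceb ckb cs cv cy da de el en eo es et eu fa fi fo fr fy \
ga gd gl gu he hi hr hsb ht hu hy ia id io is it ja jv ka kk \
kn ko ku ky la lb li lmo lt lv mg min mk ml mn mr mrj ms my \
mzn nah nap nds ne new nl nn no oc or os pa pl pms pnb pt qu \
ro ru sa sah scn sco sh si sk sl sq sr su sv sw ta te tg th \
tl tr tt uk ur uz vec vi vo wa war yi yo zh"

# Succeeds if $1 is one of the known language codes (whole-word match).
is_known_lang() {
  case " $KNOWN_LANGS " in
    *" $1 "*) return 0 ;;
    *)        return 1 ;;
  esac
}

for lang in en ja jp; do
  if is_known_lang "$lang"; then echo "$lang: ok"; else echo "$lang: unknown"; fi
done
# en: ok
# ja: ok
# jp: unknown
```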
These datasets provide the following number of triples per language:
lang | labels | interlanguage | page links | categories | skos | geo | infobox | top 100 infobox |
---|---|---|---|---|---|---|---|---|
af | 62,902 | 1,875,080 | 1,025,708 | 64,901 | 27,722 | | 468,032 | 254,962 |
als | 27,077 | 942,457 | 544,847 | 45,067 | 12,328 | 1 | 216,224 | 154,004 |
am | 19,697 | 566,489 | 116,211 | 16,300 | 3,194 | 250 | 28,372 | 23,039 |
an | 57,240 | 1,790,964 | 1,155,374 | 62,983 | 65,608 | 37 | 384,921 | 267,495 |
ar | 871,405 | 8,456,373 | 9,470,455 | 1,942,195 | 797,717 | 12,899 | 4,574,804 | 2,727,047 |
arz | 22,803 | 802,974 | 235,009 | 48,724 | 19,255 | 689 | 54,399 | 26,510 |
ast | 55,036 | 1,588,057 | 1,686,826 | 161,813 | 20,438 | 205 | 396,939 | 299,062 |
az | 143,422 | 3,436,415 | 1,519,074 | 219,855 | 194,832 | 780 | 783,088 | 510,857 |
azb | 13,391 | 236,252 | 111,743 | 13,468 | 6,828 | 28 | 76,616 | 65,412 |
ba | 42,116 | 1,274,257 | 448,334 | 71,464 | 76,453 | 48 | 500,493 | 431,324 |
bar | 34,916 | 1,038,524 | 633,300 | 48,230 | 27,264 | | 189,491 | 144,376 |
be | 217,992 | 4,071,053 | 2,492,728 | 307,957 | 216,374 | 1,182 | 1,036,032 | 763,998 |
bg | 333,098 | 5,662,926 | 5,613,997 | 491,931 | 209,189 | 2,394 | 1,199,635 | 751,318 |
bn | 214,617 | 2,017,159 | 1,238,572 | 152,577 | 86,507 | 2,824 | 676,004 | 357,148 |
bpy | 25,404 | 1,013,214 | 206,059 | 33,187 | 18,421 | 242 | 288,330 | 278,184 |
br | 79,421 | 2,486,979 | 1,349,937 | 122,690 | 66,130 | 25 | 163,068 | 142,927 |
bs | 170,881 | 2,963,799 | 2,531,453 | 184,450 | 189,966 | 486 | 1,122,106 | 846,358 |
bug | 14,230 | 459,340 | 128,339 | 14,239 | 1,158 | | 3,345 | 3,286 |
ca | 868,322 | 8,919,143 | 15,628,064 | 999,079 | 254,533 | 89,582 | 5,555,600 | 3,829,640 |
ce | 157,642 | 870,894 | 1,086,175 | 165,082 | 14,290 | 748 | 2,152,346 | 2,126,202 |
ceb | 4,864,101 | 12,223,691 | 40,602,693 | 6,060,892 | 543,714 | 297,725 | 48,599,129 | 48,590,356 |
ckb | 56,402 | 1,623,083 | 256,694 | 46,733 | 78,501 | 710 | 121,163 | 88,801 |
cs | 593,897 | 7,280,033 | 13,186,669 | 1,455,928 | 441,698 | 1 | 3,548,877 | 2,356,619 |
cv | 38,674 | 972,803 | 484,472 | 66,606 | 12,986 | 57 | 322,440 | 277,126 |
cy | 124,873 | 2,670,515 | 4,791,110 | 189,369 | 99,216 | 1,114 | 5,867,747 | 5,758,012 |
da | 353,730 | 5,433,263 | 6,026,849 | 574,511 | 191,523 | 18,871 | 1,827,254 | 967,812 |
de | 3,343,471 | 16,120,815 | 68,802,723 | 8,185,499 | 1,155,538 | 9 | 14,048,415 | 8,898,574 |
el | 185,210 | 3,663,190 | 3,447,861 | 314,211 | 153,419 | 12,774 | 497,520 | 235,298 |
en | 12,845,252 | 29,476,295 | 183,605,695 | 23,990,512 | 6,083,029 | 580,892 | 52,680,098 | 26,966,738 |
en-* | 17,933,697 | | | 36,147,553 | | | 178,020,543 | 24,657,904 |
eo | 393,119 | 6,096,191 | 6,008,777 | 527,912 | 203,701 | | 1,949,066 | 1,416,087 |
es | 2,906,977 | 15,451,078 | 44,035,493 | 3,622,137 | 1,348,394 | 40,737 | 10,858,241 | 5,925,338 |
et | 260,421 | 3,761,071 | 4,392,950 | 320,249 | 92,385 | 507 | 501,304 | 339,731 |
eu | 333,949 | 7,018,090 | 4,192,367 | 427,983 | 202,547 | | 2,046,149 | 1,703,815 |
fa | 1,891,729 | 10,313,741 | 10,296,006 | 2,381,913 | 1,652,569 | 31,687 | 6,827,706 | 4,054,788 |
fi | 637,527 | 7,364,185 | 10,472,218 | 1,034,362 | 240,662 | 16,105 | 3,793,465 | 2,166,251 |
fo | 17,243 | 1,000,468 | 235,030 | 22,348 | 15,002 | 38 | 49,932 | 30,778 |
fr | 3,241,245 | 18,386,234 | 64,055,280 | 6,549,308 | 1,394,960 | 37,801 | 16,052,506 | 9,190,531 |
fy | 56,325 | 1,397,933 | 1,087,488 | 104,970 | 74,218 | | 92,718 | 74,624 |
ga | 45,636 | 1,574,022 | 475,562 | 64,936 | 19,600 | 680 | 82,344 | 49,163 |
gd | 22,609 | 758,946 | 229,342 | 26,292 | 5,415 | 1,133 | 22,202 | 20,921 |
gl | 184,059 | 3,767,468 | 4,199,110 | 317,526 | 175,457 | 10,860 | 597,615 | 314,287 |
gu | 29,004 | 381,512 | 557,348 | 59,909 | 3,521 | 190 | 406,231 | 375,655 |
he | 360,725 | 4,848,271 | 9,061,515 | 614,735 | 185,868 | 15,205 | 1,287,035 | 757,865 |
hi | 157,515 | 2,088,013 | 1,612,841 | 208,076 | 75,325 | 3 | 483,805 | 278,581 |
hr | 204,330 | 3,720,040 | 4,086,321 | 281,986 | 61,525 | 3,137 | 1,274,116 | 805,129 |
hsb | 15,246 | 684,769 | 191,381 | 21,852 | 14,075 | 6 | 100,225 | 93,021 |
ht | 57,587 | 1,033,648 | 360,968 | 126,744 | 3,804 | 6 | 245,047 | 243,681 |
hu | 577,761 | 8,135,897 | 12,321,943 | 1,114,368 | 202,481 | | 3,844,734 | 2,021,889 |
hy | 519,477 | 4,474,158 | 3,911,694 | 420,750 | 152,196 | 1,141 | 2,656,709 | 2,171,465 |
ia | 23,041 | 939,137 | 135,455 | 17,079 | 10,803 | 11 | 1,334 | 1,139 |
id | 660,719 | 6,611,139 | 8,288,264 | 596,902 | 212,629 | 12,455 | 2,753,661 | 1,543,213 |
io | 30,601 | 1,492,156 | 443,393 | 37,164 | 20,514 | 3 | 96,793 | 89,943 |
is | 65,226 | 1,848,537 | 876,343 | 61,213 | 46,129 | 61 | 171,318 | 128,153 |
it | 1,949,794 | 15,542,736 | 43,346,247 | 1,786,162 | 1,161,845 | 15,681 | 20,207,833 | 13,932,773 |
ja | 1,663,028 | 10,522,965 | 51,368,506 | 4,271,371 | 768,230 | 25,338 | 7,763,985 | 2,530,605 |
jv | 64,481 | 1,580,360 | 834,417 | 61,553 | 54,885 | 532 | 436,879 | 320,303 |
ka | 144,916 | 3,587,693 | 1,994,541 | 208,311 | 114,438 | 2,433 | 1,236,797 | 897,810 |
kk | 267,062 | 4,107,418 | 2,260,768 | 326,357 | 68,748 | 605 | 2,998,438 | 2,836,655 |
kn | 27,449 | 590,437 | 549,420 | 47,789 | 9,503 | 274 | 169,705 | 94,659 |
ko | 670,310 | 7,838,730 | 9,968,592 | 1,420,036 | 777,602 | 16,353 | 2,381,529 | 1,074,711 |
ku | 36,155 | 893,677 | 311,644 | 34,256 | 20,105 | | 78,004 | 64,317 |
ky | 61,403 | 1,071,799 | 337,477 | 86,725 | 5,528 | 19 | 452,740 | 436,223 |
la | 176,958 | 4,474,648 | 2,491,031 | 286,766 | 131,377 | 710 | 380,644 | 307,013 |
lb | 58,701 | 1,982,488 | 1,322,664 | 105,622 | 86,392 | 125 | 172,366 | 145,641 |
li | 43,769 | 578,532 | 367,905 | 17,166 | 7,151 | 1 | 10,789 | 10,527 |
lmo | 48,949 | 1,369,828 | 459,111 | 24,824 | 15,727 | 358 | 466,870 | 430,165 |
lt | 259,153 | 3,766,343 | 4,031,031 | 277,495 | 90,806 | 7,581 | 1,526,071 | 1,093,156 |
lv | 168,190 | 2,991,527 | 1,918,216 | 174,077 | 95,940 | 1,109 | 798,879 | 413,172 |
mg | 124,431 | 1,797,354 | 954,285 | 186,863 | 8,580 | 34,897 | 337,922 | 336,717 |
min | 222,999 | 1,341,378 | 1,753,645 | 219,550 | 3,496 | 41 | 1,876,798 | 1,861,140 |
mk | 128,202 | 3,598,089 | 2,095,968 | 184,338 | 217,204 | 4,304 | 641,009 | 346,036 |
ml | 117,149 | 1,528,680 | 931,001 | 97,310 | 66,791 | 1,213 | 488,500 | 260,402 |
mn | 24,143 | 1,048,103 | 364,438 | 49,038 | 35,692 | 93 | 179,136 | 88,706 |
mr | 82,601 | 1,682,500 | 639,334 | 95,619 | 47,193 | 579 | 350,277 | 165,420 |
mrj | 12,491 | 498,892 | 122,714 | 16,304 | 3,707 | | 568 | 562 |
ms | 332,229 | 5,382,317 | 4,479,346 | 430,808 | 93,476 | 3,757 | 3,659,011 | 2,703,715 |
my | 41,260 | 699,731 | 354,978 | 84,015 | 11,120 | 139 | 460,582 | 417,278 |
mzn | 17,707 | 388,110 | 113,356 | 17,076 | 4,289 | 275 | 94,895 | 77,157 |
nah | 17,088 | 648,422 | 102,130 | 14,371 | 9,360 | 9 | 56,229 | 37,643 |
nap | 15,652 | 804,392 | 185,862 | 38,250 | 3,730 | 24 | 15,386 | 9,569 |
nds | 34,400 | 1,019,548 | 739,428 | 49,845 | 7,998 | | 94,701 | 83,555 |
ne | 36,048 | 776,358 | 294,915 | 72,685 | 6,725 | 267 | 227,119 | 151,236 |
new | 100,918 | 1,193,563 | 811,948 | 77,621 | 93,038 | 40 | 301,295 | 278,422 |
nl | 2,554,610 | 16,134,593 | 31,823,749 | 2,764,083 | 405,781 | 273 | 8,918,883 | 7,058,397 |
nn | 206,657 | 3,652,079 | 2,787,235 | 361,068 | 220,738 | 154 | 1,064,516 | 732,532 |
no | 709,542 | 7,855,478 | 12,087,904 | 1,585,810 | 472,339 | 1 | 2,963,577 | 1,752,582 |
oc | 98,027 | 3,388,435 | 1,809,710 | 111,315 | 56,605 | 290 | 1,491,784 | 1,453,162 |
or | 22,359 | 736,820 | 155,077 | 27,071 | 13,014 | 129 | 125,961 | 83,250 |
os | 20,315 | 841,048 | 103,564 | 14,147 | 19,182 | 57 | 85,549 | 74,792 |
pa | 36,168 | 891,318 | 312,123 | 32,395 | 10,989 | 1 | 301,477 | 186,180 |
pl | 1,575,762 | 13,958,693 | 30,792,002 | 3,244,389 | 639,723 | | 11,769,485 | 7,345,068 |
pms | 66,943 | 1,802,868 | 559,036 | 75,942 | 17,108 | 1,684 | 646,961 | 642,742 |
pnb | 45,336 | 1,366,423 | 561,884 | 102,912 | 12,324 | 30 | 109,729 | 89,559 |
pt | 1,667,327 | 14,071,743 | 26,111,747 | 2,373,020 | 955,512 | 7,566 | 7,273,995 | 4,220,599 |
qu | 39,107 | 1,154,083 | 482,813 | 50,500 | 22,010 | 6 | 63,522 | 48,190 |
ro | 865,444 | 8,232,620 | 8,715,289 | 826,522 | 553,650 | 8,412 | 6,192,337 | 4,450,825 |
ru | 3,033,613 | 15,350,832 | 42,153,229 | 3,526,953 | 1,291,101 | 17,756 | 15,382,287 | 8,985,787 |
sa | 21,614 | 904,789 | 153,809 | 51,314 | 20,169 | 69 | 56,919 | 36,343 |
sah | 15,801 | 712,118 | 134,674 | 20,229 | 16,295 | 69 | 47,695 | 28,315 |
scn | 41,436 | 1,166,819 | 317,603 | 31,203 | 5,130 | 2 | 27,961 | 24,454 |
sco | 52,878 | 2,301,910 | 1,023,709 | 233,832 | 114,366 | 4,470 | 699,297 | 408,021 |
sh | 3,960,865 | 6,545,293 | 11,431,459 | 816,552 | 155,263 | 2,143 | 5,089,465 | 3,821,276 |
si | 24,285 | 463,985 | 188,502 | 18,755 | 13,546 | 556 | 99,709 | 47,463 |
sk | 278,133 | 5,968,648 | 4,528,315 | 392,427 | 247,574 | 104 | 2,370,562 | 1,554,812 |
sl | 217,345 | 3,989,018 | 4,883,529 | 467,838 | 181,334 | | 1,355,263 | 859,228 |
sq | 84,404 | 1,964,847 | 980,879 | 120,438 | 33,587 | 2,985 | 428,661 | 325,805 |
sr | 873,929 | 6,715,446 | 7,183,655 | 636,686 | 135,090 | 1,450 | 2,073,765 | 1,377,521 |
su | 22,545 | 824,097 | 322,313 | 36,203 | 24,601 | 280 | 101,861 | 76,459 |
sv | 5,858,202 | 20,493,578 | 63,821,423 | 8,408,876 | 1,173,139 | 299,191 | 41,295,967 | 38,208,922 |
sw | 51,278 | 1,628,706 | 591,589 | 65,319 | 36,065 | 214 | 151,412 | 123,731 |
ta | 125,998 | 1,927,015 | 1,487,866 | 187,259 | 61,573 | 2,703 | 800,562 | 452,151 |
te | 90,316 | 800,772 | 1,140,325 | 102,931 | 31,923 | 548 | 1,118,134 | 994,441 |
tg | 82,736 | 1,177,613 | 532,675 | 121,149 | 32,152 | 6,517 | 424,596 | 408,041 |
th | 244,223 | 3,508,792 | 3,246,722 | 301,716 | 166,312 | 3,846 | 1,321,855 | 566,476 |
tl | 164,847 | 2,160,417 | 772,135 | 88,863 | 42,157 | 969 | 236,413 | 139,259 |
tr | 521,200 | 7,048,194 | 7,850,843 | 1,041,967 | 700,464 | 8,897 | 2,991,863 | 1,857,031 |
tt | 123,267 | 1,789,671 | 937,155 | 118,971 | 56,252 | 869 | 668,986 | 598,652 |
uk | 1,049,249 | 10,867,075 | 16,777,605 | 1,751,135 | 542,868 | 55,597 | 7,691,426 | 4,292,580 |
ur | 299,824 | 4,365,211 | 1,802,815 | 227,783 | 827,800 | 1,326 | 2,049,929 | 1,582,299 |
uz | 444,074 | 3,499,627 | 1,061,031 | 105,056 | 40,227 | 45 | 2,418,037 | 2,371,000 |
vec | 17,446 | 812,239 | 174,803 | 12,428 | 10,141 | 43 | 115,966 | 97,023 |
vi | 1,340,313 | 11,542,962 | 15,831,995 | 2,185,821 | 659,046 | 5,895 | 14,322,161 | 11,726,191 |
vo | 242,292 | 2,983,432 | 971,438 | 326,248 | 4,556 | 237 | 543,840 | 543,529 |
wa | 28,779 | 372,075 | 203,900 | 9,443 | 4,530 | 1 | 148 | 148 |
war | 2,094,871 | 10,387,067 | 16,711,018 | 2,084,543 | 445,305 | 49 | 13,056,238 | 13,042,101 |
yi | 24,284 | 656,132 | 232,315 | 20,089 | 7,215 | 908 | 25,925 | 22,087 |
yo | 41,250 | 1,274,425 | 194,529 | 32,508 | 9,965 | 192 | 121,360 | 70,414 |
zh | 1,620,943 | 12,090,949 | 23,162,842 | 2,220,362 | 956,624 | 76,187 | 8,780,134 | 3,868,731 |
lang | labels | categories | infobox | top 100 en infobox |
---|---|---|---|---|
en-af | 35,309 | 51,876 | 391,625 | 35,201 |
en-als | 17,468 | 23,019 | 182,854 | 10,734 |
en-am | 6,600 | 3,012 | 18,094 | 2,509 |
en-an | 26,123 | 39,091 | 310,031 | 9,999 |
en-ar | 300,446 | 1,449,091 | 3,221,789 | 551,376 |
en-arz | 10,387 | 21,663 | 40,954 | 8,665 |
en-ast | 36,925 | 52,425 | 259,835 | 37,846 |
en-az | 69,856 | 83,202 | 514,522 | 40,962 |
en-azb | 4,243 | 791 | 34,889 | 9,268 |
en-ba | 9,025 | 13,600 | 165,147 | 2,014 |
en-bar | 16,816 | 19,160 | 128,336 | 7,623 |
en-be | 70,479 | 123,662 | 606,574 | 38,522 |
en-bg | 162,125 | 291,602 | 961,885 | 132,417 |
en-bn | 38,017 | 68,515 | 554,805 | 181,305 |
en-bpy | 19,066 | 14,251 | 194,010 | 30,096 |
en-br | 50,395 | 71,121 | 140,924 | 16,526 |
en-bs | 42,104 | 71,279 | 561,564 | 35,786 |
en-bug | 14,092 | 14,054 | 1,854 | 1,571 |
en-ca | 355,188 | 480,650 | 3,850,141 | 224,554 |
en-ce | 7,709 | 2,556 | 108,573 | 920 |
en-ceb | 349,264 | 144,447 | 4,642,160 | 919,702 |
en-ckb | 15,714 | 39,415 | 99,844 | 12,194 |
en-cs | 240,361 | 684,538 | 2,409,585 | 362,289 |
en-cv | 14,174 | 7,806 | 108,781 | 859 |
en-cy | 59,641 | 108,101 | 3,755,446 | 1,755,178 |
en-da | 149,242 | 299,896 | 1,197,952 | 135,317 |
en-de | 1,018,321 | 2,874,849 | 6,715,795 | 888,788 |
en-el | 88,852 | 198,997 | 407,741 | 65,627 |
en-eo | 175,195 | 317,953 | 1,507,094 | 8,030 |
en-es | 812,331 | 1,803,738 | 7,582,341 | 492,850 |
en-et | 89,766 | 150,217 | 349,470 | 16,011 |
en-eu | 203,312 | 284,810 | 1,818,402 | 26,922 |
en-fa | 408,741 | 1,998,298 | 5,727,420 | 1,179,780 |
en-fi | 282,776 | 640,176 | 2,839,015 | 131,360 |
en-fo | 10,618 | 16,725 | 40,026 | 8,204 |
en-fr | 1,164,577 | 2,716,241 | 10,395,149 | 576,593 |
en-fy | 22,984 | 53,636 | 71,920 | 483 |
en-ga | 35,742 | 41,172 | 65,875 | 1,035 |
en-gd | 12,701 | 12,508 | 16,845 | 692 |
en-gl | 82,424 | 173,203 | 457,845 | 37,497 |
en-gu | 6,733 | 9,365 | 47,133 | 10,309 |
en-he | 149,652 | 370,751 | 1,027,194 | 28,781 |
en-hi | 50,171 | 56,192 | 263,077 | 75,607 |
en-hr | 106,899 | 165,019 | 849,750 | 48,572 |
en-hsb | 7,829 | 9,264 | 66,803 | 92 |
en-ht | 31,848 | 43,396 | 138,019 | 38 |
en-hu | 286,766 | 588,257 | 2,811,827 | 342,155 |
en-hy | 104,196 | 123,199 | 1,394,821 | 31,961 |
en-ia | 16,091 | 14,839 | 999 | 107 |
en-id | 193,462 | 252,084 | 1,532,013 | 567,642 |
en-io | 26,688 | 24,460 | 90,873 | 385 |
en-is | 28,061 | 29,340 | 96,633 | 9,349 |
en-it | 883,922 | 880,054 | 12,916,139 | 1,537,143 |
en-ja | 459,616 | 1,511,162 | 4,333,202 | 619,562 |
en-jv | 22,049 | 16,890 | 179,467 | 37,269 |
en-ka | 78,075 | 129,477 | 920,729 | 79,745 |
en-kk | 93,549 | 88,637 | 2,279,074 | 16,080 |
en-kn | 10,832 | 6,042 | 92,542 | 24,490 |
en-ko | 230,856 | 819,163 | 1,459,320 | 98,693 |
en-ku | 10,503 | 12,675 | 42,037 | 2,087 |
en-ky | 24,137 | 9,117 | 405,913 | 2,060 |
en-la | 115,149 | 200,265 | 350,765 | 19,563 |
en-lb | 30,999 | 52,757 | 97,523 | 173 |
en-li | 8,767 | 9,040 | 8,879 | 19 |
en-lmo | 30,889 | 8,933 | 424,711 | 8,083 |
en-lt | 101,246 | 124,981 | 985,057 | 101,212 |
en-lv | 52,420 | 107,290 | 552,626 | 124,933 |
en-mg | 72,736 | 74,857 | 286,908 | 195 |
en-min | 23,390 | 18,292 | 83,093 | 30,548 |
en-mk | 58,003 | 100,441 | 434,366 | 107,648 |
en-ml | 33,454 | 40,961 | 321,829 | 93,694 |
en-mn | 13,880 | 21,071 | 142,688 | 23,324 |
en-mr | 31,062 | 36,001 | 274,791 | 24,084 |
en-mrj | 9,354 | 12,689 | 409 | 38 |
en-ms | 201,373 | 186,593 | 2,357,856 | 837,640 |
en-my | 8,926 | 10,060 | 66,039 | 15,793 |
en-mzn | 10,006 | 5,551 | 82,966 | 8,111 |
en-nah | 8,541 | 2,837 | 37,792 | 1,213 |
en-nap | 13,403 | 23,478 | 11,345 | 39 |
en-nds | 19,696 | 32,042 | 71,207 | 674 |
en-ne | 16,382 | 16,158 | 158,432 | 61,844 |
en-new | 21,908 | 3,078 | 138,415 | 47,817 |
en-nl | 781,381 | 1,164,505 | 5,126,717 | 835,924 |
en-nn | 104,700 | 224,615 | 912,389 | 55,215 |
en-no | 307,842 | 894,128 | 1,976,148 | 75,057 |
en-oc | 78,455 | 70,688 | 1,413,692 | 1,443 |
en-or | 9,177 | 11,895 | 89,095 | 33,645 |
en-os | 7,122 | 7,300 | 60,487 | 1,461 |
en-pa | 17,558 | 8,507 | 157,912 | 42,070 |
en-pl | 751,845 | 1,504,459 | 7,917,683 | 472,984 |
en-pms | 57,198 | 55,891 | 612,319 | 96,133 |
en-pnb | 34,231 | 13,071 | 38,643 | 12,523 |
en-pt | 660,417 | 1,415,211 | 5,514,854 | 558,741 |
en-qu | 16,173 | 28,204 | 45,351 | 3,550 |
en-ro | 248,287 | 451,192 | 3,590,140 | 683,828 |
en-ru | 708,594 | 1,597,861 | 8,982,662 | 1,100,867 |
en-sa | 8,014 | 17,629 | 41,573 | 13,506 |
en-sah | 7,031 | 4,815 | 35,263 | 7,780 |
en-scn | 20,984 | 21,571 | 22,583 | 426 |
en-sco | 40,440 | 102,229 | 596,071 | 207,142 |
en-sh | 209,987 | 302,447 | 2,293,914 | 190,468 |
en-si | 8,113 | 1,770 | 61,447 | 17,752 |
en-sk | 150,113 | 225,967 | 1,879,808 | 100,693 |
en-sl | 98,948 | 236,504 | 1,096,815 | 421,950 |
en-sq | 39,881 | 50,179 | 261,983 | 22,605 |
en-sr | 230,941 | 372,975 | 1,535,586 | 266,585 |
en-su | 9,407 | 4,942 | 39,573 | 7,045 |
en-sv | 706,492 | 1,505,243 | 6,617,536 | 676,907 |
en-sw | 27,096 | 37,589 | 124,131 | 50,792 |
en-ta | 57,327 | 87,782 | 571,893 | 130,296 |
en-te | 17,156 | 17,000 | 173,949 | 48,393 |
en-tg | 30,285 | 21,488 | 196,643 | 4,512 |
en-th | 77,688 | 162,113 | 860,446 | 236,997 |
en-tl | 51,964 | 46,678 | 193,284 | 49,592 |
en-tr | 192,817 | 651,953 | 1,923,748 | 381,663 |
en-tt | 16,207 | 16,943 | 115,352 | 1,187 |
en-uk | 372,708 | 813,905 | 4,865,331 | 399,708 |
en-ur | 91,216 | 178,833 | 1,750,242 | 691,392 |
en-uz | 88,037 | 58,237 | 2,214,101 | 13,286 |
en-vec | 9,770 | 6,777 | 103,423 | 176 |
en-vi | 427,099 | 751,492 | 6,021,715 | 1,657,255 |
en-vo | 116,967 | 259,396 | 523,047 | 87 |
en-wa | 4,799 | 3,801 | 117 | 25 |
en-war | 270,261 | 157,361 | 1,536,543 | 433,060 |
en-yi | 9,393 | 12,299 | 20,098 | 653 |
en-yo | 29,244 | 27,016 | 92,121 | 28,781 |
en-zh | 497,704 | 834,988 | 5,521,741 | 1,587,672 |