jojolebarjos / wiktionary-phoneme

Extract phonemes and words from Wiktionary dumps

Wiktionary IPA Dataset

First, download the desired Wiktionary dump from this repository. For instance, here are the latest links for some languages:

Then, run the following command to extract the IPA data:

python -m extract frwiktionary-latest-pages-articles.xml.bz2 fr.tsv

Note that you can pass -r to disable the cleaning step and keep all detected entries:

python -m extract -r frwiktionary-latest-pages-articles.xml.bz2 fr.raw.tsv
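
To see what the cleaning step removes, the cleaned and raw outputs can be compared. The following is only a minimal sketch with pandas: it assumes both files share the columns `text`, `pronunciation` and `language`, and it uses small made-up in-memory samples in place of `fr.raw.tsv` and `fr.tsv`. The helper name `dropped_by_cleaning` is hypothetical and not part of the extractor.

```python
import io

import pandas as pd

# Made-up stand-ins for fr.raw.tsv and fr.tsv (assumed column layout).
# What the real cleaning step drops is not shown here.
RAW_TSV = (
    "text\tpronunciation\tlanguage\n"
    "chat\tʃa\tfr\n"
    "chat\t\tfr\n"
    "chien\tʃjɛ̃\tfr\n"
)
CLEAN_TSV = (
    "text\tpronunciation\tlanguage\n"
    "chat\tʃa\tfr\n"
    "chien\tʃjɛ̃\tfr\n"
)


def dropped_by_cleaning(raw, clean):
    """Return the rows of `raw` that have no exact match in `clean`."""
    # A left merge on all shared columns; the indicator column records
    # whether each raw row was also found in the cleaned table.
    merged = raw.merge(clean.drop_duplicates(), how="left", indicator=True)
    only_raw = merged[merged["_merge"] == "left_only"]
    return only_raw.drop(columns="_merge")


raw = pd.read_csv(io.StringIO(RAW_TSV), sep="\t", na_filter=False)
clean = pd.read_csv(io.StringIO(CLEAN_TSV), sep="\t", na_filter=False)
dropped = dropped_by_cleaning(raw, clean)
print(dropped)
```

With real data, replace the `io.StringIO(...)` arguments with the two file paths.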

The output is easily loaded and processed using pandas:

import pandas as pd

df = pd.read_csv("fr.tsv", sep="\t", na_filter=False)

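# Sort rows and remove exact duplicates to get a stable, diff-friendly file
df = df.sort_values(["text", "pronunciation", "language"])
df = df.drop_duplicates()
# `lineterminator` requires pandas >= 1.5; older versions spell it `line_terminator`
df.to_csv("fr.sorted.tsv", index=False, sep="\t", encoding="utf-8", lineterminator="\n")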
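
Once loaded, the table can be turned into a simple pronunciation lookup. The sketch below is one possible approach, not part of the repository: it assumes the `text`, `pronunciation` and `language` columns used in the sort above, reads a tiny in-memory sample instead of `fr.tsv`, and the helper name `build_lookup` is hypothetical.

```python
import io

import pandas as pd

# Small illustrative sample standing in for fr.tsv (assumed column layout).
SAMPLE_TSV = (
    "text\tpronunciation\tlanguage\n"
    "chat\tʃa\tfr\n"
    "fils\tfis\tfr\n"
    "fils\tfil\tfr\n"
)


def build_lookup(df):
    """Map each word to the sorted list of its distinct pronunciations."""
    grouped = df.drop_duplicates().groupby("text")["pronunciation"]
    # Iterating a grouped Series yields (word, Series of pronunciations).
    return {word: sorted(set(values)) for word, values in grouped}


df = pd.read_csv(io.StringIO(SAMPLE_TSV), sep="\t", na_filter=False)
lookup = build_lookup(df)
print(lookup["fils"])  # ['fil', 'fis']
```

Keeping a list per word preserves entries with several pronunciations, which a plain word-to-string dictionary would silently overwrite.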

License: The Unlicense

