jqhoogland / wiktionary-api

Python client to read & parse wiktionary entries & their semantic relations.

Wiktionary API

This is an unofficial Wiktionary API client written in Python. It is highly unstable: expect many breaking changes.

I'm actually focusing on building out a TypeScript client instead, as part of Open Dictionary. If you want me to put more work into this, let me know. For now, it has fulfilled its role as a sandbox.

❓ How it works

Wiktionary articles contain lots of internal structure in the form of templates. With a bit of manual pruning, we convert these templates into Wiktionary-agnostic "semantic triples". (These use a new ontology for natural language.)

(Note: they're not perfect triples yet.)

Then, with SPARQL and tools built on top of it (GraphQL-LD), we can query the data consistently.

In fact, there's a bit more going on: we also look at context, such as section headings, and crawl standard wikilinks and other non-templated information.
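As a rough illustration of the template-to-triple step, here is a minimal sketch. It uses a plain regular expression from the standard library instead of wikitextparser, only handles non-nested templates, and the `(subject, predicate, object)` layout and the name `templates_to_triples` are assumptions made for this example, not the library's actual ontology or API.

```python
import re

# Matches a non-nested template such as {{m|en|FUBAR}}.
TEMPLATE_RE = re.compile(r"\{\{([^{}]+)\}\}")


def templates_to_triples(word, wikitext):
    """Turn each simple template into a (subject, predicate, object) tuple.

    Hypothetical helper: the predicate is the template name and the object
    is the list of raw template arguments.
    """
    triples = []
    for match in TEMPLATE_RE.finditer(wikitext):
        name, *args = [part.strip() for part in match.group(1).split("|")]
        triples.append((word, name, args))
    return triples


sample = "Possibly related to {{m|en|FUBAR}}. See also {{l|en|foobar}}."
for triple in templates_to_triples("foo", sample):
    print(triple)
```

A real implementation would also need to handle nested templates and named arguments, which is where a proper wikitext parser earns its keep.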

The Problem

There is an actual Wiktionary API.

Unfortunately, it only returns pages in HTML or wikitext (the markup language used by MediaWiki sites such as Wikipedia), neither of which is friendly for computers to read.

Fortunately, wikitext has a decent bit of internal structure and is not impossible to work with.
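For reference, the raw wikitext of a page can be requested from the MediaWiki Action API roughly as follows. This is a sketch using only the standard library; the function names and the `User-Agent` value are placeholders chosen for this example and are not part of this client.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_parse_url(page, lang="en"):
    """Build a MediaWiki Action API URL that asks for a page's raw wikitext."""
    query = urlencode(
        {"action": "parse", "page": page, "prop": "wikitext", "format": "json"}
    )
    return f"https://{lang}.wiktionary.org/w/api.php?{query}"


def extract_wikitext(payload):
    """Pull the wikitext string out of a decoded action=parse response."""
    wikitext = payload["parse"]["wikitext"]
    # The default response format wraps the text as {"*": "..."}.
    return wikitext["*"] if isinstance(wikitext, dict) else wikitext


def fetch_wikitext(page, lang="en"):
    """Download the wikitext of one page (requires network access)."""
    # Placeholder User-Agent: replace it with something identifying your tool.
    request = Request(
        build_parse_url(page, lang),
        headers={"User-Agent": "wiktionary-example/0.1"},
    )
    with urlopen(request) as response:
        return extract_wikitext(json.load(response))


print(build_parse_url("foo"))
```

Calling `fetch_wikitext("foo")` would then return the page source as one long string, which still has to be parsed.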

This library provides a client that uses wikitextparser (plus some custom logic) to convert wikitext into straightforward JSON objects (read: Python dictionaries) that you can use at your leisure.

It's much more comprehensive and accurate than WiktionaryParser since it works with the raw wikitext rather than the generated HTML.
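Section headings are part of the structure that makes raw wikitext workable. The sketch below splits a snippet of wikitext into sections using only the standard library's `re` module; `split_sections` and the shape of its output are hypothetical and only meant to show the idea, not how this client does it internally.

```python
import json
import re

# A heading line looks like ==English== or ===Etymology 1===.
HEADING_RE = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")


def split_sections(wikitext):
    """Return a flat list of {"level", "title", "text"} dicts, one per heading."""
    sections = []
    current = None
    for line in wikitext.splitlines():
        match = HEADING_RE.match(line)
        if match:
            current = {
                "level": len(match.group(1)),
                "title": match.group(2),
                "text": "",
            }
            sections.append(current)
        elif current is not None:
            # Body lines belong to the most recent heading.
            current["text"] += line + "\n"
    return sections


sample = (
    "==English==\n"
    "===Etymology 1===\n"
    "From {{m|en|FUBAR}}.\n"
    "===Noun===\n"
    "{{en-noun}}\n"
)
print(json.dumps(split_sections(sample), indent=2))
```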

One Wiktionary to rule them all.

There's one problem: Wiktionary is not one thing.

No, there are 183 different Wiktionaries in 183 different languages. Pretty much every single Wiktionary has its own, non-interoperable standards, so you have to build unique parsing logic for each one.

(Each is identified by an ISO 639 language code used as a subdomain, as in en.wiktionary.org.)

So... we're starting with the English-language Wiktionary. Depending on how that goes, we might move on to additional Wiktionaries.

Structure

The current return structure is as follows:

{
  word: string,
  lang: string, // ISO language code
  altForms: LinkedWord[],
  etymology: LinkedWord[],
  pronunciations: (Pronunciation | Qualifier)[][],
  definitions: LinkedWord   
}[]

There is an entry for each unique etymology.
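For Python users, the same shape could be written down as a `TypedDict`. This is only a sketch: the README does not define `LinkedWord`, `Pronunciation`, or `Qualifier`, so they are left as loosely typed placeholder aliases here, and the class name `Entry` is an assumption.

```python
from typing import Any, Dict, List, TypedDict

# Placeholder aliases: these shapes are not spelled out in the README,
# so they are kept as plain dictionaries.
LinkedWord = Dict[str, Any]
PronunciationItem = Dict[str, Any]  # stands in for Pronunciation | Qualifier


class Entry(TypedDict):
    word: str
    lang: str  # ISO language code
    altForms: List[LinkedWord]
    etymology: List[LinkedWord]
    pronunciations: List[List[PronunciationItem]]
    definitions: Any  # declared as LinkedWord in the structure above


# The client returns a list of such entries.
entry: Entry = {
    "word": "foo",
    "lang": "en",
    "altForms": [],
    "etymology": [{"@id": "link", "lang": "en", "src": "foobar"}],
    "pronunciations": [],
    "definitions": [],
}
entries: List[Entry] = [entry]
print(entries[0]["word"], entries[0]["lang"])
```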

Example

Using the CLI to retrieve the English entries for "foo":

python wiktionary/cli.py foo en

Returns the following object, a list of entries (one for each unique etymology):

[
  {
    "word": "foo",
    "lang": "en",
    "altForms": [],
    "etymology": [
      {
        "@id": "derived",
        "lang": "en",
        "srcLang": "cmn",
        "src": "",
        "transliteration": ""
      },
      {
        "@id": "label",
        "lang": "en",
        "2": "historical",
        "3": "obsolete"
      }
    ],
    "pronunciations": [
      [
        {
          "@id": "dialect",
          "dialects": [
            "UK"
          ]
        },
        {
          "@id": "ipa",
          "lang": "en",
          "pronunciations": [
            {
              "ipa": "/fuː/"
            }
          ]
        }
      ],
      [
        {
          "@id": "En-au-foo.ogg",
          "lang": "en",
          "url": "Audio (AU)"
        }
      ],
      [
        {
          "@id": "rhymes",
          "lang": "en",
          "s": "1",
          "rhymes": [
            {
              "rhyme": ""
            }
          ]
        }
      ],
      [
        {
          "@id": "homophones",
          "lang": "en",
          "rhymes": [
            {
              "homophone": "-fu"
            }
          ]
        }
      ]
    ],
    "glyphOrigin": null,
    "description": null,
    "definitions": [
      {
        "@id": "references",
        "linked": [],
        "category": null,
        "data": "<references/>\n* [[rfc:3092]], ''Etymology of \"Foo\"'', {{w|Internet Engineering Task Force}} (IETF)\n\n"
      },
      {
        "@id": "anagrams",
        "linked": [
          {
            "@id": "anagram",
            "lang": "en",
            "alphagram": "foo",
            "anagrams": [
              "oof"
            ]
          }
        ]
      },
      {
        "@id": "noun",
        "data": []
      }
    ]
  },
  {
    "word": "foo",
    "lang": "en",
    "altForms": [],
    "etymology": [
      {
        "@id": "derived",
        "lang": "en",
        "srcLang": "zh",
        "src": "",
        "alt": "",
        "gloss": "[[fortunate]]; [[prosperity]], [[good]] [[luck]]",
        "transliteration": ""
      },
      {
        "@id": "mention",
        "lang": "zh",
        "src": "福星",
        "alt": "",
        "gloss": "[[Jupiter]]",
        "transliteration": "Fúxīng"
      },
      {
        "@id": "mention",
        "lang": "en",
        "src": "om mani padme hum"
      },
      {
        "@id": "mention",
        "lang": "en",
        "src": "FUBAR"
      },
      {
        "@id": "label",
        "lang": "en",
        "2": "programming"
      },
      {
        "@id": "label",
        "lang": "en",
        "2": "fandom slang"
      },
      {
        "@id": "link",
        "lang": "en",
        "src": "foobar"
      },
      {
        "@id": "link",
        "lang": "en",
        "src": "FUBAR"
      }
    ],
    "pronunciations": [
      [
        {
          "@id": "dialect",
          "dialects": [
            "UK"
          ]
        },
        {
          "@id": "ipa",
          "lang": "en",
          "pronunciations": [
            {
              "ipa": "/fuː/"
            }
          ]
        }
      ],
      [
        {
          "@id": "En-au-foo.ogg",
          "lang": "en",
          "url": "Audio (AU)"
        }
      ],
      [
        {
          "@id": "rhymes",
          "lang": "en",
          "s": "1",
          "rhymes": [
            {
              "rhyme": ""
            }
          ]
        }
      ],
      [
        {
          "@id": "homophones",
          "lang": "en",
          "rhymes": [
            {
              "homophone": "-fu"
            }
          ]
        }
      ]
    ],
    "glyphOrigin": null,
    "description": null,
    "definitions": [
      {
        "@id": "references",
        "linked": [],
        "category": null,
        "data": "<references/>\n* [[rfc:3092]], ''Etymology of \"Foo\"'', {{w|Internet Engineering Task Force}} (IETF)\n\n"
      },
      {
        "@id": "anagrams",
        "linked": [
          {
            "@id": "anagram",
            "lang": "en",
            "alphagram": "foo",
            "anagrams": [
              "oof"
            ]
          }
        ]
      },
      {
        "@id": "noun",
        "data": [
          {
            "@id": "derivedTerms",
            "linked": [
              {
                "@id": "link",
                "lang": "en",
                "src": "foobar"
              }
            ]
          },
          {
            "@id": "relatedTerms",
            "linked": [
              {
                "@id": "link",
                "lang": "en",
                "src": "FUBAR"
              }
            ]
          }
        ]
      },
      {
        "@id": "derivedTerms",
        "linked": [
          {
            "@id": "link",
            "lang": "en",
            "src": "foobar"
          }
        ]
      },
      {
        "@id": "relatedTerms",
        "linked": [
          {
            "@id": "link",
            "lang": "en",
            "src": "FUBAR"
          }
        ]
      }
    ]
  },
  {
    "word": "foo",
    "lang": "en",
    "altForms": [],
    "etymology": [
      {
        "@id": "mention",
        "lang": "en",
        "src": "fuck"
      },
      {
        "@id": "sense",
        "sense": "expression of disgust"
      },
      {
        "@id": "link",
        "lang": "en",
        "src": "darn"
      },
      {
        "@id": "link",
        "lang": "en",
        "src": "drat"
      }
    ],
    "pronunciations": [
      [
        {
          "@id": "dialect",
          "dialects": [
            "UK"
          ]
        },
        {
          "@id": "ipa",
          "lang": "en",
          "pronunciations": [
            {
              "ipa": "/fuː/"
            }
          ]
        }
      ],
      [
        {
          "@id": "En-au-foo.ogg",
          "lang": "en",
          "url": "Audio (AU)"
        }
      ],
      [
        {
          "@id": "rhymes",
          "lang": "en",
          "s": "1",
          "rhymes": [
            {
              "rhyme": ""
            }
          ]
        }
      ],
      [
        {
          "@id": "homophones",
          "lang": "en",
          "rhymes": [
            {
              "homophone": "-fu"
            }
          ]
        }
      ]
    ],
    "glyphOrigin": null,
    "description": null,
    "definitions": [
      {
        "@id": "references",
        "linked": [],
        "category": null,
        "data": "<references/>\n* [[rfc:3092]], ''Etymology of \"Foo\"'', {{w|Internet Engineering Task Force}} (IETF)\n\n"
      },
      {
        "@id": "anagrams",
        "linked": [
          {
            "@id": "anagram",
            "lang": "en",
            "alphagram": "foo",
            "anagrams": [
              "oof"
            ]
          }
        ]
      },
      {
        "@id": "interjection",
        "data": [
          {
            "@id": "synonyms",
            "linked": [
              {
                "@id": "sense",
                "sense": "expression of disgust"
              },
              {
                "@id": "link",
                "lang": "en",
                "src": "darn"
              },
              {
                "@id": "link",
                "lang": "en",
                "src": "drat"
              }
            ]
          }
        ]
      },
      {
        "@id": "synonyms",
        "linked": [
          {
            "@id": "sense",
            "sense": "expression of disgust"
          },
          {
            "@id": "link",
            "lang": "en",
            "src": "darn"
          },
          {
            "@id": "link",
            "lang": "en",
            "src": "drat"
          }
        ]
      }
    ]
  },
  {
    "word": "foo",
    "lang": "en",
    "altForms": [
      {
        "word": {
          "@id": "link",
          "lang": "en",
          "src": "foo'"
        },
        "qualifiers": []
      }
    ],
    "etymology": [
      {
        "@id": "link",
        "lang": "en",
        "src": "foo'"
      },
      {
        "@id": "label",
        "lang": "en",
        "2": "slang"
      }
    ],
    "pronunciations": [
      [
        {
          "@id": "dialect",
          "dialects": [
            "UK"
          ]
        },
        {
          "@id": "ipa",
          "lang": "en",
          "pronunciations": [
            {
              "ipa": "/fuː/"
            }
          ]
        }
      ],
      [
        {
          "@id": "En-au-foo.ogg",
          "lang": "en",
          "url": "Audio (AU)"
        }
      ],
      [
        {
          "@id": "rhymes",
          "lang": "en",
          "s": "1",
          "rhymes": [
            {
              "rhyme": ""
            }
          ]
        }
      ],
      [
        {
          "@id": "homophones",
          "lang": "en",
          "rhymes": [
            {
              "homophone": "-fu"
            }
          ]
        }
      ]
    ],
    "glyphOrigin": null,
    "description": null,
    "definitions": [
      {
        "@id": "references",
        "linked": [],
        "category": null,
        "data": "<references/>\n* [[rfc:3092]], ''Etymology of \"Foo\"'', {{w|Internet Engineering Task Force}} (IETF)\n\n"
      },
      {
        "@id": "anagrams",
        "linked": [
          {
            "@id": "anagram",
            "lang": "en",
            "alphagram": "foo",
            "anagrams": [
              "oof"
            ]
          }
        ]
      },
      {
        "@id": "noun",
        "data": []
      }
    ]
  }
]
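Once the output is loaded as Python dictionaries, it can be walked like any other nested structure. The sketch below collects the `src` of every etymology item with a given `@id`, one list per entry. The helper name `linked_sources` is hypothetical, and the input is a trimmed-down excerpt of the output above rather than a live call to the client.

```python
def linked_sources(entries, kind="link"):
    """For each entry, list the "src" of etymology items whose "@id" is `kind`."""
    return [
        [
            item["src"]
            for item in entry.get("etymology", [])
            if item.get("@id") == kind and item.get("src")
        ]
        for entry in entries
    ]


# Trimmed-down excerpt of the second and fourth entries shown above.
entries = [
    {
        "word": "foo",
        "lang": "en",
        "etymology": [
            {"@id": "mention", "lang": "en", "src": "FUBAR"},
            {"@id": "label", "lang": "en", "2": "programming"},
            {"@id": "link", "lang": "en", "src": "foobar"},
            {"@id": "link", "lang": "en", "src": "FUBAR"},
        ],
    },
    {
        "word": "foo",
        "lang": "en",
        "etymology": [
            {"@id": "link", "lang": "en", "src": "foo'"},
            {"@id": "label", "lang": "en", "2": "slang"},
        ],
    },
]

print(linked_sources(entries))
print(linked_sources(entries, kind="mention"))
```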

About

License: GNU General Public License v3.0


Languages

Language: Python 100.0%