battletrout / arangodb-wordnet-english

Creating an ArangoDB property graph of the Open English WordNet

Make a property graph of Open English WordNet in ArangoDB

This repo builds on the Open English WordNet repository at https://github.com/globalwordnet/english-wordnet by adding scripts that create a WordNet graph in ArangoDB. The scripts are located in the scripts/arango directory and need only the wn.xml file and a running ArangoDB instance; the rest of the repository is what produces the wn.xml file.

How to use

  1. Initialize an ArangoDB database (the Docker image at https://hub.docker.com/_/arangodb is recommended). This was done with v3.10.
  2. Run the two scripts from the Open English WordNet team listed in the Usage section below to create the 'wn.xml' file, which contains the entire wordnet.
  3. Create a user in your ArangoDB instance with read/write access (the default connection string uses user:"wordnet_user" pw:"").
  4. Edit your credentials in arango_connect.py.
  5. Import create_wn_graph_arango.py and run it (this took about 40 minutes on my machine).
  6. Open the ArangoDB web GUI and create a graph from the resulting collections.
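
For orientation, the kind of transformation performed in step 5 can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the code in create_wn_graph_arango.py: the collection names (synsets, senses), the function name lmf_to_documents, the "in_synset" edge type, and the inline sample with its made-up IDs are all hypothetical. It reads GWN-LMF elements (LexicalEntry, Lemma, Sense, Synset, SynsetRelation) and builds dictionaries shaped like ArangoDB documents, with _key on nodes and _from/_to on edges.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in GWN-LMF shape; the IDs are made up for illustration.
SAMPLE = """<LexicalResource>
  <Lexicon id="ex" label="Example" language="en">
    <LexicalEntry id="ex-dog-n">
      <Lemma writtenForm="dog" partOfSpeech="n"/>
      <Sense id="ex-dog-n-1" synset="ex-0001-n"/>
    </LexicalEntry>
    <Synset id="ex-0001-n" partOfSpeech="n">
      <Definition>a domesticated canid</Definition>
      <SynsetRelation relType="hypernym" target="ex-0002-n"/>
    </Synset>
    <Synset id="ex-0002-n" partOfSpeech="n">
      <Definition>a carnivorous mammal</Definition>
    </Synset>
  </Lexicon>
</LexicalResource>"""


def lmf_to_documents(xml_text):
    """Return (synset_docs, sense_docs, edge_docs) built from GWN-LMF XML text."""
    root = ET.fromstring(xml_text)
    synsets, senses, edges = [], [], []

    for entry in root.iter("LexicalEntry"):
        lemma = entry.find("Lemma")
        for sense in entry.findall("Sense"):
            senses.append({
                "_key": sense.get("id"),
                "id": sense.get("id"),
                "lemma": lemma.get("writtenForm"),
                "pos": lemma.get("partOfSpeech"),
            })
            # Edge from the sense node to the synset it belongs to
            # ("in_synset" is an assumed label, not a WordNet relation name).
            edges.append({
                "_from": "senses/" + sense.get("id"),
                "_to": "synsets/" + sense.get("synset"),
                "type": "in_synset",
            })

    for synset in root.iter("Synset"):
        synsets.append({
            "_key": synset.get("id"),
            "id": synset.get("id"),
            "pos": synset.get("partOfSpeech"),
            "definition": synset.findtext("Definition", default=""),
        })
        # One edge per synset-to-synset relation, typed by relType.
        for rel in synset.findall("SynsetRelation"):
            edges.append({
                "_from": "synsets/" + synset.get("id"),
                "_to": "synsets/" + rel.get("target"),
                "type": rel.get("relType"),
            })

    return synsets, senses, edges


synset_docs, sense_docs, edge_docs = lmf_to_documents(SAMPLE)
print(len(synset_docs), len(sense_docs), len(edge_docs))
```

The resulting lists could then be loaded with whichever ArangoDB client you use. The real wn.xml is large, so a streaming parser such as xml.etree.ElementTree.iterparse may be a better fit than parsing the whole text at once.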

Deviations from WordNet in the resulting ArangoDB:

  1. SenseIDs containing characters that ArangoDB does not allow in document keys are approximated with allowed characters (so far, ('`',"'") and ('ñ','n'), i.e. a backtick becomes an apostrophe and 'ñ' becomes 'n'). The unaltered SenseIDs are stored in the "id" attribute of the node. Per the ArangoDB documentation at https://www.arangodb.com/docs/stable/data-modeling-naming-conventions-document-keys.html, document keys: ''' must consist of the letters a-z (lower or upper case), the digits 0-9 or any of the following punctuation characters: _ - : . @ ( ) + , = ; $ ! * ' % '''
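
The substitution described above can be sketched as follows. This is an illustrative sketch, not the repo's actual implementation: the names REPLACEMENTS, VALID_KEY, to_arango_key and make_node are hypothetical, and the sample ID is made up. The replacement pairs and the allowed character set are taken from the item above.

```python
import re

# Replacement pairs named in the README; extend as further cases turn up.
REPLACEMENTS = [("`", "'"), ("ñ", "n")]

# Characters listed as valid in a document key in the passage quoted above.
VALID_KEY = re.compile(r"[A-Za-z0-9_\-:.@()+,=;$!*'%]+")


def to_arango_key(sense_id):
    """Replace known disallowed characters; raise ValueError if the result
    still contains a character outside the allowed set."""
    key = sense_id
    for bad, good in REPLACEMENTS:
        key = key.replace(bad, good)
    if not VALID_KEY.fullmatch(key):
        raise ValueError(f"still not a valid ArangoDB key: {key!r}")
    return key


def make_node(sense_id):
    """Build a node document that keeps the unaltered ID next to the key."""
    return {"_key": to_arango_key(sense_id), "id": sense_id}


print(make_node("ex-jalapeño-n-1"))
```

Raising on anything still invalid, rather than silently dropping characters, makes new problem cases visible so that a replacement pair can be added deliberately.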

To-do list:

  • Add the capability to decide the direction of relationships ("has_member" instead of "member_of") depending on application
  • Add the different relationship types as different Edge collections, in case I want to use only certain sets of relationships in the graph, or create different graphs that deal with different relationships (e.g. synset and sense only, to simplify traversal)
  • Use Global Wordnet Association's XML format DTD from http://globalwordnet.github.io/schemas/#xml
  • Create a script to write WordNet to Neo4j databases
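
The first two to-do items could be approached roughly as follows. This is a sketch of one possible design, not something the repo implements: INVERSE, orient and group_by_type are hypothetical names, the inverse map contains only the one pair mentioned above, and the sample edges are made up.

```python
from collections import defaultdict

# Hypothetical inverse map; only the pair named in the to-do list is included.
INVERSE = {"member_of": "has_member", "has_member": "member_of"}


def orient(edge, preferred):
    """Flip an edge when its own type is not preferred but its inverse is."""
    inverse = INVERSE.get(edge["type"])
    if edge["type"] not in preferred and inverse in preferred:
        return {"_from": edge["_to"], "_to": edge["_from"], "type": inverse}
    return edge


def group_by_type(edges):
    """Bucket edges by relation type, one bucket per would-be edge collection."""
    buckets = defaultdict(list)
    for edge in edges:
        buckets[edge["type"]].append(edge)
    return dict(buckets)


edges = [
    {"_from": "synsets/a", "_to": "synsets/b", "type": "member_of"},
    {"_from": "synsets/c", "_to": "synsets/b", "type": "hypernym"},
]
oriented = [orient(e, preferred={"has_member"}) for e in edges]
collections = group_by_type(oriented)
print(sorted(collections))
```

Each bucket could then be written to its own edge collection, so that a graph can be defined over any subset of relationship types.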

Below is all from the original Open English WordNet repo at the time of fork:

Open English WordNet

Open English WordNet is a lexical network of the English language grouping words into synsets and linking them according to relationships such as hypernymy, antonymy and meronymy. It is intended to be used in natural language processing applications and provides deep lexical information about the English language as a graph.

Open English WordNet is a fork of the Princeton WordNet developed under an open source methodology. The quality and veracity of the resource may differ from the Princeton WordNet and we welcome contributions. Contributions to this wordnet may eventually be incorporated into future releases of Princeton WordNet. Correspondence to previous versions and to wordnets in other languages is provided through the Collaborative Interlingual Index (CILI). The Open English WordNet is available as individual files in GWN-LMF format.

Releases

Open English WordNet is released through the Open English WordNet website. The versions released are

Usage

To compile these into a single file, please use the following scripts:

python scripts/from-yaml.py
python scripts/merge.py

This will create a file at wn31.xml that contains the complete wordnet.

Further conversions are available through the converter here.

Changes

We welcome changes. To make a change, please read our contributing guidelines and make a pull request.

Open English WordNet is a high-quality resource that acts as a gold standard for natural language processing. As such, we cannot accept any automatically generated results that have not been manually validated.

Please be aware that we use the Global WordNet Association LMF format, and please read the guidelines for using the format.

License

WordNet is released under CC-BY 4.0

References

The canonical citation for English Wordnet is:

More recent papers describing it include:

It incorporates material from:

  • Christiane Fellbaum, editor (1998) WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA.
  • Merrick Choo Yeu Herng and Francis Bond (2021) Taboo wordnet. In Proceedings of the 11th Global Wordnet Conference (GWC2021), University of South Africa (UNISA).

Contributors

  • John P. McCrae
  • Alexandre Rademaker
  • Ewa Rudnicka
  • Bernard Bou
  • Daiki Nomura
  • David Cillessen
  • Ciara O'Loughlin
  • Cathal McGovern
  • Francis Bond
  • Eric Kafe
  • Michael Wayne Goodman
  • Merrick Choo Yeu Herng
  • Enejda Nasaj
