UUDigitalHumanitieslab / compound-splitter

Wrapper and evaluation service for multiple Dutch compound splitters

Compound Splitter

This is a basic wrapper for multiple Dutch compound splitters. The purpose of this wrapper is to:

  • provide a unified API for multiple compound splitters; the package offers a simple socket server and a Flask application for this purpose
  • evaluate the accuracy of the different compound splitters

Intended audience

The package was initially developed for T-scan, a natural language analysis application intended for research. For T-scan, users needed to be able to choose between different algorithms (hence the unified API), and we needed some way to evaluate the quality of those algorithms.

The resulting package is useful if you want to run a compound splitting service (e.g. as part of an API or web application), or if you want to evaluate compound splitting methods. Adding new methods, even ones that are not Python packages, should be feasible if you have programming experience.

If you are looking for a simple, lightweight Python package for compound splitting, this is not it. compound-word-splitter may be a good alternative for you.

Compound splitting methods

The following compound splitters are included:

As a baseline, we also include a "never" algorithm, which never splits.

Requirements

  • Python 3.6+
  • Java (only required for MCS)

Installation

Installing with pip

compound-splitters-nl is available as a Python package that bundles the data for every included compound splitter method. The complete package is too large to be registered on PyPI, but you can download it from our releases.

The archived package can be installed via pip by installing the local file:

pip install compound-splitters-nl-*.tar.gz
# or substitute with your file path

If you want to use the web API, you will need to install additional dependencies:

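pip install compound-splitters-nl-*.tar.gz[web_api]
# if your shell does not expand the wildcard here, substitute the actual file name and quote the argument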

Installing from source code

You can also clone the source code repository. In this case, you still need to download and unpack the data that the compound splitter methods require. Install the dependencies, then retrieve and prepare the data:

pip install -r requirements.txt
python retrieve.py
python prepare.py

Tests

python -m unittest discover tests/

Evaluate Different Compound Algorithms

The following command evaluates the different algorithms against the reference files in test_sets.

python -m compound_splitter.evaluate
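
The metric and the file format used by compound_splitter.evaluate are not described here. As a rough illustration of what such an evaluation involves, the sketch below scores a splitter by exact-match accuracy against reference splits and includes a "never" baseline. The function names, the in-memory reference data, and the choice of metric are all assumptions made for illustration, not the package's implementation.

```python
from typing import Callable, Dict, List

# Hypothetical type: a splitter maps a compound to its list of parts.
Splitter = Callable[[str], List[str]]


def never_split(compound: str) -> List[str]:
    """Baseline that never splits: returns the word as a single part."""
    return [compound]


def accuracy(splitter: Splitter, reference: Dict[str, List[str]]) -> float:
    """Fraction of words whose predicted parts exactly match the reference."""
    if not reference:
        return 0.0
    correct = sum(1 for word, parts in reference.items() if splitter(word) == parts)
    return correct / len(reference)


# Made-up reference data; the real files in test_sets may look different.
reference = {
    "fietsbel": ["fiets", "bel"],
    "huis": ["huis"],
}

print(accuracy(never_split, reference))  # prints 0.5
```

A baseline like this shows how much of a test set consists of words that should not be split at all, which gives the scores of the real splitters a point of comparison.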

Run Web API

python -m compound_splitter.api_web

JSON Interface

GET /list

Lists the splitting methods.

GET /split/<method_name>/<compound>

Splits the compound using the specified method.
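
A client might call these two endpoints as sketched below. The base URL (http://localhost:5000, Flask's default development address) is an assumption, as is the use of "secos" as a method name for the web API. The shape of the JSON responses is not documented here, so the sketch does not interpret them. The helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the Flask app listens on Flask's default development address.
BASE_URL = "http://localhost:5000"


def list_url(base_url: str = BASE_URL) -> str:
    """Build the URL for GET /list."""
    return f"{base_url}/list"


def split_url(method_name: str, compound: str, base_url: str = BASE_URL) -> str:
    """Build the URL for GET /split/<method_name>/<compound>."""
    # Percent-encode both path segments so non-ASCII letters survive.
    return f"{base_url}/split/{quote(method_name, safe='')}/{quote(compound, safe='')}"


def get_json(url: str):
    """Fetch a URL and decode the JSON body (its structure is not assumed)."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (requires the web API to be running):
#   print(get_json(list_url()))
#   print(get_json(split_url("secos", "bedrijfsaansprakelijkheidsverzekering")))
```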

Run Simple Socket Server

python -m compound_splitter.socket_server
$ telnet localhost 7005
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
bedrijfsaansprakelijkheidsverzekering,secos
bedrijfs,aansprakelijkheids,verzekeringConnection closed by foreign host.
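
The same exchange can be scripted from Python. The sketch below assumes what the telnet session suggests: the server listens on port 7005, accepts one line of the form compound,method, replies with the comma-separated parts, and then closes the connection. Whether the server needs the trailing line break that telnet sends is an assumption, and the helper names are hypothetical.

```python
import socket
from typing import List


def format_request(compound: str, method: str) -> bytes:
    """Encode a request as 'compound,method' followed by CRLF, as telnet sends it."""
    return f"{compound},{method}\r\n".encode("utf-8")


def parse_response(data: bytes) -> List[str]:
    """Decode the reply and split it into parts on commas."""
    text = data.decode("utf-8").strip()
    return text.split(",") if text else []


def split_via_socket(compound: str, method: str,
                     host: str = "localhost", port: int = 7005) -> List[str]:
    """Send one request and read until the server closes the connection."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall(format_request(compound, method))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # empty read: the server closed the connection
                break
            chunks.append(chunk)
    return parse_response(b"".join(chunks))
```

With the socket server running, split_via_socket("bedrijfsaansprakelijkheidsverzekering", "secos") would be expected to return the three parts shown in the session above.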

License: BSD 3-Clause "New" or "Revised" License

