ayberkcal / trnltk-java

Turkish Natural Language Toolkit

""" Copyright 2012-2013 Ali Ok (aliokATapacheDOTorg)

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. """

Turkish Natural Language Toolkit

This project provides a toolkit for computational linguistics work on Turkish.

Currently, a morphological parser and a tokenizer are provided. The biggest remaining challenge is providing an ambiguity resolver.

The project was first implemented in Python (TRNLTK Python), then in Java. The Python project is obsolete.

See documentation, tutorial and cookbook

Motivation

Why another parsing tool and why FSM?

I've inspected other approaches and found that tracking down problems is very hard with them. For example, one approach is to create a suffix graph by defining which suffix can come after which other suffix. But with that approach it is impossible to keep an overview of the graph, since there would be thousands of nodes and edges.
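To make the suffix-graph approach concrete, here is a minimal sketch of such a graph as an adjacency map. The class and method names (`SuffixGraphSketch`, `allow`, `canFollow`) are hypothetical and are not part of this project's API; the tag names are borrowed from the parse results quoted in the plan below, and the edges are illustrative only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration: a suffix graph stored as "suffix -> suffixes that may follow it".
class SuffixGraphSketch {
    private final Map<String, Set<String>> successors = new LinkedHashMap<>();

    // Declares that suffix `to` may directly follow suffix `from`.
    void allow(String from, String to) {
        successors.computeIfAbsent(from, k -> new LinkedHashSet<>()).add(to);
        successors.computeIfAbsent(to, k -> new LinkedHashSet<>());
    }

    boolean canFollow(String from, String to) {
        return successors.getOrDefault(from, Collections.<String>emptySet()).contains(to);
    }

    // A sequence is valid when every adjacent pair is an edge in the graph.
    boolean isValidSequence(List<String> suffixes) {
        for (int i = 0; i + 1 < suffixes.size(); i++) {
            if (!canFollow(suffixes.get(i), suffixes.get(i + 1))) {
                return false;
            }
        }
        return true;
    }

    int nodeCount() {
        return successors.size();
    }

    int edgeCount() {
        int total = 0;
        for (Set<String> targets : successors.values()) {
            total += targets.size();
        }
        return total;
    }

    public static void main(String[] args) {
        SuffixGraphSketch graph = new SuffixGraphSketch();
        // Illustrative edges only; the real graph would need far more.
        graph.allow("A3pl", "Pnon");
        graph.allow("Pnon", "Acc");
        graph.allow("A3pl", "P3sg");
        graph.allow("P3sg", "Nom");
        System.out.println(graph.isValidSequence(Arrays.asList("A3pl", "Pnon", "Acc")));
        System.out.println(graph.nodeCount() + " nodes, " + graph.edgeCount() + " edges");
    }
}
```

Every rule is a separate edge, so each new suffix adds edges to and from many existing nodes, which is why a hand-maintained graph of this kind becomes hard to survey as it grows.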

See documentation for more information.

The phonetic rules and their implementation are similar to those of the open-source Java library Zemberek3.

How is it tested?

There are thousands of parsing unit tests. In addition, I use the METU-Sabanci treebank, which is closed-source. Unfortunately, its license doesn't allow anyone to publish any portion of the treebank, so I only test the parser against it in my local environment.

Plan

  1. DONE: Get rid of unused stuff as much as possible, such as:
     * suffix-based parsing (deprecated by form-based parsing)
  2. DONE: Fix the build!
  3. SKIPPED: Write a graphical tool to build suffix graphs.
     * Support editing a suffix graph
     * Show form-based graph immediately
     * Always save backups upon edit!
     * Show related portion of form-based graph whenever a suffix is selected in suffix-based graph
  4. Prepare for reducing ambiguity in the suffix graph. Fill in reducedAmbiguity.ContextlessMorphologicParserBasicSuffixGraphTest and remove its @Ignore annotation
  5. Reduce ambiguity in suffix graph. E.g. discard parse results like
     * "sokakları", "sokak(sokak)+Noun+A3pl(lAr[lar])+Pnon+Acc(+yI[ı])", "sokak(sokak)+Noun+A3pl(lAr[lar])+P3sg(+sI[ı])+Nom", "sokak(sokak)+Noun+A3pl(lAr[lar])+P3pl(!I[ı])+Nom", "sokak(sokak)+Noun+A3sg+P3pl(lAr!I[ları])+Nom"
     * ...
  6. Write a disambiguator with hunch-based parameters and metrics
  7. Use machine learning techniques to determine metrics and parameters
  8. Apply ideas:
     * Ambiguity classification (to apply similar disambiguation techniques to similar ambiguities)
     * Critical surface tagging (solve "easy-win"s)
     * Implement a parse tree for POS tagging
     * Integrate disambiguation and POS tagging
     * Proper noun identification
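
The parse results quoted in step 5 follow a visible pattern: a root, then `+`-separated tags, each optionally followed by a parenthesized suffix form. As a rough sketch of how such strings could be compared when hunting for ambiguity, the following splits one into its tag names. `ParseResultSketch` and its methods are hypothetical helpers written for this README, not classes from the project, and the format is inferred only from the examples above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a parse result string into its parts and tag names.
class ParseResultSketch {

    // Splits on '+' only outside parentheses, because forms such as "(+yI[...])" contain '+'.
    static List<String> splitParts(String parseResult) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;
        for (int i = 0; i < parseResult.length(); i++) {
            char c = parseResult.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            }
            if (c == '+' && depth == 0) {
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());
        return parts;
    }

    // Drops the parenthesized form: "A3pl(lAr[lar])" becomes "A3pl".
    static String tagName(String part) {
        int paren = part.indexOf('(');
        return paren < 0 ? part : part.substring(0, paren);
    }

    // Returns the tag names after the root, in order.
    static List<String> tags(String parseResult) {
        List<String> parts = splitParts(parseResult);
        List<String> tags = new ArrayList<>();
        for (int i = 1; i < parts.size(); i++) {
            tags.add(tagName(parts.get(i)));
        }
        return tags;
    }

    public static void main(String[] args) {
        // \u0131 is the dotless i used in the quoted parse result.
        String parse = "sokak(sokak)+Noun+A3pl(lAr[lar])+Pnon+Acc(+yI[\u0131])";
        System.out.println(tags(parse));
    }
}
```

Reducing each result to its tag sequence makes it easier to see that the four parses of "sokakları" differ only in their agreement, possessive and case tags.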
