pavelsivanovs / lv-neologism-detector

An NLP system for detecting syntactic neologisms in the Latvian language.

🇱🇻 Automatic Neologism Detector

Automatic neologism detector for the Latvian language. Author: Pavels Ivanovs

Description

This is Pavels Ivanovs' project for the bachelor's thesis Automatic Neologism Detection [1].

The goal of this project is to create an NLP tool that extracts from a submitted text the words most likely to be added to the vocabulary of the Latvian language, specifically to Tēzaurs.lv, the largest publicly available thesaurus of the Latvian language.

Methodology

Two main approaches are used to achieve the goal of the project:

  1. Exclusion lists. Words from the input text are filtered out if their lemmas are found in the vocabulary. Lemmatization is provided by LVTagger and NLP-PIPE.
  2. Classification by a machine-learning model. A neural network is used for classification. Input features, such as word length and the Levenshtein distance to the closest vocabulary entry, are extracted from each word and fed to the neural network, which outputs the probability that the word will be included in the vocabulary.
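The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the function names (`levenshtein`, `filter_candidates`, `extract_features`), the toy vocabulary, and the assumption that the input is already lemmatized are all hypothetical. In the project itself, lemmas come from LVTagger and NLP-PIPE, and the feature set may be larger than the two features named in the text.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def filter_candidates(lemmas, vocabulary):
    """Exclusion-list step: keep each lemma that is absent from the vocabulary."""
    seen = set()
    candidates = []
    for lemma in lemmas:
        key = lemma.lower()
        if key not in vocabulary and key not in seen:
            seen.add(key)
            candidates.append(key)
    return candidates


def extract_features(word, vocabulary):
    """The two features named in the text: length and distance to the closest entry."""
    closest = min(levenshtein(word, entry) for entry in vocabulary)
    return {"length": len(word), "min_levenshtein": closest}


# Hypothetical toy data; the real vocabulary would be taken from Tēzaurs.lv.
vocabulary = {"māja", "suns", "dators"}
lemmas = ["māja", "datorēt", "suns", "Datorēt"]

candidates = filter_candidates(lemmas, vocabulary)
features = [extract_features(word, vocabulary) for word in candidates]
```

The resulting feature dictionaries would then be turned into numeric vectors and passed to the classifier. Comparing against every vocabulary entry is quadratic in practice, so a real vocabulary would likely need an index or a length-based pre-filter.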

Results

After training, the model's performance is as follows (x-axis: batch number; y-axis: metric value):

Testing metrics of the model

  • Accuracy (Pareizība): 77.86%
  • Precision (Precizitāte): 40.56%
  • Recall (Pārklājums): 61.73%
  • F-score (F-mērs): 46.80%
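For reference, the four metrics above follow the standard definitions over a binary confusion matrix. The sketch below uses hypothetical counts, not the thesis data, and the function names are illustrative. It also shows that the harmonic mean of the reported precision and recall is about 48.95%, so the reported F-score of 46.80% was presumably computed differently (for example, averaged over batches); the source does not say.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Hypothetical counts chosen only to illustrate the formulas.
example = classification_metrics(tp=50, fp=70, tn=350, fn=30)

# F1 implied by the precision and recall reported above.
implied_f1 = f1_from(0.4056, 0.6173)
```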

The test metrics show that the model still has room for improvement. There are two main options: optimizing the dataset (oversampling and increasing the overall number of records) and optimizing the model, including changes to the neural network structure and further experiments with the number of epochs and the learning rate.
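As an illustration of the oversampling option, the sketch below performs simple random oversampling: records of the smaller class are duplicated at random until every class has as many records as the largest one. The record layout, the `label` key, and the function name are assumptions made for this example, and the project may use a different technique.

```python
import random


def oversample_minority(records, label_key="label", seed=0):
    """Randomly duplicate records so that every class reaches the largest class size."""
    if not records:
        return []
    rng = random.Random(seed)
    by_label = {}
    for record in records:
        by_label.setdefault(record[label_key], []).append(record)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap; k is 0 for the largest class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical dataset: 2 neologisms (label 1) and 6 ordinary words (label 0).
records = [{"word": f"w{i}", "label": 1 if i < 2 else 0} for i in range(8)]
balanced = oversample_minority(records)
```

Oversampling is normally applied to the training split only, so that duplicated records do not leak into the test set and inflate the metrics.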

Requirements

  • Python v3.10
  • Docker Compose

References

[1] P. Ivanovs, "Jaunvārdu automātiska atpazīšana," Bakalaura darbs, Datorikas fakultāte, Latvijas Universitāte, Rīga, Latvija, 2023.

About

License: MIT License

