makrai / awesome-hungarian-nlp

A curated list of NLP resources for Hungarian

Awesome NLP Resources for Hungarian

A curated list of free resources dedicated to Hungarian Natural Language Processing

Maintainers - György Orosz

Table of contents

  1. Tools
  2. Datasets
  3. Journals / Conferences / Institutes / Events
  4. Courses / Tutorials
  5. Blogs / Communities

1. Tools

Notations:

  • 👌 Easy to install and use
  • 🚀 Commercial-friendly license
  • 💯 Pretrained models are available or not needed

Word tokenization, sentence splitting

  • huntoken 👌🚀💯 Hungarian word and sentence splitter
  • quntoken 👌🚀💯 New Hungarian tokenizer based on quex and huntoken
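
To give a feel for what these tools do, here is a minimal sketch of sentence splitting and word tokenization using only the Python standard library. It is not the algorithm used by huntoken or quntoken; the two regular expressions and the function names are assumptions made for illustration, and they ignore abbreviations, dates, and other cases the real tools handle.

```python
import re

# Naive patterns for illustration only (assumed, not taken from huntoken/quntoken).
# A sentence boundary is whitespace that follows ., ! or ? and precedes an
# uppercase letter, including accented Hungarian capitals.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-ZÁÉÍÓÖŐÚÜŰ])")
# A token is either a word (hyphenated words kept whole) or one punctuation mark.
TOKEN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def split_sentences(text):
    """Split text into sentences at the naive boundaries defined above."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence):
    """Return the word and punctuation tokens of one sentence."""
    return TOKEN.findall(sentence)


text = "Szia! Ez egy rövid példa. Működik az e-mail-cím is?"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

for sentence_tokens in tokens:
    print(sentence_tokens)
```

A rule set like this breaks on inputs such as "dr. Kovács", which is one reason dedicated tokenizers are listed here.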

Morphology

  • emMorph (Humor) 💯 Hungarian morphological analyzer based on Humor
  • hunmorph 🚀💯 is an open-source tool and programming library for spell-checking, stemming, and morphological analysis of agglutinative languages, German, and other languages.
  • hunmorph-foma 🚀💯 Hungarian morphological analyzer and generator based on hunmorph.
  • hunspell 👌🚀💯 is an open-source spell-checker, stemmer, and morphological analyzer
  • lara-hungarian-nlp 👌🚀💯 LARA is a lightweight Python NLP library for chatbots in Hungarian.
  • Lemmagen 👌🚀💯 This project aims to provide a standardized open-source multilingual platform for lemmatization. (Python package for v2 | C# project for v3)

PoS / Morphological taggers

  • hunpos 👌🚀💯 Hunpos is an open-source reimplementation of TnT, the well-known part-of-speech tagger by Thorsten Brants.
  • PurePos 👌🚀 Open-source morphological tagger based on HunPos
  • purepos.py 👌🚀 Python wrapper for PurePos

Taggers / Chunkers

  • HunTag 👌🚀 A sequential tagger for NLP using Maximum Entropy Learning and Hidden Markov Models
  • HunTag3 👌🚀 Improved version of the original HunTag
  • SzegedNER 👌🚀💯 Named Entity Recognition tool for Hungarian and English
  • DBpedia Spotlight 👌🚀💯 DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text. (Docker image)

Pipelines with Hungarian NLP components

  • magyarlanc 👌💯 A toolkit for the basic linguistic processing of Hungarian
  • magyarlanc_spark 👌💯 Spark wrapper for magyarlanc
  • spaCy 👌🚀💯 Industrial-strength Natural Language Processing (NLP) with Python and Cython (Hungarian models)
  • huNLP 👌💯 Unified Java and REST API for magyarlanc and szegedNER
  • hunlp-GATE 💯 GATE plugin containing Hungarian NLP tools as GATE processing resources
  • Trendminer Hungarian Processing Pipeline 🚀 Hungarian NLP pipeline for social media text analysis (TrendMiner project)
  • Google Syntaxnet 🚀💯 Neural Models of Syntax
  • UDPipe 👌🚀💯 is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files
  • polyglot 👌🚀💯 is a natural language pipeline that supports massive multilingual applications.
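
Several of these pipelines, UDPipe among them, read and write CoNLL-U files. As a rough sketch of how such output can be consumed, the following standard-library Python reads CoNLL-U text into one list of token dictionaries per sentence. The function name `parse_conllu` and the sample sentence are assumptions made for illustration; the sample is hand-written, not actual UDPipe output, and its annotations are only indicative. Only the ten-column, tab-separated layout comes from the CoNLL-U format itself.

```python
# The ten tab-separated CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Comment lines (e.g. "# text = ...") are skipped in this sketch.
            continue
        else:
            cols = line.split("\t")
            if len(cols) == len(FIELDS):
                current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample for illustration only (not produced by a real tagger).
SAMPLE = (
    "# text = A kutya ugat.\n"
    "1\tA\ta\tDET\t_\tDefinite=Def|PronType=Art\t2\tdet\t_\t_\n"
    "2\tkutya\tkutya\tNOUN\t_\tCase=Nom|Number=Sing\t3\tnsubj\t_\t_\n"
    "3\tugat\tugat\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
)

parsed = parse_conllu(SAMPLE)
for token in parsed[0]:
    print(token["form"], token["lemma"], token["upos"], token["deprel"])
```

Multiword-token ranges and empty nodes are kept as ordinary rows here; a fuller reader would treat those IDs separately.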

Syntactic parsers

  • hunpars 🚀💯 A rule-based Hungarian syntactic analyzer
  • HunParse 🚀💯 An NLTK-based parser using KR-style morphological annotation
  • Anagramma Parser A parser based on psycholinguistic principles

Semantic analysis

  • SentimentAnalysisHUN 👌🚀💯 is an open-source sentiment analysis tool for the Hungarian language, written in Python.

Other

  • emLam 👌🚀💯 Preprocessing scripts for Hungarian Language Modeling
  • pywnxml 👌🚀💯 Python3 API for WordNet XML (Hungarian WordNet / BalkaNet / VisDic format)

2. Datasets

Corpora

  • Hungarian Webcorpus With over 1.48 billion words unfiltered (589m words fully filtered), this is by far the largest Hungarian language corpus, and unlike the Hungarian National Corpus (125m words), it is available in its entirety under a permissive Open Content license.
  • emLam A Language Modeling Benchmark Corpus for Hungarian, similar to the One Billion Word corpus (Chelba, 2014) for English.
  • Leipzig corpora Each corpus contains randomly selected sentences in its language and is available in sizes from 10,000 sentences up to 1 million sentences. The sources are either newspaper texts or texts randomly collected from the web.
  • web2corpus Automatically create multilingual web corpus
  • CoNLL 2017: Automatically Annotated Raw Texts and Word Embeddings Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts in 45 languages, generated by UDPipe, together with word embeddings of dimension 100 computed from lowercased texts by word2vec
  • OpinHuBank OpinHuBank is a human-annotated corpus to aid the research of opinion mining and sentiment analysis in Hungarian
  • The Hungarian forum corpus for Opinion Mining This database is the first one dedicated to Opinion Mining in Hungarian. The data for further processing were gathered from the posts of the forum topic of the Hungarian government portal dealing with the referendum about dual citizenship.
  • Szeged Treebank The Szeged Treebank is the largest fully manually annotated treebank of the Hungarian language
  • Szeged Dependency Treebank The Szeged Dependency Treebank is a dependency-tree format version of the Szeged Treebank.
  • Universal Dependencies
  • Hungarian Named Entity Corpora The Named Entity Corpus for Hungarian is a subcorpus of the Szeged Treebank, which contains full syntactic annotations done manually by linguist experts.
  • hunNERwiki a silver standard corpus for Hungarian Named Entity Recognition
  • Mazsola database contains 28M sentences from the MNSZ1 corpus annotated with shallow syntactic analysis
  • Hungarian word sense disambiguated corpus containing 39 suitable word form samples for the purpose of word sense disambiguation
  • HunLearner is a learners' corpus of Hungarian containing written data from 35 students majoring in Hungarian studies at the University of Zagreb, Croatia. Texts were morphologically and syntactically analyzed by the magyarlanc tool.
  • Hunglish Corpus The Hunglish Corpus is a free sentence-aligned Hungarian-English parallel corpus of about 120 million words in 4 million sentence pairs.
  • SzegedParallel The English-Hungarian parallel corpus contains texts selected on the basis of grammatical and translational criteria.
  • HunOr A Hungarian-Russian parallel corpus of approximately 800 thousand words.
  • CoNLL 2017 Shared Task Hungarian data Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts from the Common Crawl

Word vectors

Linguistic Resources

  • morphdb.hu is an open source morphological database of Hungarian, consisting of a lexicon and morphological grammar that are based on well-founded theoretical decisions.
  • huwn Hungarian Wordnet
  • Hungarian Sentiment Lexicon The dictionaries were manually created on the basis of Wordnet-Affect lexicons.
  • 4lang Concept dictionary using Eilenberg machines
  • Named Entity lists for Hungarian
  • Mazsola ISZ lists 500K verb frames extracted from the Mazsola database
  • Manocska merges verb frames from existing databases

Linked Open Data

3. Journals / Conferences / Institutes / Events

Journals

Conferences

Institutes

4. Courses / Tutorials

TBD

5. Blogs / Communities
