DA7OUD / hazm

Persian NLP Toolkit

Home Page: https://www.roshan-ai.ir/hazm/

Hazm

Python library for digesting Persian text.

  • Text cleaning
  • Sentence and word tokenizer
  • Word lemmatizer
  • POS tagger
  • Shallow parser
  • Dependency parser
  • Interfaces for Persian corpora
  • NLTK compatible

Documentation

Visit https://roshan-ai.ir/hazm/docs to view the full documentation.

Modules accuracy

Module name        Accuracy                    Pre-trained model
Lemmatizer         89.9%                       -
Chunker            93.4%                       download pre-trained model
POSTagger          97.2% (universal: 98.8%)    download pre-trained model
DependencyParser   97.1%                       download pre-trained model

Installation

The latest stable version of Hazm can be installed through pip:

pip install hazm

To test or use Hazm with the latest updates, install it directly from the repository instead:

pip install https://github.com/roshan-research/hazm/archive/master.zip --upgrade
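After either install command, it can be useful to confirm which version ended up in the current environment. The following is a minimal sketch that relies only on the standard library's importlib.metadata; the helper name installed_version is hypothetical and is not part of Hazm.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="hazm"):
    """Return the installed version string of `package`, or None if it is absent.

    `installed_version` is an illustrative helper, not a Hazm function.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(f"hazm {found}" if found else "hazm is not installed")
```

Returning None instead of raising keeps the check usable in setup scripts that want to decide whether to run pip at all.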

Usage

>>> from hazm import *

>>> normalizer = Normalizer()
>>> normalizer.normalize('اصلاح نويسه ها و استفاده از نیم‌فاصله پردازش را آسان مي كند')
'اصلاح نویسه‌ها و استفاده از نیم‌فاصله پردازش را آسان می‌کند'

>>> sent_tokenize('ما هم برای وصل کردن آمدیم! ولی برای پردازش، جدا بهتر نیست؟')
['ما هم برای وصل کردن آمدیم!', 'ولی برای پردازش، جدا بهتر نیست؟']
>>> word_tokenize('ولی برای پردازش، جدا بهتر نیست؟')
['ولی', 'برای', 'پردازش', '،', 'جدا', 'بهتر', 'نیست', '؟']

>>> stemmer = Stemmer()
>>> stemmer.stem('کتاب‌ها')
'کتاب'
>>> lemmatizer = Lemmatizer()
>>> lemmatizer.lemmatize('می‌روم')
'رفت#رو'

>>> tagger = POSTagger(model='resources/pos_tagger.model')
>>> tagger.tag(word_tokenize('ما بسیار کتاب می‌خوانیم'))
[('ما', 'PRO'), ('بسیار', 'ADV'), ('کتاب', 'N'), ('می‌خوانیم', 'V')]

>>> chunker = Chunker(model='resources/chunker.model')
>>> tagged = tagger.tag(word_tokenize('کتاب خواندن را دوست داریم'))
>>> tree2brackets(chunker.parse(tagged))
'[کتاب خواندن NP] [را POSTP] [دوست داریم VP]'

>>> parser = DependencyParser(tagger=tagger, lemmatizer=lemmatizer)
>>> parser.parse(word_tokenize('زنگ‌ها برای که به صدا درمی‌آید؟'))
<DependencyGraph with 8 nodes>
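The outputs shown above are plain Python strings and tuples, so they can be post-processed without further Hazm calls. The sketch below is one possible way to do that. It assumes, based only on the examples above, that a verb lemma has the form "past stem#present stem", that the tagger returns (word, tag) pairs, and that tree2brackets returns chunks as "[words LABEL]". The helper names split_verb_lemma, words_with_tag and parse_brackets are hypothetical and are not part of Hazm.

```python
import re


def split_verb_lemma(lemma):
    """Split a lemma such as 'رفت#رو' into (past_stem, present_stem).

    Assumption: '#' separates the two verb stems, as in the lemmatizer
    example above. Lemmas without '#' are returned as (lemma, None).
    """
    past, separator, present = lemma.partition("#")
    return (past, present) if separator else (lemma, None)


def words_with_tag(tagged, tag):
    """Return the words from a list of (word, tag) pairs that carry `tag`."""
    return [word for word, word_tag in tagged if word_tag == tag]


# Assumption: each chunk looks like "[one or more words LABEL]" with an
# upper-case label, as in the chunker example above. Tokens that appear
# outside square brackets are ignored by this pattern.
CHUNK_PATTERN = re.compile(r"\[([^\]]+) ([A-Z]+)\]")


def parse_brackets(text):
    """Turn a bracketed chunk string into a list of (phrase, label) pairs."""
    return CHUNK_PATTERN.findall(text)


if __name__ == "__main__":
    print(split_verb_lemma("رفت#رو"))
    print(words_with_tag([("ما", "PRO"), ("کتاب", "N")], "N"))
    print(parse_brackets("[را POSTP]"))  # prints [('را', 'POSTP')]
```

Keeping these helpers independent of Hazm means they can be unit-tested on recorded outputs, without loading the tagger or chunker models.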

Hazm in other languages

Disclaimer: These ports are not developed or maintained by Roshan. They may not have the same functionality or quality as the original Hazm.

  • JHazm: A Java port of Hazm
  • NHazm: A C# port of Hazm

Contribution

We welcome and appreciate contributions to this repo, including bug reports, feature requests, code improvements, and documentation updates. Please follow the Contribution guideline when contributing. You can open an issue, or fork the repo, write your code, and create a pull request for review and feedback. Thank you for your interest in and support of this repo!

Thanks

Code contributors

Others

  • Thanks to the Virastyar project for providing the Persian word list.

About

License: MIT License


Languages

Language:Python 100.0%