BIOnBijan

Bijankhan Corpus

The Bijankhan corpus (Persian: پیکرهٔ بی‌جن‌خان‎) is a tagged corpus suitable for natural language processing (NLP) research on the Persian language. It was gathered from daily news and common texts, and its documents are categorized into about 4300 subject categories, such as political and cultural. The corpus contains about 2.6 million manually tagged words, annotated with a tag set of 550 Persian part-of-speech tags.

The Bijankhan corpus was created by the Database Research Group at the University of Tehran. The corpus is non-free in that it is not free for commercial use, although these restrictions vary by country. It is named after Mahmood Bijankhan, professor of linguistics at the University of Tehran, in recognition of his contributions in this area.

Bigram

A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2. The frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text.
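As a rough illustration of the frequency distribution described above, the following sketch counts word bigrams in a short token list. The function name `bigram_counts` and the toy Persian sentence are assumptions made for this example; they are not taken from the repository's code or from the Bijankhan corpus.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count every pair of adjacent tokens (hypothetical helper)."""
    # zip pairs each token with its successor: (t0, t1), (t1, t2), ...
    return Counter(zip(tokens, tokens[1:]))


# Toy sentence for illustration only, split on whitespace.
tokens = "من کتاب را خواندم و من کتاب را دوست دارم".split()
counts = bigram_counts(tokens)

# Print the bigrams from most to least frequent.
for (first, second), freq in counts.most_common():
    print(first, second, freq)
```

A sequence of n tokens yields n - 1 bigrams, so the counts here sum to one less than the number of tokens.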

AddOneSmoothing

The two programs in this repository calculate the bigram probability of your Persian text based on the Bijankhan corpus: one uses add-one smoothing and the other uses no smoothing.
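The difference between the two variants can be sketched as follows. This is a minimal illustration, not the repository's actual code: the names `train` and `bigram_prob` and the toy token list are assumptions. Without smoothing, the estimate is count(w1, w2) / count(w1); with add-one (Laplace) smoothing it is (count(w1, w2) + 1) / (count(w1) + V), where V is the vocabulary size. As a simplification, the sketch uses the plain unigram count of w1 as the denominator.

```python
from collections import Counter


def train(tokens):
    """Return unigram and bigram counts for a token list (hypothetical helper)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(w1, w2, unigrams, bigrams, add_one=False):
    """Estimate P(w2 | w1), optionally with add-one smoothing."""
    vocab_size = len(unigrams)  # V: number of distinct word types
    if add_one:
        # Every possible bigram gets one extra count, so unseen pairs
        # receive a small non-zero probability.
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)
    if unigrams[w1] == 0:
        return 0.0  # w1 never seen: no evidence for any continuation
    return bigrams[(w1, w2)] / unigrams[w1]


# Toy data for illustration only; a real run would use the corpus tokens.
tokens = "من کتاب را خواندم و من کتاب را دوست دارم".split()
unigrams, bigrams = train(tokens)

seen = (tokens[0], tokens[1])    # a pair that occurs in the toy data
unseen = (tokens[0], tokens[9])  # a pair that never occurs
for w1, w2 in (seen, unseen):
    print(bigram_prob(w1, w2, unigrams, bigrams),
          bigram_prob(w1, w2, unigrams, bigrams, add_one=True))
```

The unsmoothed estimate assigns zero probability to any bigram absent from the training data, which makes the probability of a whole sentence zero as soon as it contains one unseen pair. Add-one smoothing avoids this at the cost of shifting probability mass away from observed bigrams.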
