urdu-nlp urdu-language urdu urdu-text-processsing python python3 stemming stemming-algorithm

Urdu Stemmer

This is a Python-based Urdu stemmer. Given a list of words, it tries to find their stems using a limited list of affixes defined in the program.

stemmer.py: This file contains the logic and implementation of the stemmer. It uses regular expressions to find prefixes at the start of a word and suffixes at the end of the word.

The affixes currently defined are:

urduPrefixes = ['بے', 'بد', 'لا', 'ے', 'نا', 'با', 'کم', 'ان', 'اہل', 'کم']
urduSuffixes = ['دار', 'وں', 'یاں', 'یں', 'ات', 'گوار', 'ور', 'پسند']

To find a prefix, it uses this regular expression:

checkPrefix = re.search(rf'\A{prefix}', urduWord)

To find a suffix it uses this regular expression:

checkSuffix = re.search(rf"{suffix}\Z", urduWord)
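
As a rough illustration of how these two checks could be combined, here is a minimal sketch. It is not the code in stemmer.py: the function name `strip_affixes` is hypothetical, the policy of removing at most one prefix and one suffix (first match in list order) is an assumption, and `re.escape` is added only as a precaution.

```python
import re

urduPrefixes = ['بے', 'بد', 'لا', 'ے', 'نا', 'با', 'کم', 'ان', 'اہل', 'کم']
urduSuffixes = ['دار', 'وں', 'یاں', 'یں', 'ات', 'گوار', 'ور', 'پسند']


def strip_affixes(urduWord):
    """Hypothetical helper: strip at most one prefix and one suffix."""
    # \A anchors the match at the very start of the word.
    for prefix in urduPrefixes:
        if re.search(rf'\A{re.escape(prefix)}', urduWord):
            urduWord = urduWord[len(prefix):]
            break  # assumption: only the first matching prefix is removed
    # \Z anchors the match at the very end of the word.
    for suffix in urduSuffixes:
        if re.search(rf'{re.escape(suffix)}\Z', urduWord):
            urduWord = urduWord[:-len(suffix)]
            break  # assumption: only the first matching suffix is removed
    return urduWord


if __name__ == '__main__':
    for word in ['کتابوں', 'بدنام']:
        print(word, strip_affixes(word))
```

Because the matching is purely pattern-based, a word that merely begins or ends with one of these letter sequences is also trimmed, which is one reason the affix list is described as limited.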

urdu-affixes.txt: This file contains the input words for stemmer.py. It contains two columns, which are read right to left, as Urdu text is.

The words in the rightmost column are the input to the program. The stemmer reads them and finds their stems.
The words in the leftmost column are the actual stems of the words in the rightmost column. These are written manually to calculate the accuracy of the program, i.e. how many stems the program found correctly.
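
The accuracy check described above could be sketched as follows. The helper names `parse_pairs` and `accuracy` are hypothetical, and two details of the file format are assumptions: that the columns are separated by whitespace, and that the input word (displayed on the right) is stored first on each line, followed by its expected stem.

```python
def parse_pairs(lines):
    """Yield (input_word, expected_stem) tuples from two-column lines.

    Assumption: columns are whitespace-separated and the input word is
    stored first in each line, even though it is displayed on the right.
    """
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            yield parts[0], parts[1]


def accuracy(pairs, stem):
    """Return the fraction of words for which stem(word) equals the expected stem."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(1 for word, expected in pairs if stem(word) == expected)
    return correct / len(pairs)
```

With a stemming function at hand, the file could be scored by opening it with `open('urdu-affixes.txt', encoding='utf-8')` and passing the file object to `parse_pairs`, then passing the result and the stemming function to `accuracy`.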
