classifier naive-bayes-classifier nepali nepali-news nlp svm-classifier text-categorization text-classification tf-idf

16NepaliNews Corpus

The '16 Nepali News' data set is a collection of approximately 14,364 Nepali language news documents, partitioned (unevenly) across 16 different newsgroup: Auto, Bank, Blog, Business Interview, Economy, Employment, Entertainment, Interview, Literature, National News, Opinion, Sports, Technology, Tourism, and World.

This '16 Nepali News' data set was inspired from 20 newsgroups dataset.

Loading the Corpus

MLCOMPDIR = r'LOCATION OF CORPUS'

trainNews = load_mlcomp('16NepaliNews', 'train', mlcomp_root= MLCOMPDIR)
testNews = load_mlcomp('16NepaliNews', 'test', mlcomp_root= MLCOMPDIR)

Or Manually Preparing Training and Test Set

news = load_mlcomp('16NepaliNews', 'raw', mlcomp_root= MLCOMPDIR)

''' Testing and Training Data '''
SPLIT_PERCENT = 0.9

splitSize = int(len(news.data) * SPLIT_PERCENT)
print(splitSize)
xTrain = news.data[:splitSize]
xTest = news.data[splitSize:]
yTrain = news.target[:splitSize]
yTest = news.target[splitSize:]

Executing the code

Before execution, copy the file 'nepali' to the stop words directory of your nltk-data/corpora folder.

License

This '16NepaliNews' corpus is licensed under GPLv3

Author

sndsabin

This Corpus was developed by parsing and scrapping contents published from 2015 on different online news portals. All the news contents belong to their respective owners.

About

Text Classification of Nepali Language Document. This Mini Project was done for the partial fulfillment of NLP Course : COMP 473.

classifier naive-bayes-classifier nepali nepali-news nlp svm-classifier text-categorization text-classification tf-idf

GNU General Public License v3.0

Languages

Language:Python 100.0%