The '16 Nepali News' data set is a collection of approximately 14,364 Nepali-language news documents, partitioned (unevenly) across 16 different newsgroups: Auto, Bank, Blog, Business Interview, Economy, Employment, Entertainment, Interview, Literature, National News, Opinion, Sports, Technology, Tourism, and World.
The '16 Nepali News' data set was inspired by the 20 Newsgroups data set.
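# load_mlcomp shipped with older scikit-learn releases (it was later deprecated and removed)
from sklearn.datasets import load_mlcomp

MLCOMPDIR = r'LOCATION OF CORPUS'  # root folder that contains the corpus
trainNews = load_mlcomp('16NepaliNews', 'train', mlcomp_root=MLCOMPDIR)
testNews = load_mlcomp('16NepaliNews', 'test', mlcomp_root=MLCOMPDIR)
news = load_mlcomp('16NepaliNews', 'raw', mlcomp_root=MLCOMPDIR)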
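Because the documents are spread unevenly across the categories, it can help to look at the per-category counts before training. The following is a minimal sketch, not part of the original code: the helper name category_counts is hypothetical, and it assumes the loaded object exposes integer labels in news.target and category names in news.target_names.

```python
from collections import Counter


def category_counts(targets, target_names):
    """Return {category name: number of documents} for a list of integer labels.

    `category_counts` is a hypothetical helper for illustration only.
    """
    counts = Counter(targets)
    # Map each integer label back to its human-readable category name.
    return {target_names[label]: counts[label] for label in sorted(counts)}
```

Under those assumptions, a call such as category_counts(list(news.target), news.target_names) would give one count per category.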
# Testing and training data: the first SPLIT_PERCENT of the documents is used for training, the rest for testing
SPLIT_PERCENT = 0.9
splitSize = int(len(news.data) * SPLIT_PERCENT)
print(splitSize)
xTrain = news.data[:splitSize]
xTest = news.data[splitSize:]
yTrain = news.target[:splitSize]
yTest = news.target[splitSize:]
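The slicing above keeps the documents in whatever order the loader returned them. If an explicit, reproducible shuffle before the split is wanted, something along the following lines could be used. This is only a sketch under stated assumptions: the function name shuffled_split and the seed parameter are hypothetical and do not appear in the original code.

```python
import random


def shuffled_split(data, target, split_percent=0.9, seed=42):
    """Shuffle documents and labels together, then split them.

    `shuffled_split` and `seed` are hypothetical names used for illustration.
    Returns (x_train, x_test, y_train, y_test) as plain lists.
    """
    if len(data) != len(target):
        raise ValueError("data and target must have the same length")

    # Shuffle the indices so every document stays paired with its label.
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)

    split_size = int(len(indices) * split_percent)
    train_idx, test_idx = indices[:split_size], indices[split_size:]

    x_train = [data[i] for i in train_idx]
    x_test = [data[i] for i in test_idx]
    y_train = [target[i] for i in train_idx]
    y_test = [target[i] for i in test_idx]
    return x_train, x_test, y_train, y_test
```

It would be called as shuffled_split(news.data, news.target, SPLIT_PERCENT). Using a fixed seed means repeated runs produce the same split.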
Before execution, copy the file 'nepali' into the corpora/stopwords directory of your nltk_data folder.
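The copy step can also be scripted. The sketch below is one possible way to do it with the standard library. The helper name install_stopwords is hypothetical, and the location of your NLTK data folder is an assumption you must adjust for your machine.

```python
import shutil
from pathlib import Path


def install_stopwords(source_file, nltk_data_dir):
    """Copy a stop-word file into <nltk_data_dir>/corpora/stopwords/.

    `install_stopwords` is a hypothetical helper for illustration only.
    Returns the path of the copied file.
    """
    source = Path(source_file)
    target_dir = Path(nltk_data_dir) / "corpora" / "stopwords"
    # Create the directory if the stopwords corpus has not been set up yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / source.name
    shutil.copyfile(source, destination)
    return destination

# Example call (both paths are placeholders; adjust them to your setup):
# install_stopwords("nepali", Path.home() / "nltk_data")
```

Once the file is in place, nltk.corpus.stopwords.words('nepali') should be able to read it, provided NLTK searches that data folder.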
This '16NepaliNews' corpus is licensed under the GPLv3.