ashim95 / Malayalam-Newspaper-Article-Dataset

The project scrapes articles from a Malayalam newspaper website to create a corpus. A set of queries is created and the corresponding ground-truth answers are retrieved. The result can be used as a dataset for testing future tools such as a Malayalam stemmer, stopword removal, lemmatizers, etc.

Malayalam-Newspaper-Article-Dataset

This project scrapes articles from the website of a Malayalam newspaper (Janmabhumi) to create a corpus of news articles. A set of queries is also created, and the corresponding ground-truth answers are retrieved using a combination of the BM25 and tf-idf methods. The dataset can be useful for building tools such as stemmers, stopword removers, lemmatizers, etc.
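
As a rough illustration of how BM25 and tf-idf scores could be combined to rank articles for a query, here is a minimal sketch. It is not the code used to build this dataset: the whitespace tokenizer, the max-normalisation, the equal weighting, and all function names (`bm25_scores`, `tfidf_scores`, `combined_ranking`) are assumptions made for the example, and the sample documents are English placeholders.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace splitting. Real Malayalam text would
    # likely need a proper tokenizer.
    return text.split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for the query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def tfidf_scores(query, docs):
    """Sum of (relative term frequency * idf) over the query terms."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = sum(
            (tf[term] / len(tokens)) * math.log(n_docs / df[term])
            for term in tokenize(query)
            if term in tf
        )
        scores.append(score)
    return scores


def normalize(scores):
    # Scale scores to [0, 1] so the two methods are comparable.
    top = max(scores, default=0.0)
    return [s / top if top > 0 else 0.0 for s in scores]


def combined_ranking(query, docs, weight=0.5):
    """Document indices, best first, by a weighted mix of both scores."""
    bm25 = normalize(bm25_scores(query, docs))
    tfidf = normalize(tfidf_scores(query, docs))
    combined = [weight * x + (1 - weight) * y for x, y in zip(bm25, tfidf)]
    return sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)


docs = [
    "kerala election results",
    "cricket match report",
    "kerala flood relief kerala",
]
ranking = combined_ranking("kerala flood", docs)
```

In this sketch the top-ranked indices for a query would play the role of its ground-truth answers; how many results to keep and how to weight the two methods are choices the original project may have made differently.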

DATASET

Download the dataset directly from Dropbox

OR

Execution

Open the terminal (Ctrl+Alt+T) and execute the given commands

git clone https://github.com/ABHISHEKVALSAN/Malayalam-Newspaper-Article-Dataset 
cd Malayalam-Newspaper-Article-Dataset
mkdir DataSet 
pip install -r requirements.txt 
python3 MalayalamScraping.py 

PS

  1. After running the last command, you'll see files being created in the DataSet directory.
  2. Many URLs have missing files. This is expected.
  3. The scraping is website-specific and hence does not work for other newspaper sites.
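
The note about missing files can be pictured with a small sketch of a fetch loop that records failed URLs and moves on. This is an illustration only and not the logic of MalayalamScraping.py: the helper names `fetch_html` and `collect_articles` are hypothetical, and the sketch uses only the standard library.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the page cannot be fetched."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # OSError covers urllib's URLError/HTTPError and socket timeouts;
        # ValueError covers malformed URLs.
        return None


def collect_articles(urls, fetch=fetch_html):
    """Fetch every URL, skipping the ones that fail.

    Returns a dict of url -> html for pages that were fetched and a list
    of the URLs that were missing.
    """
    saved, missing = {}, []
    for url in urls:
        html = fetch(url)
        if html is None:
            missing.append(url)
            continue
        saved[url] = html
    return saved, missing
```

Passing the fetch function as a parameter keeps the loop testable without network access, for example `collect_articles(urls, fetch=some_dict.get)`.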

Project Requirements

  1. Python 3
  2. pip
  3. Internet connection

Contact me at the email given below for assistance, or raise an issue.

Email : abhiavk@iitk.ac.in

Related Works

A similar repo with a Telugu dataset can be found here.

Languages

Language:Python 100.0%