crawler news tasnim text-classification machine-learning python scrapy

Crawler

An open-source crawler for Persian websites. Websites crawled so far:

Asriran

asriran/run_asriran.sh

You can change some parameters of this crawler; see run_asriran.sh.

Fa-Wikipedia

Due to some problems in crawling, I split this job into two stages: first crawl all index pages, then use those pages for the actual crawl.

wikipedia/run_wikipedia.sh
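
The two-stage idea can be illustrated with a minimal, standard-library sketch. This is not the repository's Scrapy code: `LinkCollector`, `collect_article_urls`, `crawl_articles`, and the caller-supplied `fetch` function are all hypothetical names introduced here for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def collect_article_urls(index_urls, fetch):
    """Stage 1: visit each index page and gather the URLs it links to.

    `fetch` is an assumed callable that takes a URL and returns HTML text.
    Duplicates are dropped while the first-seen order is preserved.
    """
    seen, ordered = set(), []
    for index_url in index_urls:
        parser = LinkCollector()
        parser.feed(fetch(index_url))
        for href in parser.links:
            url = urljoin(index_url, href)  # resolve relative links
            if url not in seen:
                seen.add(url)
                ordered.append(url)
    return ordered


def crawl_articles(article_urls, fetch):
    """Stage 2: download every URL found in stage 1."""
    return {url: fetch(url) for url in article_urls}
```

Keeping the stages separate means the list of URLs from stage 1 can be saved and reused, so a failure during stage 2 does not require walking the index pages again.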

Tasnim News

This crawler saves Tasnim news pages by category. This is appropriate for text classification tasks, as the data is relatively balanced across all categories: I selected an equal number of pages per category.

A parameter called Number_of_pages in tasnim.py controls how many pages are crawled in each category.

tasnim/run_tasnim.sh
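
As a rough sketch of how a per-category page limit keeps the data balanced, the snippet below builds the same number of listing URLs for each category. The URL pattern, the category names, and the helper `build_start_urls` are assumptions made for illustration; they are not taken from tasnim.py.

```python
NUMBER_OF_PAGES = 3  # stand-in for Number_of_pages in tasnim.py

# Hypothetical URL pattern; the real Tasnim listing URLs may look different.
URL_TEMPLATE = "https://example.com/{category}?page={page}"


def build_start_urls(categories, number_of_pages=NUMBER_OF_PAGES):
    """Return (category, url) pairs with the same page count per category."""
    return [
        (category, URL_TEMPLATE.format(category=category, page=page))
        for category in categories
        for page in range(1, number_of_pages + 1)
    ]
```

Because every category gets exactly `number_of_pages` listing pages, raising or lowering that one value scales the dataset without changing the balance between classes.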

All datasets are available for download on Kaggle.

CSS selectors were mostly extracted with Copy Css Selector.

About

https://www.kaggle.com/amirpourmand/datasets

MIT License
