elsayed-issa / The-Moroccan-News-Corpus

The-Moroccan-News-Corpus

This corpus was built as part of a summer internship project. We use Scrapy spiders/crawlers to crawl Moroccan newspaper websites and save the scraped data to JSON or text files. We built spiders/crawlers for the following news websites:

Moroccan News websites

  • http://ahdath.info/
  • https://www.akhbarona.com/
  • https://www.alayam24.com/
  • https://www.almaghribtoday.net/
  • https://www.barlamane.com/
  • https://dalil-rif.com/
  • https://www.febrayer.com/
  • https://www.goud.ma/
  • https://www.hespress.com/
  • https://ar.hibapress.com/
  • http://kifache.com/
  • www.maghress.com
  • https://www.menara.ma/
  • https://www.almaghreb24.com/
  • https://maroctelegraph.com/
  • https://www.nadorcity.com/
  • https://tanja24.com/
  • http://telexpresse.com/
  • http://ar.le360.ma/
  • http://www.alyaoum24.com/
  • http://www.2m.ma/ar/
  • https://ar.yabiladi.com/
How to use the spiders/crawlers?

    Every folder in this repository is the Scrapy project folder for one newspaper.
    To scrape data from any of the newspapers above:
  • Download its project folder.
  • On the command line, change directory to the project folder.
  • Run the following command to start scraping the website: scrapy crawl <name of the spider> -o <name of the file>.json
Note

    Every spider/crawler automatically saves a text file, in addition to the JSON or XML file that you specify when you run the spider on the command line.
    About 2 gigabytes of scraped text can be downloaded here: https://drive.google.com/open?id=1w2-DTJF2phU3fVf4XkDh1tsN-O3N_baF
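In Scrapy, the usual way to write a text file alongside the -o feed is an item pipeline. The sketch below is one possible implementation, not the code in this repository, and the item field names ('title', 'body') are assumptions:

```python
class TextFilePipeline:
    """Sketch: append each scraped article to a single .txt file,
    alongside whatever feed file (-o file.json) Scrapy writes."""

    def __init__(self, path="corpus.txt"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts; open the output in append mode.
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # 'title' and 'body' are hypothetical field names.
        self.file.write(item.get("title", "") + "\n")
        self.file.write(item.get("body", "") + "\n\n")
        return item
```

A pipeline like this would be enabled through the ITEM_PIPELINES setting in the project's settings.py.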
