The-Moroccan-News-Corpus
This corpus was built during a summer internship. We used Scrapy spiders/crawlers to crawl Moroccan newspaper websites and saved the scraped data to JSON or text files. We built spiders/crawlers for the following news websites:
Moroccan News websites
Each folder is the project folder for one newspaper.
How to use spiders/crawlers?
To scrape data from any of the newspapers above, run the following command from inside that newspaper's project folder:
scrapy crawl <name of the spider> -o <name of the file>.json
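If you want to launch crawls from a script rather than typing the command by hand, the following is a minimal sketch of one way to do it in Python. The helper names `build_crawl_command` and `run_crawl`, the spider name `example_spider`, and the folder name `example_newspaper` are placeholders chosen for illustration and are not taken from this repository. The sketch assumes that `scrapy` is installed and on your PATH, and it only accepts the JSON and XML outputs this README mentions.

```python
import subprocess
from pathlib import Path

# Output formats named in this README; Scrapy infers the format from the extension.
ALLOWED_EXTENSIONS = {".json", ".xml"}


def build_crawl_command(spider_name, output_file):
    """Return the argument list for `scrapy crawl <spider> -o <file>`."""
    if not spider_name.strip():
        raise ValueError("spider name must not be empty")
    extension = Path(output_file).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported output extension: {extension!r}")
    return ["scrapy", "crawl", spider_name, "-o", output_file]


def run_crawl(spider_name, output_file, project_dir):
    """Run the crawl with the newspaper's project folder as working directory."""
    command = build_crawl_command(spider_name, output_file)
    # check=True raises CalledProcessError if scrapy exits with an error.
    subprocess.run(command, cwd=project_dir, check=True)


# Placeholder names only: replace them with a real spider and output file.
print(" ".join(build_crawl_command("example_spider", "articles.json")))
# prints: scrapy crawl example_spider -o articles.json
```

To actually start a crawl, you would call something like `run_crawl("example_spider", "articles.json", project_dir="example_newspaper")` with the real spider and folder names substituted.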
Every spider/crawler automatically saves a text file, in addition to the JSON or XML file that you specify when you run the spider from the command line.
Note
About 2 gigabytes of text can be downloaded from this link: https://drive.google.com/open?id=1w2-DTJF2phU3fVf4XkDh1tsN-O3N_baF