elsayed-issa / The-Moroccan-News-Corpus

The-Moroccan-News-Corpus

This corpus was built as part of a summer internship project. We use Scrapy spiders/crawlers to crawl Moroccan newspaper websites and save the scraped data to JSON or text files. We built spiders/crawlers for the following news websites:

Moroccan News websites

  • http://ahdath.info/
  • https://www.akhbarona.com/
  • https://www.alayam24.com/
  • https://www.almaghribtoday.net/
  • https://www.barlamane.com/
  • https://dalil-rif.com/
  • https://www.febrayer.com/
  • https://www.goud.ma/
  • https://www.hespress.com/
  • https://ar.hibapress.com/
  • http://kifache.com/
  • www.maghress.com
  • https://www.menara.ma/
  • https://www.almaghreb24.com/
  • https://maroctelegraph.com/
  • https://www.nadorcity.com/
  • https://tanja24.com/
  • http://telexpresse.com/
  • http://ar.le360.ma/
  • http://www.alyaoum24.com/
  • http://www.2m.ma/ar/
  • https://ar.yabiladi.com/
How to use the spiders/crawlers?

    Every folder in this repository is the Scrapy project folder for one newspaper.
    To scrape data from any of the newspapers above:
  • Download its project folder.
  • On the command line, change directory to the project folder.
  • Run the following command to start scraping the website: scrapy crawl <name of the spider> -o <name of the file>.json
Note

    Every spider/crawler automatically saves a text file, in addition to the JSON or XML file that you specify when you run the spider on the command line.
    About 2 gigabytes of scraped text can be downloaded here: https://drive.google.com/open?id=1w2-DTJF2phU3fVf4XkDh1tsN-O3N_baF
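In Scrapy, the usual way to write a text file alongside the -o feed is an item pipeline. The sketch below is one possible implementation, not the code in this repository, and the item field names ('title', 'body') are assumptions:

```python
class TextFilePipeline:
    """Sketch: append each scraped article to a single .txt file,
    alongside whatever feed file (-o file.json) Scrapy writes."""

    def __init__(self, path="corpus.txt"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts; open the output in append mode.
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # 'title' and 'body' are hypothetical field names.
        self.file.write(item.get("title", "") + "\n")
        self.file.write(item.get("body", "") + "\n\n")
        return item
```

A pipeline like this would be enabled through the ITEM_PIPELINES setting in the project's settings.py.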
