moocs learning-analytics dicussion-forum education-technology education-data-mining coursera coursera-discussion-forums natural-language-processing web-scraping corpus-data nlp-corpus

Coursera-Crawler

A crawler to scrape Coursera's discussion forum.

High-level code flow is documented here.

Environment setup

1.1 Python 2.7 (we recommend setting it up with Anaconda)

1.2 Install the packages specified in requirements.txt
```
   pip install -r requirements.txt
```

For non-root users, please refer to "PycURL" and "Non-root users" for how to use pip to set up your environment.

1.3 Download PhantomJS (here for Windows, here for macOS, and here for 64-bit Linux; other references), install it, and set 'phantomjsPath' in config.yml to the path of its executable.
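
Before running the crawler, it can help to confirm that the path you put in 'phantomjsPath' really points to an executable file. The following is a minimal sketch using only the standard library; the function name is_executable and the example path are hypothetical and not part of this repository.

```python
import os


def is_executable(path):
    """Return True if `path` is an existing file the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


if __name__ == "__main__":
    # Hypothetical location; use the value you set for 'phantomjsPath'.
    phantomjs_path = "/usr/local/bin/phantomjs"
    if is_executable(phantomjs_path):
        print("PhantomJS found at " + phantomjs_path)
    else:
        print("No executable found at " + phantomjs_path)
```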

config.yml

<1> "UserName" is the username of your Coursera account;

<2> "Password" is the password of your Coursera account;

<3> "UserId" is the ID that identifies your Coursera account.
```
 First, log in to the Coursera website with your account. Then press F12 to open the browser developer tools, select "Network", and choose XHR.
 
 You will see an API link "https://www.coursera.org/api/openCourseMemberships.v1/?q=findByUser&userId=XXX"; your user ID is "XXX".
```
<4> "filePath" is the path where the crawled data is saved (choose or create your preferred folder).

<5> "activeCoursePageNum" is the maximum number of pages of your "Last Active" courses to crawl. (When you log in, you will see "My Courses", which includes "Last Active" and "Inactive".)

<6> "inactiveCoursePageNum" is the maximum number of pages of your "Inactive" courses to crawl.

<7> "phantomjsPath" is the path to the PhantomJS executable.
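
The settings above can be sanity-checked before a crawl. The sketch below is illustrative only: extract_user_id, missing_keys and REQUIRED_KEYS are hypothetical names, and it assumes config.yml has already been loaded into a flat dictionary (for example with PyYAML's yaml.safe_load, which may or may not be what the crawler itself uses). It pulls the user ID out of the API link described in <3> and reports which of the seven keys are missing.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs  # Python 2.7

# The seven keys described for config.yml.
REQUIRED_KEYS = [
    "UserName", "Password", "UserId", "filePath",
    "activeCoursePageNum", "inactiveCoursePageNum", "phantomjsPath",
]


def extract_user_id(api_url):
    """Return the value of the userId query parameter, or None if it is absent."""
    values = parse_qs(urlparse(api_url).query).get("userId")
    return values[0] if values else None


def missing_keys(config):
    """Return the required keys that are absent from the config dictionary."""
    return [key for key in REQUIRED_KEYS if key not in config]


if __name__ == "__main__":
    # Hypothetical link copied from the browser's Network tab.
    url = ("https://www.coursera.org/api/openCourseMemberships.v1/"
           "?q=findByUser&userId=12345")
    print(extract_user_id(url))
    print(missing_keys({"UserName": "me", "Password": "secret"}))
```

Checking for missing keys up front gives a clearer error than a failure partway through a long crawl.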

CourseraScraper.py

Run python CourseraScraper.py to start crawling. The crawled data is saved in one folder per course, named courseName_courseID_crawlTime, where crawlTime uses the date format %Y_%m_%d.
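
To make the folder naming concrete, here is a small sketch of how such a name could be built. The function course_folder_name and the sample course name and ID are hypothetical; the crawler may clean up or build the name differently.

```python
import datetime


def course_folder_name(course_name, course_id, crawl_time):
    """Build courseName_courseID_crawlTime, with crawlTime formatted as %Y_%m_%d."""
    return "{0}_{1}_{2}".format(
        course_name, course_id, crawl_time.strftime("%Y_%m_%d"))


if __name__ == "__main__":
    # Hypothetical course name and ID, for illustration only.
    print(course_folder_name("machine-learning", "abc123",
                             datetime.date(2017, 3, 5)))
    # prints machine-learning_abc123_2017_03_05
```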
