Coursera Search is a system that lets users search for Coursera courses by title. It scrapes course data from https://www.coursera.org/courses/ and stores it in a SQLite database. Each course record includes the title, partner, rating, number of ratings, enrollment number, and difficulty level. To search, the user enters a title keyword in the web app, clicks the search button, and browses the results.
- Python 3.8.2
- Scrapy 1.8.0
- SQLite 3.28.0
- R 3.6.1
- shiny 1.4.0
- shinyjs 1.0
- RSQLite 2.2.0
- DBI 1.1.0
- shinythemes 1.1.2
- dplyr 0.8.3
- tidytext 0.2.2
- tm 0.7.7
- shinyBS 0.61
- course_getter: Contains the Scrapy project
  a/ spiders:
     courses.py: Initiate HTTP requests, specify fields to be scraped
  b/ items.py: Define the model for scraped items
  c/ middlewares.py: Define the model for spider middleware
  d/ pipelines.py: Create the SQLite database and store scraped data in it
  e/ settings.py: Scrapy settings for the course_getter project
- responses:
  a/ courses.html: Manually downloaded from https://www.coursera.org/, used to create fake HTML responses for unit testing
  b/ response.py: Create a fake Scrapy HTTP response from an HTML file
- search_app:
  a/ tests: Generated snapshot testing scripts for the RShiny app
  b/ covr.R: Report test coverage for unit testing of helper functions (test_search_app.R)
  c/ helper_fct.R: Helper functions for the RShiny app
  d/ server.R: Server class of the RShiny app
  e/ test_search_app.R: Unit test script for helper_fct.R
  f/ ui.R: UI class of the RShiny app
- .travis.yml: Script for Travis CI test.
- DESCRIPTION: R packages required for the system to execute.
- process.sh: Script to automate the project
- requirements.txt: Python packages required for the system to execute.
- scrapy.cfg: Config file for Scrapy project
- test_spider.py: Unit testing script for Scrapy spider
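As an illustration of the storage step handled by pipelines.py, the following is a minimal sketch of a Scrapy-style item pipeline that writes items to SQLite. The class name, table name, and column names are assumptions made for this example and are not taken from the project; only the default path mirrors the DB_SETTINGS value in course_getter/settings.py. The methods open_spider, process_item, and close_spider are the hooks Scrapy calls on a pipeline.

```python
import sqlite3


class SQLitePipelineSketch:
    """Hypothetical stand-in for the pipeline in pipelines.py."""

    def __init__(self, db_path="search_app/coursera.db"):
        # Path of the SQLite file; ":memory:" also works for quick experiments.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the database and
        # create the table if it does not exist yet.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            """
            CREATE TABLE IF NOT EXISTS courses (
                title TEXT,
                partner TEXT,
                rating TEXT,
                rating_count TEXT,
                enrollment TEXT,
                level TEXT
            )
            """
        )

    def process_item(self, item, spider):
        # Called for every scraped item: insert one row and pass the item on.
        self.conn.execute(
            "INSERT INTO courses VALUES "
            "(:title, :partner, :rating, :rating_count, :enrollment, :level)",
            dict(item),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()
```

All columns are stored as text here because the scraped values arrive as strings; converting ratings or enrollment numbers to numeric types would be a separate design decision.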
- Install required packages:
$ pip install -r requirements.txt
- Run the project:
$ sh process.sh
- Click on the localhost link (for example, http://127.0.0.1:5893) to access the web app
- This app only allows the user to enter one keyword.
  - Proposal: update CreateCorpus() and TitleScore() in helper_fct.R so that they treat the query as a list of keywords instead of a single character string.
- This app only allows the user to search by title.
  - Proposal: update CreateCorpus() in helper_fct.R so that it includes fields other than 'title'.
- Include a code coverage report for test_search_app.R in Travis CI. Coverage is currently 100%, but I have yet to find a way to run the coverage report on test_search_app.R from the command line.
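For the multi-keyword proposal above, the idea can be sketched as follows. CreateCorpus() and TitleScore() are R functions whose internals are not shown in this README, so this is written in Python purely as an illustration and is not a port of them. The function names, the tokenizer, and the scoring rule (count of distinct query terms found in a title) are all assumptions.

```python
import re


def tokenize(text):
    # Lowercase the text and split it into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def title_scores(titles, query):
    # Accept either a single keyword string or a list of keywords
    # (hypothetical behavior, not the current app's).
    keywords = [query] if isinstance(query, str) else list(query)
    terms = {token for keyword in keywords for token in tokenize(keyword)}

    scored = []
    for title in titles:
        # Score = number of distinct query terms that occur in the title.
        score = len(terms & set(tokenize(title)))
        if score > 0:
            scored.append((title, score))

    # Highest score first; ties keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A single string still works, so existing one-keyword searches would behave as before under this scheme.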
- Scenario 1: When www.coursera.org/courses changes its page layout
  - Proposal: Change the content extraction paths in the parse function in courses.py to match the new layout:
titles = response.xpath('//h2[@class="color-primary-text card-title headline-1-text"]/text()').getall()
partners = response.css("span.partner-name::text").getall()
ratings = response.css("span.ratings-text::text").getall()
count = response.xpath('//span[@class="ratings-count"]/span/text()').getall()
enrollment = response.css("span.enrollment-number::text").getall()
level = response.css("span.difficulty::text").getall()
- Scenario 2: When Coursera moves to another website, or when this tool is used on another page
  - Apply the changes from Scenario 1 so that the extraction paths match the new page
  - In process.sh, uncomment the following line and change the address:
# Scrape courses from a custom page in Coursera (uncomment and change address to use)
scrapy crawl courses -a address=https://www.coursera.org/courses
- Scenario 3: Store the database in a file other than coursera.db
  - In course_getter/settings.py, change the name of the database file (keep the 'search_app/' directory prefix):
DB_SETTINGS = {
'db':"search_app/coursera.db"
}
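For Scenario 1, each selector in the parse function returns a separate list via getall(), so a layout change typically shows up as empty or mismatched lists. The sketch below shows one way to combine those lists into per-course records and fail loudly when they no longer line up. The function name, the dictionary keys, and the strict length check are assumptions for this example; how courses.py actually assembles its items is not shown here, and the check may be too strict if some course cards legitimately lack a field.

```python
def build_course_records(titles, partners, ratings, counts, enrollments, levels):
    # Map each (hypothetical) field name to the list extracted for it.
    fields = {
        "title": titles,
        "partner": partners,
        "rating": ratings,
        "rating_count": counts,
        "enrollment": enrollments,
        "level": levels,
    }

    if not titles:
        # Nothing matched: the title selector probably needs updating.
        raise ValueError("no titles extracted; the page layout may have changed")

    lengths = {name: len(values) for name, values in fields.items()}
    if len(set(lengths.values())) != 1:
        # Some selectors matched more elements than others.
        raise ValueError(f"field lists differ in length: {lengths}")

    # Pair up the i-th element of every list into one record per course.
    return [dict(zip(fields, row)) for row in zip(*fields.values())]
```

Running a check like this against the saved responses/courses.html page is one way to notice a stale selector before it produces an empty database.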
- Name: Chau Pham
- Email: chautm.pham@gmail.com