EricWay1024 / uCourse-crawler

🎒 Scrape the courses info from the University of Nottingham's website.

uCourse-crawler

🎒 Scrape the courses info from the University of Nottingham's website. (Different campuses and academic years supported.)

This fork is modified to fit the needs of Nott Course:

What have I done?

  • Included complete information of the course page;
  • Added the scraper for academic plans, with fully parsed plan structures;
  • Adapted the project to run concurrently using puppeteer-cluster;
  • Replaced MongoDB support with a converter from JSON to SQLite (for performance reasons).

Requirements

  • Node.js
  • Python

Usage

Note that for performance reasons, no file will be written until all courses/plans are obtained.
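The write-once behaviour noted above can be pictured with a minimal Python sketch. This is not the repository's code (the scrapers themselves are Node.js scripts); `scrape_all`, `fetch_item` and the record shape are hypothetical names used only for illustration.

```python
import json
from pathlib import Path


def scrape_all(item_ids, fetch_item, out_path):
    """Collect every record in memory, then write the file once at the end.

    `fetch_item` is a hypothetical callable standing in for one page scrape.
    """
    results = []
    for item_id in item_ids:
        results.append(fetch_item(item_id))  # nothing touches the disk here

    # Single write after all items are obtained. If a fetch raises above,
    # this point is never reached and no output file is produced.
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(results, ensure_ascii=False), encoding="utf-8")
    return len(results)
```

The trade-off of this pattern is that an interrupted run leaves no partial output, so a failed crawl has to be restarted from the beginning.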

git clone https://github.com/EricWay1024/uCourse-crawler
cd uCourse-crawler
mkdir dist
npm i
pip3 install pandas numpy

Set your university user name and password:

printf 'USER_ID="your_user_id"\nUSER_PASSWORD="your_password"\n' > .env

Note that although all information on the resource website can be accessed without logging in, the site now occasionally requires authentication. Keep your password safe: do not commit or share the .env file.
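Before launching a scraper, it may help to confirm that the .env file really contains both variables on separate lines. The sketch below is a standalone check and an assumption on my part: the README does not say how the Node.js scripts read .env, and `read_env` and `missing_credentials` are hypothetical helper names.

```python
from pathlib import Path


def read_env(path=".env"):
    """Parse simple KEY=value lines, ignoring blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_credentials(path=".env"):
    """Return the required keys that are absent or empty in the file."""
    env = read_env(path)
    return [key for key in ("USER_ID", "USER_PASSWORD") if not env.get(key)]
```

An empty list from `missing_credentials()` means both keys were found; a file where the two assignments ended up on one line would report `USER_PASSWORD` as missing.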

Launch the course scraper:

node course.js

(This will save the result to ./dist/courseData.json.)

Launch the plan scraper:

node plan.js

(This will save the result to ./dist/planData.json. A file ./dist/plans.json containing all possible plan UCAS codes will also be created in the process.)

Convert to SQLite:

python3 find_deg.py
python3 to_sqlite.py

(The first script infers the degree type of every plan and adds it to the plan objects. The second script then saves the data to ./dist/data.db.)
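The JSON-to-SQLite idea can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the contents of to_sqlite.py (which relies on pandas): it assumes the JSON file is a list of objects that each carry a `code` field, and the function name, table name and column layout are all hypothetical.

```python
import json
import sqlite3


def json_to_sqlite(json_path, db_path, table):
    """Store each JSON object as one row: a key column plus the raw JSON text."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)  # assumed: a list of objects with a "code" key

    conn = sqlite3.connect(db_path)
    try:
        # `table` is interpolated into the SQL, so it must be a trusted name.
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{table}" (code TEXT PRIMARY KEY, data TEXT)'
        )
        conn.executemany(
            f'INSERT OR REPLACE INTO "{table}" (code, data) VALUES (?, ?)',
            [(r["code"], json.dumps(r, ensure_ascii=False)) for r in records],
        )
        conn.commit()
    finally:
        conn.close()
    return len(records)
```

Keeping the full record as JSON text sidesteps nested fields that do not map cleanly onto columns, at the cost of parsing the text again whenever a row is read.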

Output (JSON file)

The output is written in JSON format to ./dist/[tablename].json.

Output example:

[image: output-json]
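To get a quick look at a generated file without opening it in an editor, a small Python helper along these lines may be enough. The top-level shape of the JSON is not documented in this README, so the sketch accepts either a list of records or an object keyed by identifier; `summarize` is a hypothetical name.

```python
import json


def summarize(json_path):
    """Return (number of top-level entries, sorted field names of the first entry)."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)

    # The top-level shape is not documented, so handle both a list of
    # records and an object keyed by some identifier.
    entries = list(data.values()) if isinstance(data, dict) else list(data)
    first = entries[0] if entries else {}
    fields = sorted(first) if isinstance(first, dict) else []
    return len(entries), fields
```

For example, `summarize("dist/courseData.json")` would report how many entries were crawled and which fields the first one carries.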

Size

The estimated output size is 50 to 60 MB when both courses and plans are crawled for one campus and one academic year.

Todo

Resources

Disclaimer

The University of Nottingham holds the copyright to all data on its website. This crawler is intended for general information purposes only.

About

License: MIT License


Languages

JavaScript 92.1%, Python 7.9%