Scrape course information from Australian universities. Note that some of them call programs "courses" and call courses "units". Potato, potahto, same thing.
The result is saved in `aus-uni.json`.
To install RSelenium via Docker, please refer to its installation guide.
To start Selenium, use the following command:
docker run -d -p 4445:4444 selenium/standalone-firefox
To stop Selenium, use this:
docker stop [container_id]
- ANU
- University of Melbourne
- University of Sydney, source 2
- University of New South Wales
- University of Queensland
- Monash University
- University of Adelaide
- University of Western Australia
- RMIT
- Deakin University
- Victoria University
- Bond University
- Curtin University
- Griffith University
- Murdoch University
- Southern Cross University
- University of Newcastle
- University of Wollongong
- University of the Sunshine Coast
- Macquarie University
Data from the following unis was scraped by my colleague:
- University of Technology Sydney
- University of Canberra
- Queensland University of Technology
- La Trobe University
- University of Tasmania
The content is inserted via JavaScript, so Selenium is required to render the page before any real data can be extracted.
To get to the right page for scraping, a user needs to click on Courses in the middle of the page, then scroll to the bottom and click on Show all results. Otherwise, even if you can "see" all the results, no data will be extracted. (You will probably get a `TypeError: rect is undefined`. This is likely because a selector (XPath, CSS, etc.) that is correct for the rendered page is being used to find targets before the page has rendered.)
Execute the commands in `ANU.R`, and a collection of courses will be stored in JSON format in the `data` directory.
The webpage showing all course info is paginated. Initially, I thought we needed to do these extra steps:
- Accept cookies;
- Get total page numbers;
- Navigate automatically.
With a popup window telling me to accept cookies, I turned to Selenium again for help. But it was really painful to deal with an error caused by `remDr$navigate(new_url)`, especially as it kept displaying an `UnknownError` message. Well, the ending of this story is not too bad, as I realized I could just leave the popup window hanging. Screw Selenium!
Different error every time.
Error message 1:
Selenium message:Browsing context has been discarded
Build info: version: '3.141.59', revision: 'e82be7d358', time: '2018-11-14T08:25:53'
System info: host: '7de870fafffd', ip: '172.17.0.2', os.name: 'Linux', os.arch: 'amd64', os.version: '4.9.184-linuxkit', java.version: '1.8.0_222'
Driver info: driver.version: unknown
Error: Summary: NoSuchWindow
Detail: A request to switch to a different window could not be satisfied because the window could not be found.
class: org.openqa.selenium.NoSuchWindowException
Further Details: run errorDetails method
Error message 2:
Selenium message:Unable to locate element: #b-js-course-search-results-uos > div:nth-child(1) > div:nth-child(3) > a:nth-child(7)
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html
Build info: version: '3.141.59', revision: 'e82be7d358', time: '2018-11-14T08:25:53'
System info: host: '7de870fafffd', ip: '172.17.0.2', os.name: 'Linux', os.arch: 'amd64', os.version: '4.9.184-linuxkit', java.version: '1.8.0_222'
Driver info: driver.version: unknown
Error: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
class: org.openqa.selenium.NoSuchElementException
Further Details: run errorDetails method
Turns out JavaScript takes a while to load on their website. Adding a short pause after each click is a workable solution.
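The pause after each click could be sketched roughly as below. This is a minimal sketch, not the exact code used in this repo: `with_pause` and its 2-second default are made up for illustration, and the commented usage assumes an open RSelenium session named `remDr` and a placeholder CSS selector.

```r
# Hypothetical helper: run an action (e.g. a click), then pause so that
# JavaScript has time to render before the next step.
with_pause <- function(action, pause_secs = 2) {
  result <- action()
  Sys.sleep(pause_secs)
  invisible(result)
}

# Usage with RSelenium (assumes `remDr` is an open remoteDriver session;
# "a.next-page" is a placeholder selector, not the site's real one):
# with_pause(function() {
#   remDr$findElement(using = "css selector", "a.next-page")$clickElement()
# })
```

A fixed pause is crude but easy to reason about; the right length depends on how slow the site is on a given day.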
At some point, the data on this site changed, so I wrote a new crawler to replace the previous one. Selenium was still used in the process.
But the problem remains, as some pages contain "nothing" to scrape. Eventually, I came up with a manual "try-catch" as a potential solution: if nothing is scraped, repeat until something is retrieved.
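A rough sketch of that manual "try-catch" is below. `scrape_with_retry`, `max_tries` and `wait_secs` are hypothetical names, and the cap on attempts is an added assumption so that a page that is truly empty cannot loop forever.

```r
# Hypothetical helper: call `scrape_fn` until it returns something non-empty,
# or give up after `max_tries` attempts (the cap is an assumption).
scrape_with_retry <- function(scrape_fn, max_tries = 5, wait_secs = 1) {
  for (i in seq_len(max_tries)) {
    # Treat an error the same way as an empty result.
    result <- tryCatch(scrape_fn(), error = function(e) NULL)
    # length() suits vectors and lists; for a data frame, check nrow() instead.
    if (length(result) > 0) {
      return(result)
    }
    message("Attempt ", i, " returned nothing; retrying...")
    if (i < max_tries) Sys.sleep(wait_secs)
  }
  NULL
}
```

The function passed in would wrap whatever Selenium calls extract the data from the current page; a `NULL` return means every attempt came back empty.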
It is quite weird that none of the financial units could be seen in the list, even though they did exist in the search results. To get out of this paradox, I found another page which contains only an HTML table. I scraped it and took the union of today's data and the previous data. Done.
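Taking the union of two scrapes could look something like this. It is only a sketch: the column names `code` and `title` are assumptions about the data layout, and both data frames are assumed to share the same columns.

```r
# Hypothetical helper: combine two scrapes and keep one row per unit code.
union_units <- function(current, previous) {
  combined <- rbind(current, previous)
  # duplicated() flags later repeats, so rows from `current` win on a clash.
  combined[!duplicated(combined$code), , drop = FALSE]
}

today <- data.frame(code = c("A1", "B2"), title = c("Alpha", "Beta"),
                    stringsAsFactors = FALSE)
table_page <- data.frame(code = c("B2", "F3"), title = c("Beta (old)", "Finance"),
                         stringsAsFactors = FALSE)
all_units <- union_units(today, table_page)
```

Putting the newer scrape first means its version of a unit is kept whenever the same code shows up in both sources.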
The only thing worth mentioning is that different subject pages may contain a different number of "sections". Some subject areas have three sections: undergraduate, postgraduate and research. But some have only one.
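One way to avoid hard-coding the number of sections is to loop over whatever sections a page happens to have. The sketch below assumes each page has already been parsed into a named list of unit codes per section; `flatten_sections` and that list layout are hypothetical, not the structure used in this repo.

```r
# Hypothetical helper: turn a named list of sections into one data frame,
# regardless of whether the page has one section or three.
flatten_sections <- function(sections) {
  rows <- lapply(names(sections), function(name) {
    units <- sections[[name]]
    data.frame(section = rep(name, length(units)),
               unit = units,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

three <- list(undergraduate = c("U1", "U2"), postgraduate = "P1", research = "R1")
one <- list(undergraduate = "U9")
flat_three <- flatten_sections(three)
flat_one <- flatten_sections(one)
```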
This one is straightforward and tricky at the same time. The target URLs to scrape do not all follow the same form. Pay attention to this.
There are 3 different types of program pages:
- The normal ones with normal urls to course lists;
- The ones with different urls to course lists;
- The ones without any detailed course lists.
Additionally, some program pages link to an existing course list page that contains nothing substantial.
Starting from the program "Agribusiness", the course list URL shifts from https://my.uq.edu.au/programs-courses/plan_display.html?acad_plan=[program-code] to https://my.uq.edu.au/programs-courses/program_list.html?acad_prog=[program-code]. So it is probably better to get the exact URL from the previous page instead of "putting it together". Also, `major_data` is different in the second type of webpage.
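Picking the exact URL off the previous page could be sketched as below. This assumes the link `href`s have already been collected from the program page (for example with an HTML parser); `find_course_list_url` and `page_type` are hypothetical helpers, and the program codes in the example are placeholders.

```r
# Hypothetical helper: pick the course list link out of a page's hrefs
# instead of building the URL from the program code.
find_course_list_url <- function(hrefs, base_url = "https://my.uq.edu.au") {
  pattern <- "/programs-courses/(plan_display|program_list)\\.html\\?"
  hit <- grep(pattern, hrefs, value = TRUE)
  if (length(hit) == 0) {
    return(NA_character_)  # third page type: no detailed course list
  }
  url <- hit[1]
  # Resolve site-relative links against the base URL.
  if (startsWith(url, "/")) url <- paste0(base_url, url)
  url
}

# Tell the two URL forms apart, since the pages behind them differ.
page_type <- function(url) {
  if (is.na(url)) {
    "none"
  } else if (grepl("plan_display\\.html", url)) {
    "plan"
  } else {
    "program"
  }
}

# Placeholder hrefs; "XXXXX0000" and "0000" are not real codes.
hrefs <- c("/programs-courses/program.html?acad_prog=0000",
           "/programs-courses/plan_display.html?acad_plan=XXXXX0000")
course_list_url <- find_course_list_url(hrefs)
```

Knowing the page type up front makes it easier to switch parsing logic for the second kind of page.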
RMIT is the only uni so far that requires a document analysis (kinda). Thanks to the regular format of the tables (there are several of them), extracting everything is not painful.
On the contrary, I first thought Deakin's data was only available in its PDF handbook. That was a real PITA (guess what this means?) because although the TOC is two-column, the scraped data is not. So matching data from the left column with its leftovers on the next line is all but impossible. It feels confusing even now, as I describe this to you and to myself. Just forget about it; I eventually figured out where to scrape the data.
The lazy loading mechanism (if I'm correct) is really annoying on this website.
In detail, if we dive into the browser's developer console, the secret is unveiled: each time we scroll to the end of the page, an additional 500 results are rendered after a few seconds. So the best approach here is to get the response directly as JSON, dump it locally and deal with it later.
A response url looks like this and we can modify the parameters to catch them all at the same time!
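Modifying a parameter in such a URL could be done along these lines. The endpoint `https://example.com/search` and the parameter names `start` and `rows` are placeholders, not the site's real ones, and `set_query_param` is a hypothetical helper.

```r
# Hypothetical helper: replace the value of one query parameter in a URL.
set_query_param <- function(url, name, value) {
  pattern <- paste0("([?&])", name, "=[^&]*")
  # format() keeps large numbers out of scientific notation (1e+05 etc.).
  new_value <- format(value, scientific = FALSE, trim = TRUE)
  sub(pattern, paste0("\\1", name, "=", new_value), url)
}

# Placeholder URL: ask for 5000 rows at once instead of 500.
response_url <- "https://example.com/search?start=0&rows=500"
all_at_once <- set_query_param(response_url, "rows", 5000)
```

Whether the server honours a larger page size is something to check in the browser's network tab; if it does not, the same helper can step the offset parameter instead.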
The next step is to deal with the JSON file ;)
Search results only contain the top 50 of all results. :( Why? It's impossible to extract all the data if the data itself is kept secret.
2020-05-01. The unit codes grew from 3 digits to 4 digits earlier this year. All data are updated accordingly.