Fundamentals of Data Analysis Assessment Repository

Overview
Repository Contents
Requirements
How to run
References

Overview

This repository contains two Jupyter notebooks and ancillary files demonstrating some aspects of data analysis using the Python programming language and associated technologies. It was produced in fulfillment of the requirements for the Fundamentals of Data Analysis module of the HDipSc in Computing in Data Analytics at the Galway-Mayo Institute of Technology (GMIT). One notebook, cao.ipynb, focuses on the acquisition, parsing, and cleaning of data in various file formats from online sources without a formal API, while the other, pyplot.ipynb, focuses on data visualisation using the Matplotlib Python plotting library.

Repository Contents

The repository contains two Jupyter notebooks, cao.ipynb and pyplot.ipynb, which are independent of one another.

The rest of the contents of the repository all support those files in some way. They are:

README.md; this file
requirements.txt; a list of Python packages required to run the notebooks
.gitignore; a git support file which may be safely ignored
images/; a directory containing some images which appear in the CAO notebook
data/; a directory containing:
- ed_centroids.csv; a csv file containing geographic location information used in a plot in the pyplot notebook
- cao/; a directory containing files consumed or produced by the CAO notebook:
- a number of pdf, xlsx, and html files containing data downloaded from the CAO website which is used by the CAO notebook. These files follow the naming pattern 'cao_' + YYYY + '_lvl' + one of '8', '76', or '876', where YYYY is the year the data in the file refers to and the number after 'lvl' is the level of the courses listed in the file.
- cao/csv/; a directory containing a number of csv files of parsed and cleaned CAO points data output by the CAO notebook. These follow the naming pattern 'cao_' + YYYY + '.csv', where YYYY is the year of the data held in the csv file. The file cao_2001-2021.csv contains all of the data held in the individual year files. This file (or any of the other 'cao_' csv files) can be imported into the CAO notebook for analysis without having to wait for the data to be regenerated from source. Any of these files can also be used independently for other analyses.
- cao/backup/; a directory containing the same set of files as the source files directory (data/cao/), but with a timestamp appended to each filename. This is a set of backup source files automatically generated by the CAO notebook. Whenever the notebook is run, it checks for changes to the source data and, if any are found, updates the local copy and backs it up. Note that this system fails if the CAO changes the filename, format, or url of a resource. As such, its purpose is to back up old source data which has been inadvertently updated, to aid in bug and error management and mitigation.
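The naming patterns described above can be handled programmatically. The following is a minimal sketch, not code taken from the notebooks: the function names are hypothetical, the example filenames are invented for illustration, and the assumption that the pattern is followed directly by a pdf, xlsx, or html extension is inferred from the file types listed above.

```python
import re
from typing import NamedTuple, Optional

# Pattern described above: 'cao_' + YYYY + '_lvl' + '8', '76', or '876'.
# The extension group is an assumption based on the pdf, xlsx, and html
# formats mentioned in the text.
SOURCE_NAME = re.compile(r"^cao_(\d{4})_lvl(876|76|8)\.(pdf|xlsx|html)$")


class SourceFile(NamedTuple):
    year: int
    levels: str
    extension: str


def parse_source_name(filename: str) -> Optional[SourceFile]:
    """Return the year, level code, and extension, or None if no match."""
    match = SOURCE_NAME.match(filename)
    if match is None:
        return None
    year, levels, extension = match.groups()
    return SourceFile(int(year), levels, extension)


def csv_name_for_year(year: int) -> str:
    """Build the name of a cleaned per-year csv file: 'cao_' + YYYY + '.csv'."""
    return f"cao_{year}.csv"


# Hypothetical filenames, for illustration only.
print(parse_source_name("cao_2021_lvl8.xlsx"))
print(parse_source_name("cao_2015_lvl876.pdf"))
print(parse_source_name("README.md"))
print(csv_name_for_year(2019))
```

The backup files in cao/backup/ are not handled here because the format of the appended timestamp is not specified in this document.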

Requirements

Nothing extra is required to view the contents of the repository on GitHub, nbviewer, or Binder. However, see below for discussion of the limitations of these formats.

To run these notebooks locally, Python v3.9+ with Pip or some other package manager is the minimum requirement. In order to clone this repository, which is the easiest way to acquire the code, git v2+ is required. The Java Runtime Environment (JRE) 7+ or OpenJDK 7+ is required for some pdf parsing functionality in the CAO notebook.

Assuming Python is installed, the Python packages listed in requirements.txt are also required. These can usually be installed in one go from the requirements.txt file with pip or another Python package manager. See below for details.
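As a rough illustration, the prerequisites listed above could be checked with a short script along the following lines. This is a sketch only: check_prerequisites is a hypothetical helper, not part of the repository, and it only tests the running Python version and whether pip, git, and java executables can be found on the PATH. It does not verify the git or Java version numbers.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 9)):
    """Report whether each prerequisite appears to be available.

    Only the Python version is compared against a minimum; the other
    tools are simply looked up on the PATH.
    """
    return {
        "python": sys.version_info[:2] >= tuple(min_python),
        "pip": shutil.which("pip") is not None
        or shutil.which("pip3") is not None,
        "git": shutil.which("git") is not None,
        "java": shutil.which("java") is not None,
    }


for name, found in check_prerequisites().items():
    print(f"{name}: {'ok' if found else 'missing'}")
```

A missing java entry only affects the pdf parsing functionality in the CAO notebook; the rest of the code does not depend on it.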

How to run

There are three ways to consume the notebooks in this repository:

View here on github by simply clicking on cao.ipynb or pyplot.ipynb, or on nbviewer by clicking on the appropriate button:
- For the CAO notebook:
- For the pyplot notebook:
This is fine if viewing is all that is required, but if interactivity is necessary or desirable then the Binder or local options described below should be considered.
View and interact with the notebooks on binder by clicking on the button below:

This will give access to the entire repository via a JupyterLab session. The code in the notebooks can be changed and executed, or new notebooks can be started to experiment with the data, which is also accessible from the Binder session. However, because Java is not installed in the Binder image, the PDF table extraction functionality in the CAO notebook will not be available, and cells which use it will return an error if an attempt is made to run them. This may change in the future, as it appears to be possible to install arbitrary software in Binder containers using the Nix package manager and a default.nix file in a binder folder in the repository [1].

It is still possible to view and manipulate the full CAO dataset from within a Binder session using the CSV files in the data/cao/csv directory.
Clone the repository and run a Jupyter server locally by following these steps (these steps have been tested on a Linux system; some details may differ on other operating systems):
- Ensure that Python v3.9+, Pip, git v2+, and Java v7+ (Oracle or OpenJDK) are all installed.
- Clone the repository by typing git clone git@github.com:fod/fundamentals-data-analysis.git into a terminal.
- The repository will be downloaded. When the download is complete, enter the fundamentals-data-analysis directory and create a Python virtual environment in a directory called .venv with python -m venv .venv
- Activate the virtual environment with source .venv/bin/activate
- Next, install the required packages with pip install -r requirements.txt
- Finally, start a Jupyter server by typing jupyter-lab. A JupyterLab session should launch in a browser window. If it doesn't, paste the link printed in the terminal into a browser address bar.
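Once a session is running, whether on Binder or locally, the parsed CSV files in data/cao/csv can be read directly. The sketch below uses only the Python standard library and assumes that the files are UTF-8 encoded with a header row, neither of which is confirmed in this document. load_rows is a hypothetical helper, and pandas.read_csv would be the more usual choice if pandas is installed.

```python
import csv
from pathlib import Path


def load_rows(csv_path):
    """Read a csv file and return (column names, list of row dicts)."""
    # UTF-8 and a header row are assumptions about the file format.
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return reader.fieldnames or [], rows


# Path of the combined file, relative to the repository root.
combined = Path("data/cao/csv/cao_2001-2021.csv")
if combined.exists():
    columns, rows = load_rows(combined)
    print(f"{len(rows)} rows, columns: {columns}")
else:
    print(f"{combined} not found; run this from the repository root")
```

All values are returned as strings, so numeric columns would need converting before analysis.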

References

[1] Binder Development Team, 2021, default.nix - the nix package manager in Binder user guide [online]. Available from https://mybinder.readthedocs.io/en/latest/using/config_files.html#default-nix-the-nix-package-manager. Accessed 30-12-21.
