This repository contains content for the Big Data Analytics with Python course. In its latest iteration, the course was taught at The African Institute for Mathematical Sciences (AIMS), Rwanda in 2022 as part of the Master of Science in Mathematical Sciences (Data Science stream) programme. For more details about this Master's programme, please check the AIMS website.
This course teaches participants the core concepts required to work efficiently with large datasets (aka Big Data) and equips them with knowledge of the essential tools and techniques for interacting with large-scale datasets. The goal of the course is to introduce participants to the use of Python to perform data science tasks such as data ingestion, data analysis and machine learning when faced with a large dataset. For more details about the course content, refer to this outline. The main modules taught in the course are presented below.
- Module 1: Big Data Basics. See the lecture slides here.
- Module 2: Functional Programming and Distributed Data Processing. See the lecture slides here and the corresponding notebook here.
- Module 3: Data Gathering from APIs and the Web. See the lecture slides here and the corresponding notebooks here and here.
- Module 4: The Hadoop Ecosystem. See the lecture slides here and the notebook here.
- Module 5: Introduction to Apache Spark. See the lecture slides here and the corresponding notebook here.
- Module 6: Data Wrangling with Spark’s Structured API. See the lecture slides here and the corresponding notebook here.
- Module 7: Machine Learning with Apache Spark. See the lecture slides here and the corresponding notebook here.
The repository contains the following folders:
- SLIDES: This folder has all the PowerPoint and Google Slides presentations with lecture notes. Because the presentations are large, most of them are not uploaded here, so this folder will mostly be empty. However, the presentations can be found at the link.
- DOCS: This folder contains miscellaneous documents for the course, for instance, the course outline.
- NOTEBOOKS: This folder has all the source code for the tutorials. This includes the notebooks and Python files.
- DATASETS: As the name suggests, this folder has the datasets used in the course. Again, because of their size, these datasets are not uploaded here.
- RESOURCES: In this folder, there are learning resources such as PDF books and articles.
- SOFTWARE: This folder has all the packages required for the course. Because some of the installation files are large, they are not available here, but they can be found on the linked Google Drive.
- ASSIGNMENT: This folder contains the course assignments.
To follow this material, the recommended approach is to tackle the modules in the order they are presented in the outline above. For each topic, go through the slides first and then move on to the tutorials in the notebooks. It's worth mentioning that since the course was delivered in person, the material isn't necessarily ideal for self-paced learning, but a person with reasonable prerequisite knowledge can still follow the course and grasp the concepts.
For any questions regarding this course content, you can contact me through the two email addresses below: