About Us

The information in this GitHub repository is part of the materials for the subject High Performance Data Processing (SECP3133). The repository contains general big data information as well as big data case studies using Malaysian datasets. The case studies were created by students of the Bachelor of Computer Science (Data Engineering) programme, Universiti Teknologi Malaysia.

🚀 Case Study 1 : Pandas - Data Processing

Your submission will be evaluated using the following criteria:

The dataset must be larger than 100 MB.
Implement data processing related to big data concepts.
Ask and answer at least 5 questions about the dataset.
Include explanations in markdown cells alongside the code.
Your work must not be plagiarized, i.e. copy-pasted from somewhere else.

Follow this step-by-step guide to work on your project.

Step 1: Select a real-world dataset

Datasets are available from:
- Kaggle
- Dataset Search

Step 2: Perform data preparation & cleaning
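
As a rough illustration of this step, the sketch below removes duplicate rows, drops rows with missing values, and converts a text column to numbers using Pandas. The column names (`state`, `price`) and the data are made up for the example; a real dataset will need its own cleaning rules.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
raw = pd.DataFrame(
    {
        "state": ["Johor", "Johor", "Selangor", None, "Perak"],
        "price": ["100", "100", "250", "300", None],
    }
)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate and incomplete rows, then make `price` numeric."""
    out = df.drop_duplicates()
    out = out.dropna(subset=["state", "price"])
    out = out.assign(price=pd.to_numeric(out["price"], errors="coerce"))
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Dropping incomplete rows is only one option; filling missing values may suit some datasets better.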

Step 3: Perform exploratory analysis & visualization

Step 4: Ask & answer questions about the data
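
One possible way to frame a question and answer it with Pandas is sketched below. The question, the column names (`area`, `price`) and the values are invented for illustration and do not come from any team's submission.

```python
import pandas as pd

# Made-up property listings, used only to show the pattern.
listings = pd.DataFrame(
    {
        "area": ["Cheras", "Cheras", "Bangsar", "Bangsar", "Kepong"],
        "price": [400_000, 500_000, 900_000, 1_100_000, 350_000],
    }
)

# Question: which area has the highest average listing price?
avg_price = (
    listings.groupby("area")["price"].mean().sort_values(ascending=False)
)
top_area = avg_price.index[0]

print(avg_price)
print("Highest average price:", top_area)
```

In a notebook, each question like this would normally sit under its own markdown cell that states the question and interprets the result.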

Step 5: Summarize your inferences & write a conclusion

Write a summary of what you've learned from the analysis
Include interesting insights and graphs from previous sections
Share links to resources you found useful during your analysis

Step 6: Make a submission

🌟 Case Study 1: Solutions

Team	Title	Colab	GitHub
404 Error	Property in Kuala Lumpur
Alrite	ABC
BEFE	ABC
Boboiboy	Property Listings in Kuala Lumpur
COLBY	ABC
FANTOM	ABC
HAHA	Foreign Direct Investment In Malaysia
HD	ABC
KIA	Malaysia State Election 2018
LAB	ABC
MAAM	ABC
MEOW	ABC
MM	Malaysia's 14th State Election Result
PIXALATED	ABC
POTATO	ABC
QnX	ABC
SAMVERSE	ABC
SMOL	Population in Malaysia from 2010-2019
SQ	Number of Cases and Incidents Rate of Communicable Disease by State
TUK	Fraud Detection in Online Payment
UWU	Airline Delay 2017

🚀 Case Study 2 : Alternatives to Pandas for Processing Large Datasets

The Pandas library has become the de facto standard for data manipulation in Python and is widely used by data scientists and analysts. However, when a dataset is too large to fit in memory, Pandas may run into memory errors. Here are 8 alternatives to Pandas for dealing with large datasets. For each alternative library, we will examine how to load data from a CSV file and perform a simple groupby operation. Fortunately, many of these libraries have a syntax similar to Pandas, which makes the learning curve less steep.

DataTable
Polars
Vaex
Pyspark
Koalas
cuDF
Dask
Modin

This case study is divided into two parts:

Case Study: 2a

Use an appropriate dataset.
Explain the basic concepts of the library.
Show the code for its implementation step by step.

Case Study: 2b

Compare Pandas with the selected library.
Use the same dataset for both libraries when making comparisons.
You can also use visualization to show the comparison.
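
A comparison like this usually comes down to timing the same operation in both libraries. The sketch below shows one way to do that with a small helper, `best_time`, which is a name invented for this example. To stay runnable without extra installs, a plain-Python loop stands in for "the selected library"; in the actual case study that function would call Polars, Vaex, or whichever library was chosen.

```python
import time
from typing import Callable

import pandas as pd


def best_time(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the fastest wall-clock time in seconds over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Made-up data; both candidates must work on the same dataset.
df = pd.DataFrame(
    {"key": [i % 10 for i in range(100_000)], "value": range(100_000)}
)


def pandas_groupby():
    return df.groupby("key")["value"].sum()


def baseline_groupby():
    # Placeholder for the selected library: a plain-Python group and sum.
    totals = {}
    for k, v in zip(df["key"].tolist(), df["value"].tolist()):
        totals[k] = totals.get(k, 0) + v
    return totals


results = {
    "pandas": best_time(pandas_groupby),
    "baseline": best_time(baseline_groupby),
}
for name, seconds in results.items():
    print(f"{name}: {seconds:.4f} s")
```

The `results` dictionary can be passed directly to a bar chart for the visual comparison. Taking the minimum over several runs reduces noise from other processes, which matters on shared Colab machines.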

🌟 Case Study 2a: Solutions

Team	Title	Colab	GitHub
1	DataTable
2	Polars
3	Vaex
4	Pyspark
5	Koalas
6	cuDF
7	DataTable
8	Polars
9	Vaex
10	Pyspark
11	Koalas

🌟 Case Study 2b: Solutions

Team	Title	Colab	GitHub
1	Pandas vs DataTable
2	Pandas vs Polars
3	Pandas vs Vaex
4	Pandas vs Pyspark
5	Pandas vs Koalas
6	Pandas vs cuDF
7	Pandas vs DataTable
8	Pandas vs Polars
9	Pandas vs Vaex
10	Pandas vs Pyspark
11	Pandas vs Koalas

🚀 Project: Instructions

Use a dataset that is larger than 1 GB. You can get the dataset from Kaggle or Dataset Search. The dataset must be a CSV file.
Store the dataset in Google Drive.
Create a shareable link so that the dataset can be used in Google Colab.
Implement big data operations that make it possible to process the dataset.
Use at least three libraries related to big data processing, such as Pandas, Dask, Vaex and Modin.
Compare the processing results from the selected libraries.
Apply Exploratory Data Analysis (EDA) in this project.
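
One operation that lets Pandas handle a file larger than available memory is reading it in chunks and combining partial results. The sketch below shows the idea on a tiny in-memory CSV. The column names (`state`, `cases`) and the chunk size are assumptions for illustration, and a real project would pass the path of the file stored in Google Drive instead of a `StringIO` object.

```python
import io

import pandas as pd

# Stand-in for a large CSV file; a real run would pass a file path instead.
rows = "\n".join(f"S{i % 3},{i}" for i in range(10))
csv_data = io.StringIO("state,cases\n" + rows)

totals = {}
# chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv(csv_data, chunksize=4):
    partial = chunk.groupby("state")["cases"].sum()
    for state, value in partial.items():
        totals[state] = totals.get(state, 0) + int(value)

print(totals)
```

This pattern works for aggregations that can be combined across chunks, such as sums and counts. Statistics like the median need a different approach, which is one reason to compare Pandas against libraries such as Dask or Vaex.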

🌟 Project: Solutions

Team	Libraries for data science	Colab	GitHub
1	DataTable
2	Polars
3	Vaex
4	Pyspark
5	Koalas
6	cuDF
7	DataTable
8	Polars
9	Vaex
10	Pyspark
11	Koalas
