goldboy225 / Python-big-data

This GitHub repository is part of the materials for the subject High Performance Data Processing (SECP3133). It contains general big data information as well as big data case studies using Malaysian datasets. The case studies were created by Bachelor of Computer Science (Data Engineering) students at Universiti Teknologi Malaysia.

🚀 Case Study 1: Pandas - Data Processing

Your submission will be evaluated using the following criteria:

  • The dataset must be larger than 100 MB.
  • Implement data processing that relates to big data concepts.
  • You must ask and answer at least 5 questions about the dataset.
  • Your submission must include explanations in markdown cells, in addition to the code.
  • Your work must not be plagiarized, i.e. copy-pasted from somewhere else.
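
The size criterion above can be checked before any processing starts. The following is a minimal sketch, assuming the dataset is a single local file and reading "100 MB" as 100 × 1024 × 1024 bytes; the helper names `is_large_enough` and `human_size` are hypothetical and not part of the course materials.

```python
import os

# Assumption: "100 MB" is interpreted as 100 * 1024 * 1024 bytes.
MIN_SIZE_BYTES = 100 * 1024 * 1024


def is_large_enough(path, min_bytes=MIN_SIZE_BYTES):
    """Return True if the file at `path` is strictly larger than `min_bytes`."""
    return os.path.getsize(path) > min_bytes


def human_size(num_bytes):
    """Format a byte count in megabytes with one decimal place."""
    return f"{num_bytes / (1024 * 1024):.1f} MB"
```

In a notebook, `is_large_enough("my_dataset.csv")` (a placeholder file name) would confirm the criterion, and `human_size(os.path.getsize("my_dataset.csv"))` gives a figure to quote in a markdown cell.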

Follow this step-by-step guide to work on your project.

Step 1: Select a real-world dataset

Step 2: Perform data preparation & cleaning

Step 3: Perform exploratory analysis & visualization

Step 4: Ask & answer questions about the data

Step 5: Summarize your inferences & write a conclusion

  • Write a summary of what you've learned from the analysis
  • Include interesting insights and graphs from previous sections
  • Share links to resources you found useful during your analysis

Step 6: Make a submission
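
Steps 2 to 4 above can be sketched with Pandas as follows. This is only an illustration under stated assumptions: the tiny in-memory table stands in for a real dataset, and the column names `location`, `price`, and `rooms` are hypothetical.

```python
import pandas as pd

# Hypothetical property listings; a real project would load a large CSV instead.
raw = pd.DataFrame({
    "location": ["KLCC", "Cheras", "KLCC", "Cheras", "KLCC", None],
    "price": ["1200000", "450000", "1200000", None, "980000", "300000"],
    "rooms": [3, 2, 3, 3, 2, 1],
})

# Step 2: remove exact duplicates, drop rows without a location,
# and convert the price column from text to numbers.
clean = (
    raw.drop_duplicates()
       .dropna(subset=["location"])
       .assign(price=lambda d: pd.to_numeric(d["price"], errors="coerce"))
)
# Fill the remaining missing prices with the median price.
clean["price"] = clean["price"].fillna(clean["price"].median())

# Step 3: a quick numeric summary as a starting point for exploration.
summary = clean["price"].describe()

# Step 4: example question: which location has the highest mean price?
mean_price = clean.groupby("location")["price"].mean().sort_values(ascending=False)
top_location = mean_price.index[0]
```

Filling missing prices with the median is one choice among several; dropping those rows or imputing per location may suit a given dataset better.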

🌟 Case Study 1: Solutions

| Team | Title | Colab | GitHub |
|---|---|---|---|
| 404 Error | Property in Kuala Lumpur | Open in Colab | Open in GitHub |
| Alrite | ABC | Open in Colab | Open in GitHub |
| BEFE | ABC | Open in Colab | Open in GitHub |
| Boboiboy | Property Listings in Kuala Lumpur | Open in Colab | Open in GitHub |
| COLBY | ABC | Open in Colab | Open in GitHub |
| FANTOM | ABC | Open in Colab | Open in GitHub |
| HAHA | Foreign Direct Investment In Malaysia | Open in Colab | Open in GitHub |
| HD | ABC | Open in Colab | Open in GitHub |
| KIA | Malaysia State Election 2018 | Open in Colab | Open in GitHub |
| LAB | ABC | Open in Colab | Open in GitHub |
| MAAM | ABC | Open in Colab | Open in GitHub |
| MEOW | ABC | Open in Colab | Open in GitHub |
| MM | Malaysia's 14th State Election Result | Open in Colab | Open in GitHub |
| PIXALATED | ABC | Open in Colab | Open in GitHub |
| POTATO | ABC | Open in Colab | Open in GitHub |
| QnX | ABC | Open in Colab | Open in GitHub |
| SAMVERSE | ABC | Open in Colab | Open in GitHub |
| SMOL | Population in Malaysia from 2010-2019 | Open in Colab | Open in GitHub |
| SQ | Number of Cases and Incidents Rate of Communicable Disease by State | Open in Colab | Open in GitHub |
| TUK | Fraud Detection in Online Payment | Open in Colab | Open in GitHub |
| UWU | Airline Delay 2017 | Open in Colab | Open in GitHub |

🚀 Case Study 2: Alternatives to Pandas for Processing Large Datasets

Pandas has become the de facto library for data manipulation in Python and is widely used by data scientists and analysts. However, when a dataset is too large to fit in memory, Pandas may run into memory errors. Here are 8 alternatives to Pandas for dealing with large datasets. For each alternative library, we will examine how to load data from CSV and perform a simple groupby operation. Fortunately, many of these libraries have syntax similar to Pandas, which makes the learning curve less steep.

  1. DataTable
  2. Polars
  3. Vaex
  4. PySpark
  5. Koalas
  6. cuDF
  7. Dask
  8. Modin
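
As a baseline for the "load from CSV and group by" task described above, the sketch below shows how Pandas itself can process a CSV in chunks so that the whole file never has to sit in memory at once. It is a minimal illustration, not the approach used in the solutions: the in-memory buffer stands in for a large file, and the column names `state` and `cases` are hypothetical.

```python
import io

import pandas as pd

# Small in-memory stand-in for a large CSV file; with a real dataset,
# pass the file path to read_csv instead of this buffer.
csv_data = io.StringIO(
    "state,cases\n"
    "Johor,10\n"
    "Selangor,25\n"
    "Johor,5\n"
    "Sabah,7\n"
    "Selangor,15\n"
)

# Read a few rows at a time and aggregate each chunk separately.
partial_sums = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    partial_sums.append(chunk.groupby("state")["cases"].sum())

# Combine the per-chunk results: a sum of partial sums is the overall sum.
totals = pd.concat(partial_sums).groupby(level=0).sum()
```

This works directly because a sum can be built from partial sums. A mean would need the per-chunk sum and count to be combined instead, which is the kind of bookkeeping that the libraries listed above aim to handle for you.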

This case study is divided into two parts:

  1. Case Study 2a
     • Use an appropriate dataset.
     • Explain the basic concepts of the library.
     • Show the code for its implementation step by step.
  2. Case Study 2b
     • Compare Pandas with the selected library.
     • Use the same dataset for both sides of the comparison.
     • You may also use visualization to show the comparison.
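
One way to make the comparison in Case Study 2b repeatable is a small timing harness that runs each candidate on the same input several times and reports the median. The sketch below uses only the standard library; the functions `benchmark` and `compare` are hypothetical helpers, and the two placeholder workloads stand in for "the same operation in Pandas" and "the same operation in the selected library".

```python
import statistics
import time


def benchmark(func, repeats=5):
    """Run `func` `repeats` times and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def compare(candidates, repeats=5):
    """Benchmark each named callable and return {name: median_seconds}."""
    return {name: benchmark(func, repeats) for name, func in candidates.items()}


def manual_sum(values):
    """Placeholder workload: sum values with an explicit loop."""
    total = 0
    for value in values:
        total += value
    return total


# Placeholder workloads on the same data; replace them with callables that
# load and aggregate the same dataset in Pandas and in the selected library.
data = list(range(100_000))
results = compare({
    "builtin_sum": lambda: sum(data),
    "manual_loop": lambda: manual_sum(data),
})
```

The resulting dictionary can be passed straight to a bar chart for the visual comparison. Using the median rather than a single run reduces the effect of one-off slowdowns, although timings on shared Colab machines will still vary between sessions.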

🌟 Case Study 2a: Solutions

| Team | Title | Colab | GitHub |
|---|---|---|---|
| 1 | DataTable | Open in Colab | Open in GitHub |
| 2 | Polars | Open in Colab | Open in GitHub |
| 3 | Vaex | Open in Colab | Open in GitHub |
| 4 | PySpark | Open in Colab | Open in GitHub |
| 5 | Koalas | Open in Colab | Open in GitHub |
| 6 | cuDF | Open in Colab | Open in GitHub |
| 7 | DataTable | Open in Colab | Open in GitHub |
| 8 | Polars | Open in Colab | Open in GitHub |
| 9 | Vaex | Open in Colab | Open in GitHub |
| 10 | PySpark | Open in Colab | Open in GitHub |
| 11 | Koalas | Open in Colab | Open in GitHub |

🌟 Case Study 2b: Solutions

| Team | Title | Colab | GitHub |
|---|---|---|---|
| 1 | Pandas vs DataTable | Open in Colab | Open in GitHub |
| 2 | Pandas vs Polars | Open in Colab | Open in GitHub |
| 3 | Pandas vs Vaex | Open in Colab | Open in GitHub |
| 4 | Pandas vs PySpark | Open in Colab | Open in GitHub |
| 5 | Pandas vs Koalas | Open in Colab | Open in GitHub |
| 6 | Pandas vs cuDF | Open in Colab | Open in GitHub |
| 7 | Pandas vs DataTable | Open in Colab | Open in GitHub |
| 8 | Pandas vs Polars | Open in Colab | Open in GitHub |
| 9 | Pandas vs Vaex | Open in Colab | Open in GitHub |
| 10 | Pandas vs PySpark | Open in Colab | Open in GitHub |
| 11 | Pandas vs Koalas | Open in Colab | Open in GitHub |

🚀 Project: Instructions

  1. Use a dataset that is larger than 1 GB. You can get the dataset from Kaggle or Dataset Search. The dataset file must be in CSV format.
  2. Store the dataset in Google Drive.
  3. Create a shareable link so that the dataset can be used in Google Colab.
  4. Implement big data operations that allow the dataset to be processed.
  5. Use at least three libraries related to big data processing, such as Pandas, Dask, Vaex, and Modin.
  6. Compare the processing results from the selected libraries.
  7. Apply the concept of Exploratory Data Analysis (EDA) in this project.
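
For instructions 2 and 3, a Google Drive share link usually has to be turned into a direct-download link before a notebook can read the file. The sketch below extracts the file ID from a share URL and builds such a link. It rests on assumptions: the `uc?export=download&id=...` URL form is a commonly used pattern rather than something stated in these instructions, the helper names are hypothetical, and `FILE_ID_123` is a placeholder.

```python
import re


def drive_file_id(share_url):
    """Extract the file ID from a Google Drive share URL.

    Handles the two common shapes: '.../file/d/<id>/view' and '...?id=<id>'.
    """
    match = (
        re.search(r"/d/([A-Za-z0-9_-]+)", share_url)
        or re.search(r"[?&]id=([A-Za-z0-9_-]+)", share_url)
    )
    if match is None:
        raise ValueError(f"No file ID found in: {share_url}")
    return match.group(1)


def drive_download_url(share_url):
    """Build a direct-download URL (assumed pattern) from a share URL."""
    return f"https://drive.google.com/uc?export=download&id={drive_file_id(share_url)}"


# Placeholder share link; replace FILE_ID_123 with the ID of your own file.
example = drive_download_url(
    "https://drive.google.com/file/d/FILE_ID_123/view?usp=sharing"
)
```

For files above 1 GB, Google Drive may show a confirmation page instead of serving the file directly, so this link may not work as-is. Mounting Drive inside Colab or using a download tool such as gdown are alternatives worth trying in that case.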

🌟 Project: Solutions

| Team | Libraries for data science | Colab | GitHub |
|---|---|---|---|
| 1 | DataTable | Open in Colab | Open in GitHub |
| 2 | Polars | Open in Colab | Open in GitHub |
| 3 | Vaex | Open in Colab | Open in GitHub |
| 4 | PySpark | Open in Colab | Open in GitHub |
| 5 | Koalas | Open in Colab | Open in GitHub |
| 6 | cuDF | Open in Colab | Open in GitHub |
| 7 | DataTable | Open in Colab | Open in GitHub |
| 8 | Polars | Open in Colab | Open in GitHub |
| 9 | Vaex | Open in Colab | Open in GitHub |
| 10 | PySpark | Open in Colab | Open in GitHub |
| 11 | Koalas | Open in Colab | Open in GitHub |
