marwagaser / DataEngineering

About the Repo

Colab DEMO

Colab Jupyter Notebook of this project

Overview and Motivation

This notebook applies three main stages of data preprocessing. Before discussing the stages, the data set used is described.

The Nobel Prize data set is obtained from Kaggle: Nobel Prize Data Set. It covers the Nobel Prize laureates from 1901 to 2016, with some information about them and the organizations they belong to.

1. The first preprocessing stage in this notebook is Data Cleaning, where null values are imputed or dropped and unwanted characters are removed from the data set using regex.
2. The second stage is Data Integration, where external data sets are fetched and merged with the existing one. In this notebook, that is done to fill in missing values.
3. Finally, Data Reduction removes the columns that are not needed for our analysis.
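
The three stages can be sketched in plain Python as follows. This is only an illustrative sketch, not the notebook's actual code: the records, the field names (`name`, `motivation`, `org`, `org_country`, `notes`), the `org_to_country` lookup, and the helper functions are all hypothetical stand-ins for the real data set and the external data sets.

```python
import re

# Hypothetical toy records; the field names are assumptions, not the real schema.
laureates = [
    {"name": "Laureate A", "motivation": '"for work on <topic> X"',
     "org": "Institute 1", "org_country": None, "notes": "n/a"},
    {"name": None, "motivation": '"unknown"',
     "org": "Institute 2", "org_country": "Country B", "notes": "n/a"},
    {"name": "Laureate C", "motivation": '"for work on topic Y"',
     "org": "Institute 2", "org_country": "Country B", "notes": "n/a"},
]

# Hypothetical external lookup standing in for a merged external data set.
org_to_country = {"Institute 1": "Country A", "Institute 2": "Country B"}


def clean(rows):
    """Data cleaning: drop rows with no name, strip unwanted characters via regex."""
    cleaned = []
    for row in rows:
        if not row.get("name"):
            continue  # drop records whose key field is null
        row = dict(row)  # copy so the input is left untouched
        text = row.get("motivation") or ""
        # Keep only letters, digits, spaces and basic punctuation.
        row["motivation"] = re.sub(r"[^A-Za-z0-9 ,.'-]", "", text).strip()
        cleaned.append(row)
    return cleaned


def integrate(rows, lookup):
    """Data integration: fill missing org_country values from the external lookup."""
    merged = []
    for row in rows:
        row = dict(row)
        if row.get("org_country") is None:
            row["org_country"] = lookup.get(row.get("org"))
        merged.append(row)
    return merged


def reduce_columns(rows, keep):
    """Data reduction: keep only the columns needed for the analysis."""
    return [{key: row[key] for key in keep if key in row} for row in rows]


result = reduce_columns(
    integrate(clean(laureates), org_to_country),
    keep=("name", "motivation", "org_country"),
)
```

Each stage returns new records rather than modifying its input, so the stages can be inspected or reordered independently.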

Data Exploration Questions

The columns to be reduced or dropped are chosen based on our target analysis of this data set.
Initially, we wanted our analysis to focus on female winners. However, during our initial exploratory analysis we discovered that a large percentage of the records are for male laureates, so we made the analysis slightly more general. Our target is therefore to answer the following questions, grouped under several themes:
a. What is the most repeated word in the motivation of the category that interests women the most?
b. How has the number of female winners changed over the years in the organization country where most female winners come from?
c. Which fields are most prevalent in each organization country?
d. Which organization countries were the most productive over the decades?
e. Which organization, in the most productive country, was the most productive over the decades?
f. Which fields interest which age groups?
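
Questions such as (a) and (d) come down to counting words and grouping records by decade. The following is a minimal sketch of those two operations, assuming hypothetical records, field names (`year`, `org_country`, `category`, `motivation`), and a small hand-picked stop-word list. It does not reproduce the notebook's own analysis.

```python
from collections import Counter

# Hypothetical toy records; values and field names are illustrative only.
prizes = [
    {"year": 1903, "org_country": "Country A", "category": "Physics",
     "motivation": "for research on radiation"},
    {"year": 1911, "org_country": "Country A", "category": "Chemistry",
     "motivation": "for the discovery of new elements"},
    {"year": 1909, "org_country": "Country B", "category": "Physics",
     "motivation": "for research on wireless telegraphy"},
]

# Assumed stop-word list so that filler words do not dominate the count.
STOP_WORDS = {"for", "the", "of", "on", "in", "and", "a", "to"}


def decade(year):
    """Map a year to the first year of its decade, e.g. 1911 -> 1910."""
    return year // 10 * 10


def most_common_word(rows, category):
    """Most repeated non-stop word in the motivations of one category."""
    words = Counter()
    for row in rows:
        if row["category"] == category:
            words.update(
                word for word in row["motivation"].lower().split()
                if word not in STOP_WORDS
            )
    return words.most_common(1)[0][0] if words else None


def prizes_by_decade_and_country(rows):
    """Count prizes per (decade, organization country) pair."""
    return Counter((decade(row["year"]), row["org_country"]) for row in rows)


top_word = most_common_word(prizes, "Physics")
counts = prizes_by_decade_and_country(prizes)
```

The same decade bucketing could be reused for question (e) by counting organizations instead of countries.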

Related Work

1. The following DataCamp project questions inspired us to explore certain attributes of the data set.
2. The following GitHub repository helped us with data visualization related to the Nobel Prize data set.
3. In addition, the following dataset was used in an attempt to find missing values.
4. The following dataset was retrieved from the following GitHub repository.
5. The lecture slides of Dr. Mervat AbuElKheir helped us understand the stages of data preprocessing.
