organization-x / DS-Course

Notebooks exploring data and different ML models

AI Camp Data Science Advanced Walkthrough

A case study put together by instructor Cameron Jackson. This is an advanced version of the project that deliberately uses a messy, hard-to-use dataset to give a thorough review of the kind of work required when cleaning, exploring, visualizing, and modeling the data. Instructors are not required to use this as a resource, but we recommend reading through and running these notebooks to understand the story told by this data.

NOTE: This is a very advanced example of the project; it is acceptable and expected for students and instructors to do a much simpler version.

🧼 Cleaning the Data

To see just how different and dirty real-world data can be, look through sections 1 and 2 of the notebook: https://github.com/organization-x/DS-Course/blob/main/DS_Merged_Data_Exploration.ipynb. To summarize how messy the data is, the datasets we were merging had several problems, listed below (a rough cleaning sketch follows the list):

  1. Different naming conventions between datasets, e.g. using “&” vs. writing out “and”
  2. Random extra spacing in country names
  3. Outdated country names, requiring a look into the countries’ histories to figure out what the datasets were referring to
  4. Including or excluding certain territories without any noticeable pattern
  5. Missing data in certain columns
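
As a rough illustration of that cleanup in pandas (the column name "Country", the file names, and the specific replacements here are hypothetical, not taken from the actual datasets), a cleaning pass might look like this:

```python
import pandas as pd

# Hypothetical example: normalize country names before merging two datasets.
def clean_country(name: str) -> str:
    name = name.strip()                   # drop stray leading/trailing spaces
    name = name.replace("&", "and")       # unify "&" vs. "and" naming conventions
    outdated = {"Swaziland": "Eswatini"}  # map outdated names to current ones
    return outdated.get(name, name)

df_heights = pd.read_csv("heights.csv")      # hypothetical file names
df_countries = pd.read_csv("countries.csv")

for df in (df_heights, df_countries):
    df["Country"] = df["Country"].astype(str).map(clean_country)

# Merge on the cleaned key and drop rows with missing values.
merged = df_heights.merge(df_countries, on="Country", how="inner").dropna()
```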

We will be using the pandas Python library to manage all of our data needs throughout the project, so I recommend becoming familiar with it. At this point, you should start looking for pieces of data that can help answer the question you started with, and map out correlations between the different pieces of information you have gathered to see whether they are relevant to the thesis you created at the beginning. Pandas can produce a correlation map between columns of the data to help start this conversation.
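
For example, a correlation map can be built directly from the merged DataFrame; this sketch assumes the `merged` DataFrame from above and uses seaborn for plotting (an assumption, not a requirement):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix across the numeric columns of the merged DataFrame.
corr = merged.corr(numeric_only=True)

# A heatmap makes it easy to spot columns related to your thesis variable.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation map of merged dataset")
plt.show()
```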

📠 Choosing your ML model / Training the model

To select a machine learning model, you should be very intentional in defining the thesis you are trying to prove or disprove, then train the model accordingly based on the type of machine learning problem your thesis presents (classification or regression).

From there, you need to be ready to use evaluation metrics and visualizations as part of your analysis. To do so, we will need to split our data into train/test sets using sklearn’s train_test_split() function.
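
A minimal sketch of that split, assuming the `merged` DataFrame from above and a hypothetical target column named `target` (swap in your own feature and target columns):

```python
from sklearn.model_selection import train_test_split

# Hypothetical feature/target selection; adjust column names to your own data.
X = merged.drop(columns=["target"])
y = merged["target"]

# Hold out 20% of the rows for evaluation; fix random_state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```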

Here is a list of possible ML models that you can use as guidance as you start to explore the different schools of thought that produce different models (a short training and evaluation sketch follows the list):

  1. SVM
  2. KNN
  3. Neural Networks (tensorflow)
  4. Random Forest
  5. XGBoost
  6. Linear Regression
  7. Naive Bayes
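
As one possible starting point (not a required choice, and assuming a regression-style thesis), here is how a random forest could be trained and evaluated on the split from above; the hyperparameters are illustrative:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Fit a random forest on the training split.
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test split with regression metrics.
preds = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, preds))
print("R^2:", r2_score(y_test, preds))
```

For a classification thesis, the same pattern applies with a classifier (e.g. RandomForestClassifier) and metrics such as accuracy or a confusion matrix.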

Extension: Deploying a DS Project to the Web

This is not covered by this walkthrough.

Source of Data:

https://www.kaggle.com/datasets/majyhain/height-of-male-and-female-by-country-2022

https://www.kaggle.com/datasets/fernandol/countries-of-the-world
