Data Scientist in Practice

Welcome to my first data science project. It is part of the Data Scientist in Practice Marathon 3.0, run by the Portuguese-language YouTube channel https://www.youtube.com/channel/UCd3ThZLzVDDnKSZMsbK0icg. The marathon took place on the 22nd, 24th, 26th and 29th of March 2021. The teacher, Eduardo, shared the data and the commented code, along with video classes explaining how the code works.

The goal was to build a predictive model that employees at lending companies could use to predict whether a loan will be approved. The data provided by the teacher consists of 614 clients and their information, e.g. marital status, income, number of children, how much they wanted to borrow, and whether that loan was approved.

Check below for the statistics and my analysis!

My experience during the project

At the time of the marathon, I was enrolled in TechLabs' Data Science course, so I already had basic knowledge of data science; the marathon helped me put everything I had learned into practice. I also learned a lot about machine learning, which I found easier than expected! I was able to use the base code, which was very good, and edit it as I pleased. The data and comments were originally in Portuguese; among my edits, I changed the NaN removal approach and improved the plots.

Technologies used

The code was written in Python using Google Colab. The plots were built using seaborn and matplotlib. The machine learning model was built using RandomForestClassifier from sklearn. The web system used HTML to set the font and colors.

The web system is served through an ngrok.io URL. Since that URL expires after some time, here are some screenshots to show that the system works.

Data analysis

Variables and their meaning:

client_code: client code in the conglomerate
sex: sex
marital status: single/married
dependents: if the person has children and how many
education: schooling
employed: if the person is working or not
income: monthly income
spouses_income: income of the person's spouse
loan_value: value of the requested loan (in thousands)
installment: value of the monthly installment
credit_history: if the person has delayed a payment or is in breach
realty: if the person owns a place
loan_approval: if there was approval of the requested loan (variable to be predicted)

There were 614 entries and 13 columns. Some of the information was incomplete.
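A minimal sketch of how the size of the table and its incomplete columns can be inspected with pandas. The tiny DataFrame below is invented for illustration (it is not the real dataset), and the median/most-frequent filling strategy is an assumption, not necessarily the NaN handling used in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the real 614-row dataset;
# every value below is invented for illustration only.
df = pd.DataFrame({
    "sex": ["male", "female", None, "male"],
    "income": [5000.0, 3200.0, 4100.0, np.nan],
    "loan_value": [120.0, np.nan, 95.0, 150.0],
    "loan_approval": ["Y", "N", "Y", "Y"],
})

# Number of rows and columns (the real data has 614 and 13).
n_rows, n_cols = df.shape

# Count of missing values per column, to see which information is incomplete.
missing_per_column = df.isna().sum()
print(missing_per_column)

# One possible way to handle the gaps (an assumption for this sketch):
# numeric columns get the median, categorical columns get the most frequent value.
filled = df.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
```

Filling instead of dropping rows keeps all clients in the training data, which matters when the dataset is as small as this one.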

The 75th percentile of income and loan_value is 5795 and 168000, respectively, while the maximum values of those variables are 81000 and 700000. This means that most clients sit far below the maximum: few people earn a lot of money, and few want to borrow a lot.
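The comparison between the 75th percentile and the maximum can be reproduced with pandas. This is only a sketch: the income values are invented to mimic a right-skewed distribution and are not taken from the real data.

```python
import pandas as pd

# Invented incomes: most are modest, one is very large (right-skewed).
income = pd.Series([2500, 3000, 3500, 4000, 4500, 5000, 6000, 81000])

# 75% of the values are at or below this number.
q75 = income.quantile(0.75)
maximum = income.max()

# describe() reports count, mean, std, min, the quartiles and max in one call.
print(income.describe())
print(f"75th percentile: {q75}, maximum: {maximum}")
```

When the maximum is many times larger than the 75th percentile, as here, a handful of very large values stretches the upper tail of the distribution.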

The most frequent characteristics were: male, married, graduate, 0 children, with debts, and owning a semi-urban property.

After creating the predictive model, we can analyse which variables influenced its predictions the most.
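As a rough illustration of how variable influence can be read from a random forest, the sketch below uses the `feature_importances_` attribute of sklearn's `RandomForestClassifier`. The data is synthetic and the target is deliberately driven by `credit_history`, so the ranking here says nothing about the real results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features; column names borrowed from the variable list above,
# values randomly generated for illustration only.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "income": rng.normal(5000, 1500, n),
    "loan_value": rng.normal(150, 40, n),
    "credit_history": rng.integers(0, 2, n),
})

# Synthetic target: follows credit_history, with roughly 10% of labels flipped.
noise = rng.random(n) < 0.1
y = np.where(noise, 1 - X["credit_history"], X["credit_history"])

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances, normalised by sklearn to sum to 1.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```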

Since I changed the random_state parameter of the train_test_split function from sklearn, my results differed from the teacher's, which is worth noticing and exploring. The variables' influence came out different, and the accuracy score was higher.
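The effect of `random_state` on the split can be sketched as follows. The dataset is generated with `make_classification` (it only matches the real data in its number of rows), and the seeds, `test_size=0.3` and `n_estimators=100` are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 614 rows, 11 feature columns.
X, y = make_classification(
    n_samples=614, n_features=11, n_informative=4, random_state=0
)

def accuracy_for_split(seed):
    """Train on the split produced by `seed` and return the test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    # The model's own seed is fixed so only the split changes between runs.
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

scores = {seed: accuracy_for_split(seed) for seed in (0, 1, 42)}
for seed, score in scores.items():
    print(f"random_state={seed}: accuracy={score:.3f}")
```

Because each seed puts different rows in the test set, the score usually moves a little from one seed to another. A higher score from a different seed does not by itself mean a better model; averaging over several seeds, or using cross-validation, gives a steadier estimate.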

Conclusion

Learning how the random forest algorithm and its parameters work was incredible; it certainly increased my interest in machine learning. It was great to see how small changes affect the outcome, and how many ways there are to do the same thing in data science.

Contributing

Pull requests are welcome! For major changes and suggestions, please open an issue to discuss what you would like to change. I would love to learn more.