EdgarACarneiro / feup-ecac

feup-ecac

Project developed for 'Knowledge Extraction and Machine Learning', a fifth year subject @FEUP. Made in collaboration with @cyrilico.

A summary of the theoretical material is available here.

The folder /research-project contains the materials necessary to develop the ECAC research project (2nd project).

Project Grade

| Component | Grade |
| --- | --- |
| Project | 20 |
| Classification | 17 |

Usage

To run the Jupyter notebooks, first run the following commands in a terminal with Python 3 available:

  • On Mac/Linux
python3 -m venv venv
. venv/bin/activate
pip install -U -r requirements.txt
jupyter notebook
  • On Windows
py -3 -m venv venv
venv\Scripts\activate
pip install -U -r requirements.txt
jupyter notebook

When you are done, the virtual environment can be terminated by running:

deactivate

In the Jupyter notebook web page, open the pre_processing.ipynb file first and simply run all cells.

Thereafter, open the prediction.ipynb file. Note that this file uses the data output by the preprocessing step. Again, run all cells, but pay attention to the inline comments highlighting cells that can be changed to better suit your needs, for example:

# CHANGE THIS LINE TO CHANGE THE USED CLASSIFICATION METHOD
classifier = create_DT()

After running, you should find your predictions in the file you indicated, in the desired format.
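The `create_DT` call above comes from the notebook; its definition is not shown here. A minimal sketch of what such a factory might look like, assuming scikit-learn and a hypothetical `create_RF` alternative (both names illustrative, not the notebook's actual code):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def create_DT(max_depth=None):
    # Decision Tree, the model used in several submissions
    return DecisionTreeClassifier(max_depth=max_depth, random_state=42)

def create_RF(n_estimators=100):
    # Random Forest alternative; class_weight='balanced' helps with
    # the imbalanced loan-status labels mentioned in the history below
    return RandomForestClassifier(n_estimators=n_estimators,
                                  class_weight='balanced',
                                  random_state=42)

# CHANGE THIS LINE TO CHANGE THE USED CLASSIFICATION METHOD
classifier = create_DT()
```

Swapping the factory call is then the only change needed to compare models in the notebook.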

Final submission

Final presentation slides available here.

Final leaderboards available here - placed 9️⃣.

Submission history

  • ❗ : Submissions selected for competition scoring. Notice that we did not have access to the private score when choosing the two submissions.
| Public Score | Private Score | Local Score | Date | Improvement over previous submission |
| --- | --- | --- | --- | --- |
| 0.59259 | 0.57160 | Not recorded | 23.09.2019 | Decision Tree without feature engineering, using only the loan table |
| 0.61049 | 0.59876 | Not recorded | 23.09.2019 | Joined account table, replaced loan date with the number of days since account creation and categorized the account's frequency |
| 0.56543 | 0.61728 | Not recorded | 24.09.2019 | Added categorical columns and a column with the number of days since the first loan ever |
| 0.62839 | 0.65864 | Not recorded | 24.09.2019 | Removed number of days since first loan ever; added the number of account users and their credit card types, re-added loan date |
| 0.50000 | 0.50000 | Not recorded | 25.09.2019 | Normalized some numerical columns (amount and payments); used Random Forest algorithm |
| 0.62839 | 0.58888 | Not recorded | 26.09.2019 | Added new features (such as monthly_loan, monthly_loan-to-monthly_receiving & monthly_only_receiving), removed ones without impact and changed to Decision Tree |
| 0.59259 | 0.63209 | Not recorded | 26.09.2019 | Removed loan_id feature |
| 0.57716 | 0.60802 | Not recorded | 27.09.2019 | Fixed merge of tables in previous submission |
| 0.75370 | 0.75308 | Not recorded | 29.09.2019 | Added transactions table and reworked the flow of the entire project, making it much easier to customize |
| 0.81728 | 0.75679 | Not recorded | 29.09.2019 | Added demographic table |
| 0.84135 | 0.77716 | Not recorded | 30.09.2019 | Removed redundant features; changed join on district_id of account to district_id of client |
| 0.88148 | 0.68148 | Not recorded | 01.10.2019 | Experimented with grid-search hyperparameter tuning |
| 0.85925 | 0.73518 | Not recorded | 03.10.2019 | Changed classifying model after grid search; Decision Tree had better performance |
| 0.64197 | 0.59876 | Not recorded | 04.10.2019 | Implemented PCA |
| 0.83580 | 0.80555 | 0.781090 | 04.10.2019 | Increased local score using feature selection |
| 0.89259 | 0.75555 | 0.832430 | 04.10.2019 | Added class weighting to RandomForest and GradientBoosting |
| 0.85617 | 0.73765 | 0.848035 | 09.10.2019 | Now considering households and pensions; fixed numerical imputation not working correctly |
| 0.82839 | 0.72530 | 0.862035 | 10.10.2019 | Experimented with undersampling |
| 0.79444 | 0.64012 | 0.840876 | 10.10.2019 | Added bank demographic data |
| 0.90123 | 0.79506 | 0.842036 | 11.10.2019 | Heavy feature engineering; consistent results locally |
| 0.88333 | 0.81666 | 0.852039 | 11.10.2019 | Small local improvement using feature selection and feature engineering |
| 0.72530 | 0.71913 | 0.841861 | 12.10.2019 | Heavy feature selection, removing features without correlation to loan status |
| 0.77020 | 0.73333 | Not recorded | 15.10.2019 | Hardcore feature selection, using only 7 features |
| 0.85000 | 0.81049 | 0.824199 | 17.10.2019 | Fixed some local bugs; heavy feature selection, both automatic and manual |
| 0.79753 | 0.68827 | 0.828777 | 18.10.2019 | Very consistent results; more feature engineering and selection |
| 0.77160 | 0.75617 | 0.799563 | 19.10.2019 | Decision Tree of depth 2; constant AUC of 80%, probably small error interval |
| 0.78353 | 0.68353 | 0.937524 | 21.10.2019 | Applied backward elimination, using LinearRegression; constant local score |
| 0.70432 | 0.58271 | 0.860821 | 21.10.2019 | Feature selection using backward elimination and RFE on LogisticRegression |
| 0.71913 | 0.83395 | 0.845231 | 24.10.2019 | Using most consistent local setup, with SMOTETomek sampling and Gradient Boosting |
| 0.85864 | 0.74012 | 0.867982 | 24.10.2019 | Best local scoring setup |
| 0.83209 | 0.78641 | 0.864521 | 25.10.2019 | Random Forest with SMOTEENN and Filter Method as feature selection; locally consistent |
| 0.74074 | 0.79506 | 0.850971 | 25.10.2019 | Best local Decision Tree, with SMOTEENN and Filter Method as feature selection; likely to overfit |
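Several entries above (01.10 and 03.10) mention grid-searching hyper-parameters with AUC as the metric. A minimal sketch of that step, assuming scikit-learn and using a synthetic imbalanced dataset in place of the actual competition tables:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered loan features
# (imbalanced classes, like the loan-status label)
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.85, 0.15], random_state=42)

# Shallow depths; the 19.10 entry settled on a tree of depth 2
param_grid = {'max_depth': [2, 3, 5, None],
              'min_samples_leaf': [1, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, scoring='roc_auc', cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The dataset, grid, and depth values here are illustrative; the actual grids used in the submissions are in the notebooks.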

Useful links

License: MIT License