EdgarACarneiro / feup-ecac

feup-ecac

Project developed for 'Knowledge Extraction and Machine Learning', a fifth year subject @FEUP. Made in collaboration with @cyrilico.

A summary of the theoretical material is available here.

The folder /research-project contains the materials necessary to develop the ECAC research project (2nd project).

Project Grade

| Component | Grade |
| --- | --- |
| Project | 20 |
| Classification | 17 |

Usage

To run the Jupyter notebooks, first run the following commands in a terminal with Python 3 available:

  • On Mac/Linux
python3 -m venv venv
. venv/bin/activate
pip install -U -r requirements.txt
jupyter notebook
  • On Windows
py -3 -m venv venv
venv\Scripts\activate
pip install -U -r requirements.txt
jupyter notebook

When you are done, the virtual environment can be terminated by running:

deactivate

In the Jupyter notebook web page, open the pre_processing.ipynb file first and simply run all cells.

Thereafter, open the prediction.ipynb file. Note that this file uses the data output by the preprocessing step. Again, run all cells, but pay attention to the inline comments highlighting cells that can be changed to better suit your needs, for example:

# CHANGE THIS LINE TO CHANGE THE USED CLASSIFICATION METHOD
classifier = create_DT()

After running, you should find your predictions in the file you indicated, in the desired format.
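The `create_DT` call above comes from the notebook; its definition is not shown here. A minimal sketch of what such a factory might look like, assuming scikit-learn and a hypothetical `create_RF` alternative (both names illustrative, not the notebook's actual code):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def create_DT(max_depth=None):
    # Decision Tree, the model used in several submissions
    return DecisionTreeClassifier(max_depth=max_depth, random_state=42)

def create_RF(n_estimators=100):
    # Random Forest alternative; class_weight='balanced' helps with
    # the imbalanced loan-status labels mentioned in the history below
    return RandomForestClassifier(n_estimators=n_estimators,
                                  class_weight='balanced',
                                  random_state=42)

# CHANGE THIS LINE TO CHANGE THE USED CLASSIFICATION METHOD
classifier = create_DT()
```

Swapping the factory call is then the only change needed to compare models in the notebook.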

Final submission

Final presentation slides available here.

Final leaderboards available here - placed 9️⃣.

Submission history

  • ❗ : Submissions selected for competition scoring. Notice that we did not have access to the private score when choosing the two submissions.
| Public Score | Private Score | Local Score | Date | Improvement over previous submission |
| --- | --- | --- | --- | --- |
| 0.59259 | 0.57160 | Not recorded | 23.09.2019 | Decision Tree without feature engineering, using only the loan table |
| 0.61049 | 0.59876 | Not recorded | 23.09.2019 | Joined account table, replaced loan date with the number of days since account creation and categorized the account's frequency |
| 0.56543 | 0.61728 | Not recorded | 24.09.2019 | Added categorical columns and a column with the number of days since the first loan ever |
| 0.62839 | 0.65864 | Not recorded | 24.09.2019 | Removed number of days since first loan ever; added the number of account users and their credit card types, re-added loan date |
| 0.50000 | 0.50000 | Not recorded | 25.09.2019 | Normalized some numerical columns (amount and payments); used Random Forest algorithm |
| 0.62839 | 0.58888 | Not recorded | 26.09.2019 | Added new features (such as monthly_loan, monthly_loan-to-monthly_receiving & monthly_only_receiving), removed ones without impact and changed to Decision Tree |
| 0.59259 | 0.63209 | Not recorded | 26.09.2019 | Removed loan_id feature |
| 0.57716 | 0.60802 | Not recorded | 27.09.2019 | Fixed merge of tables in previous submission |
| 0.75370 | 0.75308 | Not recorded | 29.09.2019 | Added transactions table and reworked the flow of the entire project, making it much easier to customize |
| 0.81728 | 0.75679 | Not recorded | 29.09.2019 | Added demographic table |
| 0.84135 | 0.77716 | Not recorded | 30.09.2019 | Removed redundant features; changed join on district_id of account to district_id of client |
| 0.88148 | 0.68148 | Not recorded | 01.10.2019 | Experimented with grid-search hyperparameter tuning |
| 0.85925 | 0.73518 | Not recorded | 03.10.2019 | Changed classifying model after grid search; Decision Tree had better performance |
| 0.64197 | 0.59876 | Not recorded | 04.10.2019 | Implemented PCA |
| 0.83580 | 0.80555 | 0.781090 | 04.10.2019 | Increased local score using feature selection |
| 0.89259 | 0.75555 | 0.832430 | 04.10.2019 | Added class weighting to RandomForest and GradientBoosting |
| 0.85617 | 0.73765 | 0.848035 | 09.10.2019 | Now considering households and pensions; fixed numerical imputation not working correctly |
| 0.82839 | 0.72530 | 0.862035 | 10.10.2019 | Experimented with undersampling |
| 0.79444 | 0.64012 | 0.840876 | 10.10.2019 | Added bank demographic data |
| 0.90123 | 0.79506 | 0.842036 | 11.10.2019 | Heavy feature engineering; consistent results locally |
| 0.88333 | 0.81666 | 0.852039 | 11.10.2019 | Small local improvement using feature selection and feature engineering |
| 0.72530 | 0.71913 | 0.841861 | 12.10.2019 | Heavy feature selection, removing features without correlation to loan status |
| 0.77020 | 0.73333 | Not recorded | 15.10.2019 | Hardcore feature selection, using only 7 features |
| 0.85000 | 0.81049 | 0.824199 | 17.10.2019 | Fixed some local bugs; heavy feature selection, both automatic and manual |
| 0.79753 | 0.68827 | 0.828777 | 18.10.2019 | Very consistent results; more feature engineering and selection |
| 0.77160 | 0.75617 | 0.799563 | 19.10.2019 | Decision Tree of depth 2; constant AUC of 80%, probably small error interval |
| 0.78353 | 0.68353 | 0.937524 | 21.10.2019 | Applied backward elimination, using LinearRegression; constant local score |
| 0.70432 | 0.58271 | 0.860821 | 21.10.2019 | Feature selection using backward elimination and RFE on LogisticRegression |
| 0.71913 | 0.83395 | 0.845231 | 24.10.2019 | Using most consistent local setup, with SMOTETomek sampling and Gradient Boosting |
| 0.85864 | 0.74012 | 0.867982 | 24.10.2019 | Best local scoring setup |
| 0.83209 | 0.78641 | 0.864521 | 25.10.2019 | Random Forest with SMOTEENN and Filter Method as feature selection; locally consistent |
| 0.74074 | 0.79506 | 0.850971 | 25.10.2019 | Best local Decision Tree, with SMOTEENN and Filter Method as feature selection; likely to overfit |
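Several entries above (01.10 and 03.10) mention grid-searching hyper-parameters with AUC as the metric. A minimal sketch of that step, assuming scikit-learn and using a synthetic imbalanced dataset in place of the actual competition tables:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered loan features
# (imbalanced classes, like the loan-status label)
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.85, 0.15], random_state=42)

# Shallow depths; the 19.10 entry settled on a tree of depth 2
param_grid = {'max_depth': [2, 3, 5, None],
              'min_samples_leaf': [1, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, scoring='roc_auc', cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The dataset, grid, and depth values here are illustrative; the actual grids used in the submissions are in the notebooks.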

Useful links

License: MIT License