DeepmindHub / AnalyticsVidhya-Hackathon-3.X

Data Preprocessing (R Code)
  1. Examined the levels of the data and built a data dictionary recording the level gaps, after noticing that the training and test sets differ in levels (e.g. some cities appear only in the training set and others only in the test set); see the first sketch after this list

  2. Cleaned the City and Employee Name columns by removing extra spaces and normalizing the letter case

  3. Removed extra levels from City: based on city counts, kept the most frequent cities and mapped the rest to "Others", reducing the column to 15 levels

  4. Removed extra levels from Employee Name in the same way, mapping all employers with fewer than 30 cases to "Others"

  5. Extracted Day, Month and Year from the DOB column and then dropped DOB because of its many levels (steps 5-8 are illustrated in the second sketch after this list)

  6. Extracted Day and Month from Lead Creation Date, but kept the Lead Creation Date column itself

  7. Filled missing values of Loan Amount Submitted and Tenure Submitted from Loan Amount Applied and Tenure Applied

  8. Replaced missing values of Processing Fee with zero

  9. Imputed the remaining missing values of Interest Rate, Loan Amount Submitted and Loan Tenure with bagged-tree imputation (bagImpute) from R's caret package; a rough Python stand-in is the third sketch after this list

  10. Created a new variable EMI_calculated using the standard EMI formula E = P * r * (1 + r)^n / ((1 + r)^n - 1), where P is the principal, r the per-period interest rate and n the number of payments (see the feature-engineering sketch after this list)

  11. Created a new ratio variable Future_EMI_perincome: (Existing EMI + EMI Submitted) / Monthly Income, with values capped at 2

  12. Capped outliers in Monthly Income: any value greater than 1,000,000 was set to 1,000,000

  13. Created a new variable Process_percent: (Processing Fee / Monthly Income) * 100, capped at 40

  14. Created two more ratio variables: exist_EMI_perincome (Existing EMI / Monthly Income) and exx_EMI_perincome (EMI_calculated / Monthly Income)
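
The original preprocessing was written in R; as a loose illustration only, the sketches below re-express some of the steps in Python/pandas. This first one covers the train/test level check (step 1) and the collapsing of rare City and Employee Name levels to "Others" (steps 3-4). File and column names such as train.csv, City and Employee_Name are assumptions, not taken from the actual script.

```python
import pandas as pd

# Hypothetical file and column names for the hackathon data.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Step 1: count levels that exist in one split but not the other.
for col in ["City", "Employee_Name"]:
    train_levels = set(train[col].dropna().unique())
    test_levels = set(test[col].dropna().unique())
    print(col, "| only in train:", len(train_levels - test_levels),
          "| only in test:", len(test_levels - train_levels))

# Step 3: keep the 14 most frequent cities so the final column has
# 15 levels including "Others".
top_cities = train["City"].value_counts().nlargest(14).index
train["City"] = train["City"].where(train["City"].isin(top_cities), "Others")

# Step 4: employers with fewer than 30 cases become "Others".
emp_counts = train["Employee_Name"].value_counts()
frequent = emp_counts[emp_counts >= 30].index
train["Employee_Name"] = train["Employee_Name"].where(
    train["Employee_Name"].isin(frequent), "Others")
```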
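A second sketch for the date features and the simple missing-value fills (steps 5-8), again with assumed column names (DOB, Lead_Creation_Date, Loan_Amount_Submitted, etc.):

```python
import pandas as pd

# dayfirst=True is an assumption about the raw date format.
df = pd.read_csv("train.csv", parse_dates=["DOB", "Lead_Creation_Date"],
                 dayfirst=True)

# Steps 5-6: split the dates into components; drop DOB (too many
# levels) but keep Lead_Creation_Date itself.
df["DOB_Day"] = df["DOB"].dt.day
df["DOB_Month"] = df["DOB"].dt.month
df["DOB_Year"] = df["DOB"].dt.year
df = df.drop(columns=["DOB"])
df["Lead_Day"] = df["Lead_Creation_Date"].dt.day
df["Lead_Month"] = df["Lead_Creation_Date"].dt.month

# Step 7: fall back to the applied amount/tenure where the submitted
# values are missing.
df["Loan_Amount_Submitted"] = df["Loan_Amount_Submitted"].fillna(
    df["Loan_Amount_Applied"])
df["Loan_Tenure_Submitted"] = df["Loan_Tenure_Submitted"].fillna(
    df["Loan_Tenure_Applied"])

# Step 8: a missing processing fee is treated as zero.
df["Processing_Fee"] = df["Processing_Fee"].fillna(0)
```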
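Step 9 used caret's bagImpute (bagged regression trees fitted per column). There is no drop-in pandas equivalent; the third sketch below uses scikit-learn's IterativeImputer with a tree ensemble, which is a different algorithm and shown only to illustrate model-based imputation:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

# Continuing from the frame above: fill the remaining numeric gaps
# using the other columns in the list as predictors.
cols = ["Interest_Rate", "Loan_Amount_Submitted", "Loan_Tenure_Submitted",
        "Monthly_Income"]
imputer = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
    max_iter=5, random_state=0)
df[cols] = imputer.fit_transform(df[cols])
```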
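Finally, the feature-engineering sketch for steps 10-14 in the same style. The monthly scaling of the interest rate and tenure is an assumption (the original R code may have scaled them differently), and EMI_Loan_Submitted is an assumed name for the "EMI submitted" field:

```python
# Step 10: EMI formula E = P*r*(1+r)^n / ((1+r)^n - 1).
P = df["Loan_Amount_Submitted"]
r = df["Interest_Rate"] / 100 / 12       # assumed annual % -> monthly rate
n = df["Loan_Tenure_Submitted"] * 12     # assumed tenure in years -> months
df["EMI_calculated"] = P * r * (1 + r) ** n / ((1 + r) ** n - 1)

# Step 12: cap Monthly Income outliers at 1,000,000.
df["Monthly_Income"] = df["Monthly_Income"].clip(upper=1_000_000)

# Steps 11, 13, 14: income ratios, capped where the list says so.
df["Future_EMI_perincome"] = ((df["Existing_EMI"] + df["EMI_Loan_Submitted"])
                              / df["Monthly_Income"]).clip(upper=2)
df["Process_percent"] = (df["Processing_Fee"] / df["Monthly_Income"]
                         * 100).clip(upper=40)
df["exist_EMI_perincome"] = df["Existing_EMI"] / df["Monthly_Income"]
df["exx_EMI_perincome"] = df["EMI_calculated"] / df["Monthly_Income"]
```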

Modelling (Python)
  1. Used Extreme Gradient Boosting (XGBoost) and optimized the tuning parameters on the local CV score, since many solutions high on the leaderboard had proven to be overfit in the Weekender version; a minimal tuning sketch follows this list

  2. The final XGB model had a local 4-fold CV score of 0.854141 ± 0.004308 and a leaderboard score of 0.85456

  3. Used a Random Forest classifier (1,000 trees) and tuned it on a 75:25 train/validation split (second sketch after this list)

  4. The final RF model had a local score of 0.84233 and a leaderboard score of 0.85213

  5. Finally, used rank-average ensembling for the final solution, with weights (2 * XGB_score + RF_score) / 3; the last sketch after this list shows the idea
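
A minimal sketch of the CV-driven XGBoost tuning described in steps 1-2, using xgboost's scikit-learn API. The hyperparameter values are placeholders, not the winning settings, and scoring="roc_auc" assumes the competition metric was AUC, consistent with the ~0.85 scores quoted above:

```python
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, cross_val_score

# X, y: the preprocessed feature matrix and the target.
clf = xgb.XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.05,
                        subsample=0.8, colsample_bytree=0.8)
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print("4-fold CV: %.6f +/- %.6f" % (scores.mean(), scores.std()))
```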
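The Random Forest of steps 3-4 on a 75:25 holdout, under the same assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Step 3: 75:25 split, 1,000 trees.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=1000, n_jobs=-1, random_state=0)
rf.fit(X_tr, y_tr)
print("holdout score:", roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1]))
```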
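Rank averaging (step 5) replaces each model's test predictions with their ranks before taking the weighted average, so differently calibrated scores combine fairly. A minimal sketch with the (2 * XGB + RF) / 3 weighting; xgb_test and rf_test stand for the two models' test-set probabilities:

```python
import numpy as np
from scipy.stats import rankdata

def rank_average(preds, weights):
    """Weighted average of per-model prediction ranks, rescaled to (0, 1]."""
    ranks = [rankdata(p) / len(p) for p in preds]
    return np.average(ranks, axis=0, weights=weights)

final = rank_average([xgb_test, rf_test], weights=[2, 1])
```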

Languages

R 62.0%, Python 38.0%