- Model Selection: xgboost
- Data Pre-processing Steps used
  - Remove features whose fraction of missing values exceeds the threshold (0.7) across the entire dataset.
  - Clean the dataset:
    - Handle missing values for numerical features:
      - Fill missing values with the median of each feature.
      - The median is used instead of the mean to reduce the effect of outliers.
    - Handle missing values for categorical features:
      - Fill missing values with the most frequent value (mode) of each categorical feature.
  - Encode categorical features using Target Encoding (Mean Encoding):
    - Useful when the cardinality of a categorical variable is very high.
    - Susceptible to over-fitting, however.
    - Regularization (e.g. smoothing category means toward the global mean) is used to reduce over-fitting.
  - Scale features:
    - Since each feature lies in a different range, bring them all to a common scale.
    - Used Standard Scaling, i.e. z-score normalization.
    - Note: z-score scaling equalizes feature ranges but is itself sensitive to outliers; the median imputation above is what limits their influence.
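The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the column names, the `smoothing` value, and the helper name `preprocess` are all hypothetical, and in practice the target encoding should be fit on training folds only to avoid leakage.

```python
import pandas as pd

def preprocess(df: pd.DataFrame, target: str, threshold: float = 0.7,
               smoothing: float = 10.0) -> pd.DataFrame:
    # 1. Drop features whose fraction of missing values exceeds the threshold.
    df = df[df.columns[df.isna().mean() <= threshold]].copy()

    num_cols = df.select_dtypes(include="number").columns.drop(target, errors="ignore")
    cat_cols = df.select_dtypes(exclude="number").columns

    # 2. Median imputation for numerical features (robust to outliers).
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())

    # 3. Mode (most frequent value) imputation for categorical features.
    for c in cat_cols:
        df[c] = df[c].fillna(df[c].mode()[0])

    # 4. Smoothed target (mean) encoding: category means are pulled toward
    #    the global target mean, which regularizes rare categories.
    global_mean = df[target].mean()
    for c in cat_cols:
        stats = df.groupby(c)[target].agg(["mean", "count"])
        enc = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
            stats["count"] + smoothing)
        df[c] = df[c].map(enc)

    # 5. Standard (z-score) scaling of every feature except the target.
    feats = df.columns.drop(target)
    df[feats] = (df[feats] - df[feats].mean()) / df[feats].std()
    return df
```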
```shell
git clone https://github.com/abhianand7/ReduceNPA.git
cd ReduceNPA
pip install -r requirements.txt
python main.py
```
```python
params = {'max_depth': 8, 'eta': 0.1, 'objective': 'binary:logistic', 'gamma': 0.3}
boost_rounds = 200
```
- Prepare a proper data pipeline so that the pre-processing steps can be applied without any hassle.
- Improvements on categorical features:
  - handle less frequent categories properly (e.g. group rare levels together)
  - apply more feature engineering to categorical features
  - possibly use embeddings for text-based features to add more information to the model
- grid search to find more optimal parameters for the xgboost model
- use better regularization techniques to prevent over-fitting
- overall, feature engineering requires more attention