ankishb / all-in-one-place

all-in-one-place

data-science-book

python-imp

  • set
  • sort
  • slicing

README-data-science

Feature extraction

  • Categorical encoding by contrast coding
  • Frequency Encoding
  • Feature Engineering
    • Indicator Variables: Binning
    • Interaction Features: sum, multiply, max, statistics
    • Feature Representation: Encoding
    • Error Analysis (Post-Modeling): Feature Selection, cross-validation
    • External Data: Domain Knowledge
  • Feature Selection
    • Filter Methods
    • Wrapper Methods
    • Embedded Methods
    • Difference between Filter and Wrapper methods
  • Feature Selection (kaggle)
    1. Feature selection with correlation and random forest classification
    2. Univariate feature selection and random forest classification
    3. Recursive feature elimination (RFE) with random forest
    4. Recursive feature elimination with cross validation and random forest classification
    5. Tree based feature selection and random forest classification
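
A minimal sketch of item 3 above (recursive feature elimination with a random forest), assuming scikit-learn and its bundled breast-cancer dataset as a stand-in for the actual competition data:

```python
# Sketch only: the dataset and scoring metric are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)

# Recursive feature elimination with cross-validation, driven by a random forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
selector = RFECV(rf, step=1, cv=StratifiedKFold(5), scoring="accuracy")
selector.fit(X, y)

print("Optimal number of features:", selector.n_features_)
print("Selected feature mask:", selector.support_)
```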

practical-data-science

  • Stacking (StackNet)

  • Ensemble Model:

  • Data-Leakage

  • Handle Text-Data

  • TF-IDF

  • CountVectorizer

  • Topic Modelling

  • Lexicon Normalization

  • Noise Removal in text -stop-words-

  • Classical Validation

  • Nested Validation

  • Why is Cross-Validation Different with Time Series?

  • TfidfVectorizer -example-docs

  • GridSearchCV -example-docs
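
A minimal sketch combining the TfidfVectorizer and GridSearchCV items above into a single pipeline; the 20-newsgroups corpus and the parameter grid are illustrative stand-ins, not taken from the repo's notebooks:

```python
# Sketch only: corpus, classifier, and grid values are illustrative assumptions.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tune the vectorizer and the classifier jointly
params = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, params, cv=5, scoring="accuracy")
search.fit(data.data, data.target)
print(search.best_params_, search.best_score_)
```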

temp.md [full of practical exp]

  • Stratified sampling or splitting

  • Feature Selection using RandomForest(ensemble method)

  • Distribution Check

  • scipy.stats.ks_2samp(data1, data2)

  • Dimensionality reduction by preserving pairwise distances between samples (best for distance-based methods)

    1. GaussianRandomProjection
    2. SparseRandomProjection
  • Feature Engineering(Stat)

  • Check all Missing Data

  • Transforming some numerical variables that are really categorical

  • Handling skewed features with the Box-Cox transformation

  • Base models -examples-with-cv

    • LASSO Regression
    • Elastic Net Regression
    • Gradient Boosting Regression
    • KernelRidge
    • XGB
    • GBM
    • LGB
  • StackingAveragingModel -link-

  • LGB model -full example-
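
A minimal sketch of the scipy.stats.ks_2samp distribution check listed above: compare each feature's train and test distributions and flag columns that drift. The DataFrames and the 0.05 threshold are illustrative assumptions:

```python
# Sketch only: the synthetic train/test frames stand in for real data.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train = pd.DataFrame({"f1": rng.normal(0, 1, 1000), "f2": rng.normal(0, 1, 1000)})
test  = pd.DataFrame({"f1": rng.normal(0, 1, 1000), "f2": rng.normal(0.5, 1, 1000)})

for col in train.columns:
    stat, p_value = ks_2samp(train[col], test[col])
    if p_value < 0.05:  # distributions differ significantly
        print(f"{col}: possible train/test drift (KS={stat:.3f}, p={p_value:.4f})")
```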

ensemble-technique (complete exp and parameters list)

  • Difference between XGBoost and LightGBM

  • Simple Ensemble Techniques

    • Max Voting
    • Averaging
    • Weighted Averaging
  • Advanced Ensemble techniques

    • Stacking
    • Blending
  • Bagging:

  • Boosting

  • Bagging algorithms:

    • Bagging meta-estimator
    • Random forest
  • Boosting algorithms:

    • AdaBoost
    • GBM
    • XGBM
    • LightGBM
    • CatBoost
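
A minimal sketch of the simple ensemble techniques above (max voting and averaging of predicted probabilities) using scikit-learn's VotingClassifier; the base models and dataset are illustrative:

```python
# Sketch only: base learners and dataset are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(max_depth=3)),
    ("rf", RandomForestClassifier(n_estimators=100)),
]

hard_vote = VotingClassifier(models, voting="hard")  # max voting on class labels
soft_vote = VotingClassifier(models, voting="soft")  # averaging of predicted probabilities

for name, model in [("max voting", hard_vote), ("averaging", soft_vote)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```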

all-about-xgboost

  • XGBModel (complete doc)

  • Custom objective and evaluation metric functions

  • General Approach for Parameter Tuning

    • Control Overfitting:
    • Handle Imbalanced Dataset
  • Use XGBClassifier with XGBoost's built-in cv function

    1. Fix learning rate and number of estimators for tuning tree-based parameters
    2. Tune max_depth and min_child_weight
    3. Tune gamma
    4. Tune subsample and colsample_bytree
    5. Tuning Regularization Parameters
    6. Reduce the learning rate and repeat the steps above
  • Stratified sampling

  • eval_result
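
A minimal sketch of tuning steps 1-2 above, assuming the standard xgboost and scikit-learn APIs; the dataset and parameter grids are illustrative:

```python
# Sketch only: dataset, fixed parameters, and grid values are illustrative assumptions.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Step 1: fix the learning rate and use xgb.cv to pick n_estimators
dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 5,
          "subsample": 0.8, "colsample_bytree": 0.8, "eval_metric": "auc"}
cv_results = xgb.cv(params, dtrain, num_boost_round=500, nfold=5,
                    stratified=True, early_stopping_rounds=50)
n_estimators = len(cv_results)
print("best n_estimators:", n_estimators)

# Step 2: tune max_depth and min_child_weight with the sklearn wrapper
grid = GridSearchCV(
    xgb.XGBClassifier(learning_rate=0.1, n_estimators=n_estimators,
                      subsample=0.8, colsample_bytree=0.8),
    {"max_depth": [3, 5, 7], "min_child_weight": [1, 3, 5]},
    scoring="roc_auc", cv=5)
grid.fit(X, y)
print(grid.best_params_)
```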

Matplotlib-hacks

  • Some-Style-Background
  • Seaborn plots
    • Violin plot
    • Joint plot
    • Swarm plot
    • Pair Grid plot
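
A minimal sketch of two of the seaborn plots listed above, using seaborn's bundled "tips" dataset as an illustrative stand-in:

```python
# Sketch only: the "tips" example dataset stands in for real data.
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

sns.violinplot(data=tips, x="day", y="total_bill")  # distribution per category
plt.title("Violin plot of total bill by day")
plt.show()

sns.jointplot(data=tips, x="total_bill", y="tip", kind="scatter")  # joint + marginal distributions
plt.show()
```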

Keras-short-docs

  • verbose
  • ModelCheckpoint
  • EarlyStopping
  • ReduceLROnPlateau
  • fit
  • fit_generator
  • Custom Training -complete-training
  • History object
  • LSTM
    • return-sequence
    • backward
    • bi-direction
  • TimeDistributed
  • Stratified k-fold training
  • TensorBoard
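
A minimal sketch wiring the ModelCheckpoint, EarlyStopping, and ReduceLROnPlateau callbacks into model.fit; the tiny model and the random data are illustrative assumptions:

```python
# Sketch only: model architecture and data are illustrative assumptions.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=(1000,))

model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True),
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),
]

history = model.fit(X, y, validation_split=0.2, epochs=50,
                    verbose=1, callbacks=callbacks)
print(history.history.keys())  # the History object holds per-epoch metrics
```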

keras-another-docs

  • Usage of regularizers
  • Custom loss function
  • Usage of initializers
  • Keras Models
  • Usage of optimizers
    • SGD
    • RMSprop
    • Adagrad
    • Adadelta
    • Adam
    • Nadam
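
A minimal sketch of regularizer, initializer, custom-loss, and optimizer usage in Keras; the layer sizes and the loss definition are illustrative:

```python
# Sketch only: layer sizes and the loss definition are illustrative assumptions.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def custom_mse(y_true, y_pred):
    # A plain mean-squared error written as a custom Keras loss
    return tf.reduce_mean(tf.square(y_true - y_pred))

model = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4),  # usage of a regularizer
                 kernel_initializer="he_normal"),           # usage of an initializer
    layers.Dense(1),
])

# Usage of an optimizer: RMSprop with an explicit learning rate
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=1e-3),
              loss=custom_mse)
model.summary()
```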

text-handling

  • SVD
  • Create the pipeline with gridsearch
  • Word-Vectors
    • load the GloVe vectors in a dictionary:
    • normalized vector for the whole sentence
  • Using Keras for all text handling
    • tokenizer
    • pad-seq
    • load embedding(from trained glove model)
    • model to train
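
A minimal sketch of the Keras tokenizer / pad-sequences / GloVe-embedding steps above; glove.6B.100d.txt is an assumed local path to pretrained GloVe vectors, and the toy texts are illustrative:

```python
# Sketch only: texts, vocabulary size, and the GloVe file path are assumptions.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["the cat sat on the mat", "dogs are great pets"]

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=20)  # pad/truncate to a fixed length

# Load GloVe vectors into a dictionary {word: vector}
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Build the embedding matrix used to initialise an Embedding layer
embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, 100))
for word, i in tokenizer.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
```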

data-science-tricks

  • Important key usage of pandas -link-
  • Find columns by a partially matched name
  • pandas DataFrame
  • Replace outliers with the median of that feature
  • Remove outliers
  • Merging array in the DataFrame
  • One-Hot Encoding
  • pd.dummy_variable
  • convert data-type
  • Ways to replace outliers and NaN values with the median, mean, or something else
  • Draw subplots (histograms or others) for a few columns
  • apply function
  • change value of one column using 'loc'


Languages

Jupyter Notebook 96.0%, Python 4.0%