- Categorical Variables
- Linear Model Assumptions - Do the variables fulfill those assumptions?
- Distributions - Normal, skewed, and others
- Outliers
- Feature Magnitude - Scale of the different features
- Strings are incompatible with Scikit-Learn
- Variables with high cardinality OR Uneven Distibution between train and test sets results in overfitting in tree based algorithms and operational problems
- Reducing cardinality may help improve model performance
- Error in operation/production occur becuase the model is learning from the labels seen in the train set and a new unseen label comes then thhe model will be un able to perform calculation
Mathematical transformations
- Logarithmic - Log(X) --> x>0
- Reciprocal - 1/X -->x!=0
- Power/Exponential X exp(lambda)/sqrt(X)/cube(X) --> Not defined foe all x
- Exponential special cases - Box Cox(x>0)/Yeo Johnson
Discretisation(Binning) is the process of transforming continious variables into discrete variables by creating a set od intervals that span the range of the variables values Discretisation helps handle outliers and may improve value spread in skewed variables.
- Equal width discretisation
- Equal frequency discretisation
- K-means discretisation
- Discretisation using decision trees
- Day
- Name - Mon, Tue, Wed,...
- Number - 1-7
- IsBusinessDay, IsHoliday, IsWeekend
- Month
- Name -Jan, Feb, March,...
- Number - 1-12
- Quarter
- Semester
- Year
- Hour Did TS take place in morning, afternoon, night
- Minutes
- Seconds
- Days
- Months
- Years
- Hrs
- Minutes
- Seconds
Payment data 1 ('29-08-1987 15:20.20+02') ---> Payment data 1 ('29-08-1987 13:20.20+00') Payment data 2 ('29-10-1993 15:20.20+05') ---> Payment data 2 ('29-10-1993 10:20.20+00')