The Drug Classification Analysis examines how a set of patient parameters (Age, Sex, BP, Cholesterol, Na_to_K) relates to the drug a patient is given, and looks for a model that captures this relationship well enough to predict the drug type.
The data comes from "Drug Classification With Different Algorithms" on Kaggle.
The 5 input features are:
- Age: age of the person (int64).
- Sex: sex of the person (object/categorical: Male or Female).
- Cholesterol: cholesterol level of the person (object/categorical: High or Normal).
- Na_to_K: sodium-to-potassium ratio of the body (float64).
- BP: blood pressure of the person (object/categorical: High, Low or Normal).
Target variable:
Drug (object/categorical)
Drug refers to the type of drug taken (through medication or direct injection).
Types:
A, B, C, X, Y
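Because three of the five features are categorical, most models need them converted to numbers before training. The following is a minimal sketch of one way to do that, assuming a simple ordinal mapping. The mapping values, the `encode_record` helper and the sample record are all illustrative assumptions, not taken from the dataset or the original notebook.

```python
# Hypothetical ordinal encodings; the category spellings in the real CSV may differ.
LEVELS = {"LOW": 0, "NORMAL": 1, "HIGH": 2}
SEX = {"FEMALE": 0, "MALE": 1}


def encode_record(record):
    """Turn one patient record (a dict) into a numeric feature vector."""
    return [
        float(record["Age"]),
        float(SEX[record["Sex"].strip().upper()]),
        float(LEVELS[record["BP"].strip().upper()]),
        float(LEVELS[record["Cholesterol"].strip().upper()]),
        float(record["Na_to_K"]),
    ]


# Made-up record, for illustration only.
sample = {"Age": 40, "Sex": "Female", "BP": "High",
          "Cholesterol": "Normal", "Na_to_K": 12.5}
features = encode_record(sample)
print(features)
```

An ordinal mapping keeps the natural order of Low, Normal and High. One-hot encoding would be a reasonable alternative for Sex, where no order exists.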
- KNN Classifier
This kernel describes the parameters of the KNN algorithm and observes their effect on the result. A first prediction is made with the default parameters and serves as the baseline for comparison. Next, the best value of each parameter is found individually and its effect on the result is discussed. Finally, grid search is used to find the best combination of parameter values, so that the results can be compared with each other in the conclusion.
KNN classifies a sample in three steps:
i) Calculate the distance to every training point
ii) Find the closest neighbors
iii) Vote for labels
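The three steps above can be sketched in plain Python. This is a minimal illustration, assuming Euclidean distance and a simple majority vote. The function names and the toy points are hypothetical and are not the code or data used in the kernel.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    # i) Calculate the distance from the query to every training point.
    distances = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    # ii) Find the k closest neighbors.
    neighbors = sorted(distances, key=lambda pair: pair[0])[:k]
    # iii) Vote for labels: the most common label among the neighbors wins.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy two-feature data, made up for illustration.
train_X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
train_y = ["X", "X", "X", "Y", "Y", "Y"]

prediction = knn_predict(train_X, train_y, [1.1, 1.0], k=3)
print(prediction)
```

The number of neighbors `k` and the distance function are the kind of parameters a grid search would tune.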
- Random Forest
Random forest (or random decision forest) is a supervised machine learning algorithm for classification, regression and other tasks, built on decision trees. The random forest classifier builds a set of decision trees, each from a randomly selected subset of the training set, and then collects the votes of the individual trees to decide the final prediction.
Each tree chooses its splits to reduce the impurity of the resulting nodes. For classification this is measured by a criterion such as entropy or Gini impurity (MSE is the corresponding criterion for regression trees).
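To make the two ideas concrete, here is a small sketch of how a classification tree can score a candidate split with Gini impurity or entropy, and how a forest combines the predictions of its trees by majority vote. The helper names and the toy labels are assumptions for illustration and do not reproduce the notebook's model.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: 0.0 for a pure node."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


def weighted_impurity(left, right, impurity):
    """Impurity of a split, weighting each child by its share of the samples."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


def forest_vote(tree_predictions):
    """Final forest prediction: the label most trees agree on."""
    return Counter(tree_predictions).most_common(1)[0][0]


parent = ["X", "X", "Y", "Y"]  # toy labels at a node before splitting
mixed_split = weighted_impurity(["X", "Y"], ["X", "Y"], gini)  # children stay mixed
clean_split = weighted_impurity(["X", "X"], ["Y", "Y"], gini)  # children are pure

print(gini(parent), mixed_split, clean_split)
print(forest_vote(["Y", "X", "Y"]))
```

A tree would prefer the second split here because it lowers the impurity the most.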
- SVM Classifier
Support Vector Machines
Generally, Support Vector Machines (SVM) are considered a classification approach, but they can be employed in both classification and regression problems. SVM can easily handle multiple continuous and categorical variables. It constructs a hyperplane in multidimensional space to separate the different classes, generating the optimal hyperplane iteratively so as to minimize the error. The core idea of SVM is to find the maximum marginal hyperplane (MMH) that best divides the dataset into classes.
i) Generate hyperplanes that separate the classes as well as possible. The left-hand figure shows three hyperplanes (black, blue and orange): the blue and orange ones have a higher classification error, while the black one separates the two classes correctly.
ii) Select the hyperplane with the maximum margin to the nearest data points of either class, as shown in the right-hand figure.
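The selection rule in these two steps can be illustrated with a few hand-picked candidate lines. This sketch only scores given hyperplanes by their margin; a real SVM solves an optimization problem instead of enumerating candidates. The points, the coefficients and the extra `shifted` candidate are made up for illustration, with the colour names borrowed from the figure description.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, X, y):
    """Smallest value of label * distance; negative if any point is misclassified."""
    return min(label * signed_distance(w, b, x) for x, label in zip(X, y))


# Toy 2-D data with labels -1 and +1 (assumed, not from the dataset).
X = [[1.0, 1.0], [2.0, 1.0], [4.0, 4.0], [5.0, 4.0]]
y = [-1, -1, 1, 1]

# Candidate hyperplanes as (w, b); the coefficients are illustrative only.
candidates = {
    "black": ([1.0, 1.0], -5.5),    # separates both classes with a wide margin
    "shifted": ([1.0, 1.0], -4.0),  # also separates them, but with a narrower margin
    "blue": ([1.0, 0.0], -1.5),     # misclassifies one point
    "orange": ([0.0, 1.0], -4.5),   # misclassifies one point
}

# i) score every candidate, ii) keep the one with the largest margin.
best = max(candidates, key=lambda name: margin(*candidates[name], X, y))
print(best)
```

The points that determine the winning margin are the ones closest to the hyperplane, which is what the term support vectors refers to.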
- Improve the data cleaning by applying standardised scaling to the non-object (numeric) variables.
- Add more features to the analysis (e.g. medication consumption rate, other mineral components consumed).
- Check for multicollinearity between the parameters to assess their significance.
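As a starting point for the scaling and multicollinearity items, the sketch below standardises a numeric column to zero mean and unit variance and computes the Pearson correlation between two columns. It assumes the population standard deviation, and the values are made up for illustration and are not taken from the dataset.

```python
import math


def standardise(values):
    """Z-score scaling: subtract the mean, divide by the population std."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def pearson(xs, ys):
    """Pearson correlation, computed as the mean product of z-scores."""
    zx, zy = standardise(xs), standardise(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


# Hypothetical numeric columns, for illustration only.
age = [20.0, 30.0, 40.0, 50.0, 60.0]
na_to_k = [25.0, 10.0, 18.0, 9.0, 30.0]

scaled_age = standardise(age)
correlation = pearson(age, na_to_k)
print(scaled_age)
print(correlation)
```

A correlation close to 1 or -1 between two features would hint at multicollinearity. With more than two numeric features, a full correlation matrix or variance inflation factors would give a more complete picture.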