This is just a short description of my workflow. If you want to see the whole work, take a look at the newPima Jupyter notebook or the HTML file.
If you know some supervised learning and want to apply your knowledge with different classifiers, this is one of the best data sets to start with.
Latest accuracy results:
GradientBoostingClassifier mean accuracy: 82.485 %
SVM classifier with linear kernel mean accuracy: 83.863 %
MLPClassifier (neural network) mean accuracy: 85.153 %
RandomForestClassifier mean accuracy: 76.593 %
ExtraTreesClassifier mean accuracy: 76.29 %
Requirements:
- Python 3.6
- numpy
- pandas
- matplotlib
- seaborn
- tensorflow
Data:
We have 768 instances and the following 8 attributes:
- Number of times pregnant (preg)
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test (plas)
- Diastolic blood pressure in mm Hg (pres)
- Triceps skin fold thickness in mm (skin)
- 2-hour serum insulin in mu U/ml (insu)
- Body mass index measured as weight in kg/(height in m)^2 (mass)
- Diabetes pedigree function (pedi)
- Age in years (age)
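
A minimal sketch of loading a table with these eight attributes into pandas. It assumes a headerless CSV whose ninth column holds the diabetes label; the column name `class`, the helper `load_pima`, and the two sample rows are made up for illustration and are not taken from the notebook.

```python
import io

import pandas as pd

# Short names from the attribute list above.
# "class" is an assumed name for the label column.
COLUMNS = ["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age", "class"]


def load_pima(source):
    """Read a headerless CSV (a path or a file-like object) into a DataFrame."""
    return pd.read_csv(source, header=None, names=COLUMNS)


# Two made-up rows, only to show the call.
# A real run would pass the path of the data file instead.
sample = io.StringIO(
    "2,120,70,30,0,28.5,0.45,35,1\n"
    "0,95,64,22,80,24.1,0.20,27,0\n"
)
df = load_pima(sample)
print(df.head())
```

With the real file, `df.shape` should report 768 rows, matching the instance count above.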
Steps:
- Look at the data carefully: in this part we see that there are many zero or empty values in the data set.
- Visualize the data: first I used matplotlib to visualize the data, then I used Weka for an experiment. Weka gave me more sensible results.
- Count the number of empty or zero values in each feature.
- Choose the features that work best for the data set. For example, Number of times pregnant is not a very good feature to use. I used Weka to choose the best features.
- I found four features to be useful:
  1. Age
  2. BMI
  3. Glucose
  4. DiabetesPedigreeFunction
- Then I applied seven different classifiers to this data set, and yes, I kept the training data and the test data separate. I used the scikit-learn library.
  1. K nearest neighbors
  2. Decision Tree Classifier
  3. SVM classifier with RBF kernel
  4. SVM classifier with linear kernel
  5. Gaussian Naive Bayes
  6. GradientBoostingClassifier
  7. MLPClassifier (neural network)
- No, we are not done: this reaches just 77.407% accuracy.
- Replace the zero or empty values with the mean value of their feature.
- Then visualize the data again.
- Select the 4 features that are useful.
- Apply the classifiers again.
- No, we are not done yet. Just 79.65 %.
- Take the top 3 classifiers that work well for the data set.
- Read the scikit-learn documentation for those classifiers and learn about their parameters.
- Then try different combinations of parameters. If you have taken an ML course, you should know how to do this wisely.
- At last, with the MLP classifier we get 85% by using 4 hidden layers with 15, 7, 7, and 3 neurons respectively.
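
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual code: the helper names, the `class` label column, the choice to average only the non-zero entries, the `StandardScaler` step, and the `max_iter` and `random_state` values are all my own. The data here is synthetic, so the printed accuracy says nothing about the 85% reported above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The four selected features: Age, BMI, Glucose, DiabetesPedigreeFunction.
FEATURES = ["age", "mass", "plas", "pedi"]


def count_zeros(df):
    """Number of zero entries in each column."""
    return (df == 0).sum()


def replace_zeros_with_mean(df, columns):
    """Return a copy where zeros in `columns` become the column mean.

    Assumption: the mean is taken over the non-zero entries only.
    """
    out = df.copy()
    for col in columns:
        nonzero_mean = out.loc[out[col] != 0, col].mean()
        out[col] = out[col].astype(float).replace(0, nonzero_mean)
    return out


def build_mlp(random_state=0):
    """MLP with hidden layers of 15, 7, 7 and 3 neurons.

    Scaling and max_iter are assumptions added so the sketch trains cleanly.
    """
    return make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(15, 7, 7, 3), max_iter=2000,
                      random_state=random_state),
    )


# Synthetic stand-in data (NOT the real data set), just to exercise the steps.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "age": rng.integers(21, 80, n),
    "mass": rng.uniform(18, 45, n).round(1),
    "plas": rng.integers(60, 200, n).astype(float),
    "pedi": rng.uniform(0.1, 1.5, n).round(3),
})
demo["class"] = (demo["plas"] + demo["mass"] > 150).astype(int)
demo.loc[:9, "plas"] = 0  # pretend the first ten glucose readings are missing

print(count_zeros(demo))
clean = replace_zeros_with_mean(demo, ["plas", "mass"])

# Keep training and test data separate.
X_train, X_test, y_train, y_test = train_test_split(
    clean[FEATURES], clean["class"],
    test_size=0.3, random_state=0, stratify=clean["class"],
)
model = build_mlp().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)  # mean accuracy on the held-out split
print(f"held-out accuracy on synthetic data: {accuracy:.3f}")
```

To try other parameter combinations, changing `hidden_layer_sizes` inside `build_mlp` and comparing `model.score` on the same split is the simplest starting point.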