Practicing Various Models on Fertility Data Set
100 volunteers provided a semen sample that was analyzed according to the WHO 2010 criteria. Sperm concentration is related to socio-demographic data, environmental factors, health status, and life habits.
- Data Set Characteristics: Multivariate
- Attribute Characteristics: Real
- Area: Life
- Number Of Attributes: 10
- Number Of Records: 100
To get the data set for yourself click here
To know more about the data click here
- Season in which the analysis was performed. 1) winter, 2) spring, 3) summer, 4) fall. (-1, -0.33, 0.33, 1)
- Age at the time of analysis. 18-36 (0, 1)
- Childhood diseases (i.e., chicken pox, measles, mumps, polio) 1) yes, 2) no. (0, 1)
- Accident or serious trauma 1) yes, 2) no. (0, 1)
- Surgical intervention 1) yes, 2) no. (0, 1)
- High fevers in the last year 1) less than three months ago, 2) more than three months ago, 3) no. (-1, 0, 1)
- Frequency of alcohol consumption 1) several times a day, 2) every day, 3) several times a week, 4) once a week, 5) hardly ever or never (0, 1)
- Smoking habit 1) never, 2) occasional 3) daily. (-1, 0, 1)
- Number of hours spent sitting per day: 1-16 (0, 1)
- Output: Diagnosis normal (N), altered (O)
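The encoded values listed above can be turned back into readable labels with a few lookup tables. This is a minimal sketch, not code from this repository: the names `SEASONS`, `FEVER`, `SMOKING`, `decode_age` and `encode_label` are hypothetical, the age decoding assumes plain linear min-max scaling of 18-36 onto (0, 1), and mapping N to class 0 and O to class 1 is an assumption as well.

```python
# Lookup tables built from the attribute list above.
SEASONS = {-1.0: "winter", -0.33: "spring", 0.33: "summer", 1.0: "fall"}
FEVER = {-1: "less than three months ago", 0: "more than three months ago", 1: "no"}
SMOKING = {-1: "never", 0: "occasional", 1: "daily"}


def decode_age(x, lo=18, hi=36):
    """Map a normalized age in [0, 1] back to years.

    Assumption: the data set used linear min-max scaling.
    """
    return lo + x * (hi - lo)


def encode_label(diagnosis):
    """Turn the Output column into an integer class.

    Assumption: N (normal) -> 0 and O (altered) -> 1.
    """
    return {"N": 0, "O": 1}[diagnosis]


print(SEASONS[-0.33], decode_age(0.5), encode_label("O"))
```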
Used 4 models for now:
- KNN
- LogisticRegression
- SVC
- ANN
Procedure
- Data was loaded into Python using the Pandas library.
- Data was split into training and testing sets using the train_test_split function from sklearn's model_selection module.
- For KNN, a loop was used to try n_neighbors values from 1 to 40. The mean error rate was plotted for each value; every value greater than 1 gives the same, lowest error, so the smallest such value, n_neighbors = 2, was chosen.
- For SVC, GridSearchCV from sklearn's model_selection module was used to find the best combination of the SVC parameters C and gamma. The best combination was found to be 'C': 0.1 and 'gamma': 0.1.
- Using this estimator, the SVC model was fit on the training data and then tested.
- For Logistic Regression, the model was simply fit on the training data and evaluated on the test data.
- The models were evaluated with sklearn's metrics module. The classification_report and confusion_matrix functions were used to check the accuracy of each model.
- For ANN, a simple shallow neural network was built with one hidden layer. The Keras library was used with the 'accuracy' metric.
- All the models reach the same accuracy score, 91%.
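The split, the KNN sweep and the SVC grid search described in the procedure could look roughly like the sketch below. It is an illustration only: the data comes from `make_classification` as a stand-in for the real file (100 rows, 9 features, mostly class 0), the candidate values in `param_grid` and the `random_state` values are assumptions, and the numbers it prints will not match the results reported here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Stand-in data (assumption): synthetic, imbalanced, same shape as the data set.
X, y = make_classification(n_samples=100, n_features=9, weights=[0.88],
                           random_state=0)

# Hold out 33 records for testing, which matches n=33 in the confusion matrix.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=33, random_state=0)

# KNN: record the mean test error for every n_neighbors from 1 to 40.
error_rates = []
for k in range(1, 41):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rates.append(np.mean(knn.predict(X_test) != y_test))

# SVC: search over C and gamma (the candidate values are assumptions).
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1, 0.1, 0.01, 0.001]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)
y_pred = grid.predict(X_test)

# Evaluate with the two sklearn.metrics functions named in the procedure.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(grid.best_params_)
print(cm)
print(classification_report(y_test, y_pred, labels=[0, 1], zero_division=0))
```

Plotting `error_rates` against `range(1, 41)` gives the kind of curve used to pick n_neighbors.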
What I think went wrong here is that the test set doesn't contain enough Class 1 samples, as you can see in the confusion matrix:
| n=33 | Pred Class 0 | Pred Class 1 |
|---|---|---|
| Actual Class 0 | 30 | 3 |
| Actual Class 1 | 0 | 0 |
So these models might not be good at predicting true Class 1 cases. Also, the data set is very small.
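To make the point concrete, here is the arithmetic behind the 91% figure, taking the confusion matrix at face value (rows as actual classes, columns as predicted classes). It is a small sketch in plain Python; the always-predict-class-0 baseline is a hypothetical comparison, not one of the models used here.

```python
# Confusion matrix copied from the table: rows = actual, columns = predicted.
cm = [[30, 3],   # actual class 0
      [0, 0]]    # actual class 1

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Recall for class 1 is undefined when the test set has no class 1 samples.
actual_class_1 = sum(cm[1])
recall_class_1 = cm[1][1] / actual_class_1 if actual_class_1 else None

# Hypothetical baseline that always predicts class 0.
baseline_accuracy = sum(cm[0]) / total

print(f"accuracy          = {accuracy:.2%}")
print(f"recall (class 1)  = {recall_class_1}")
print(f"baseline accuracy = {baseline_accuracy:.2%}")
```

With no actual Class 1 rows in the test set, accuracy alone says nothing about how well Class 1 is detected.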