ML-Practice-On-Fertility-Data-Set

Practicing Various Models on Fertility Data Set

Abstract

100 volunteers provided a semen sample, which was analyzed according to the WHO 2010 criteria. Sperm concentration is related to socio-demographic data, environmental factors, health status, and life habits.

Data Set Characteristics: Multivariate
Attribute Characteristics: Real
Area: Life
Number Of Attributes: 10
Number Of Records: 100

To get the data set for yourself click here
To know more about the data click here

Data Set Attribute Information

Season in which the analysis was performed. 1) winter, 2) spring, 3) summer, 4) fall. (-1, -0.33, 0.33, 1)
Age at the time of analysis. 18-36 (0, 1)
Childhood diseases (i.e., chicken pox, measles, mumps, polio) 1) yes, 2) no. (0, 1)
Accident or serious trauma 1) yes, 2) no. (0, 1)
Surgical intervention 1) yes, 2) no. (0, 1)
High fevers in the last year 1) less than three months ago, 2) more than three months ago, 3) no. (-1, 0, 1)
Frequency of alcohol consumption 1) several times a day, 2) every day, 3) several times a week, 4) once a week, 5) hardly ever or never (0, 1)
Smoking habit 1) never, 2) occasional, 3) daily. (-1, 0, 1)
Number of hours spent sitting per day 1-16 (0, 1)
Output: Diagnosis normal (N), altered (O)
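
The values in parentheses above are the normalized codes stored in the data set. As a minimal sketch of how they can be read back, the snippet below decodes the season and age columns. It assumes that age was scaled linearly (min-max) from 18-36 onto 0-1, which the attribute list suggests but does not state; the helper names `decode_season` and `decode_age` are hypothetical.

```python
# Mapping read off the attribute list above: normalized code -> season name.
SEASONS = {-1.0: "winter", -0.33: "spring", 0.33: "summer", 1.0: "fall"}


def decode_season(value: float) -> str:
    """Return the season name for a normalized season code."""
    return SEASONS[round(value, 2)]


def decode_age(value: float, low: int = 18, high: int = 36) -> float:
    """Undo an ASSUMED linear min-max scaling of age onto [0, 1]."""
    return low + value * (high - low)


print(decode_season(-0.33))  # spring
print(decode_age(0.5))       # 27.0
```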

Methodology

Used 4 models for now:

KNN
LogisticRegression
SVC
ANN

Procedure

Data was loaded into Python using the pandas library.
Data was split into training and testing sets using the train_test_split function from sklearn's model_selection module.
For KNN, a loop was used to try n_neighbors values between 1 and 40. The mean error rate was plotted; it shows that every value greater than 1 gives the same lowest error, so the smallest such value, n_neighbors = 2, was chosen.
For SVC, GridSearchCV from sklearn's model_selection module was used to find the best combination of the SVC parameters C and gamma.
The best combination of parameters was found to be 'C': 0.1 and 'gamma': 0.1.
Using this estimator, the SVC model was fit to the training data and then tested.
For Logistic Regression, the model was simply fit to the data and its predictions were tested.
Models were evaluated with sklearn's metrics module. The functions classification_report and confusion_matrix were used to check the accuracy of each model.
For ANN, a simple shallow neural network with one hidden layer was built using the Keras library, with 'accuracy' as the metric.
All the models reach the same accuracy score, about 91%.
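
The scikit-learn steps above can be sketched roughly as follows. This is not the repository's actual code: the data here is a synthetic placeholder generated with make_classification (the real file is not reproduced), test_size=0.33 is inferred from n=33 in the confusion matrix, and the parameter grid, cv=3, and random_state values are assumptions chosen only for illustration. The Keras model is left out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Placeholder data (assumption): 100 rows, 9 features, imbalanced classes.
X, y = make_classification(
    n_samples=100, n_features=9, weights=[0.88], random_state=0
)

# Hold out 33 of the 100 rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# KNN: record the test error rate for each candidate n_neighbors.
error_rates = []
for k in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    error_rates.append(np.mean(knn.predict(X_test) != y_test))

# SVC: search over C and gamma (grid values are hypothetical).
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1, 0.1, 0.01, 0.001]}
grid = GridSearchCV(SVC(), param_grid, cv=3)
grid.fit(X_train, y_train)
print("Best SVC parameters:", grid.best_params_)

# Fit the remaining models and evaluate all three the same way.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=2).fit(X_train, y_train),
    "SVC": grid.best_estimator_,  # already refit on the training data
    "LogisticRegression": LogisticRegression().fit(X_train, y_train),
}
matrices = {}
for name, model in models.items():
    y_pred = model.predict(X_test)
    matrices[name] = confusion_matrix(y_test, y_pred, labels=[0, 1])
    print(name)
    print(matrices[name])
    print(classification_report(y_test, y_pred, zero_division=0))
```

Because the data is synthetic, the printed scores will not match the 91% reported here; the sketch only shows the order of the steps.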

Final Remarks

What I think went wrong is that the test set doesn't contain enough Class 1 data, as you can see in the confusion matrix:

n=33	Pred Class 0	Pred Class 1
Actual Class 0	30	3
Actual Class 1	0	0

So these models might not be that good at predicting true Class 1 cases. Also, the data set is very small.
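
To make the point concrete, here is a small sketch that recomputes the numbers from the confusion matrix above using plain Python. It assumes rows are actual classes and columns are predicted classes, as the table labels indicate.

```python
# Confusion matrix copied from the table above: rows = actual, columns = predicted.
cm = [[30, 3],
      [0, 0]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Recall for Class 1 is TP / (TP + FN); it is undefined when the
# test set contains no Class 1 samples at all.
class1_support = sum(cm[1])
recall_class1 = cm[1][1] / class1_support if class1_support else None

print(f"accuracy = {accuracy:.3f}")
print(f"Class 1 samples in test set = {class1_support}")
print(f"Class 1 recall = {recall_class1}")
```

The accuracy works out to 30/33, roughly 91%, while the Class 1 recall cannot be computed at all. One common way to keep both classes in the test set is to pass stratify=y to train_test_split, though that was not tried here.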
