Remember to git add && git commit && git push each exercise!
We will execute your functions with our own test(s), so please DO NOT PROVIDE ANY TEST(S) in your file.
For each exercise, you will create a folder; in this folder, you will add the files that contain your work. The folder name is provided at the beginning of each exercise under "submit directory",
and the specific file name(s) are provided under "submit file(s)".
| My Open The Iris | |
|---|---|
| Submit directory | . |
| Submit file | my_open_the_iris.ipynb |
Open the iris!
A common mistake businesses make is to assume machine learning is magic, and that it is therefore okay to skip thinking about what it means to do the task well.
Time to do an end-to-end project in data science, which means:
- Loading the dataset.
- Summarizing the dataset.
- Visualizing the dataset.
- Evaluating some algorithms.
- Making some predictions.
A must-see example of data science is the iris dataset.
We will predict which class of iris a plant belongs to based on its characteristics.
Iris versicolor - Iris setosa - Iris virginica
Environment. We will use Jupyter.
In Data Science, the winning combo is pandas (and/or numpy), matplotlib, sklearn (and/or keras). In this project, we will use:
- pandas to load the data
- matplotlib to do the visualization
- sklearn to do the prediction
from pandas import read_csv

url = "URL"
dataset = read_csv(url)
A - Printing dataset dimension
print(dataset.shape)
# should print something like: (150, 5)
B - It is also always a good idea to eyeball your data.
print(dataset.head(20))
C - Statistical Summary. The statistical summary includes the count, mean, min and max values, and some percentiles.
print(dataset.describe())
D - Class Distribution. Group by class to see how our data are distributed.
print(dataset.groupby('class').size())
After having a basic idea about our dataset, we need to extend it with some visualizations.
For this dataset, we will focus on two types of plots:
- Univariate plots to better understand each attribute.
- Multivariate plots to better understand the relationships between attributes.
A - Univariate
from pandas import read_csv
from matplotlib import pyplot

dataset.hist()
pyplot.show()
It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.
B - Multivariate
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot

scatter_matrix(dataset)
pyplot.show()
We can note the diagonal grouping of some pairs of attributes. It suggests a high correlation and a predictable relationship. :-)
It is time to create some data models and estimate their accuracy.
Here is what we are going to cover in this step:
Separate a validation dataset.
from sklearn.model_selection import train_test_split

array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)
Build multiple different models from different algorithms.
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
model = DecisionTreeClassifier()
model = GaussianNB()
model = KNeighborsClassifier()
model = LogisticRegression(solver='liblinear', multi_class='ovr')
model = LinearDiscriminantAnalysis()
model = SVC(gamma='auto')
How to run a model? Use k-fold cross-validation on the training set:
from sklearn.model_selection import StratifiedKFold, cross_val_score

kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)  # 10-fold CV
cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
Improving your data and your model is an iterative process, and you will have to loop through this process repeatedly.
Now it's time to do it!
You will create an end-to-end analysis of the dataset.
Create a function load_dataset(). It doesn't take any parameters; it loads the dataset and returns it.
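Since the CSV URL earlier in this page is only a placeholder, here is a minimal sketch of load_dataset() that uses scikit-learn's bundled copy of the iris data as an offline stand-in; the column names are the ones used in this exercise's example output.

```python
import pandas as pd
from sklearn.datasets import load_iris

def load_dataset():
    # Offline stand-in: build the same DataFrame the CSV would give.
    iris = load_iris()
    dataset = pd.DataFrame(iris.data, columns=['sepal-length', 'sepal-width',
                                               'petal-length', 'petal-width'])
    dataset['class'] = ['Iris-' + name for name in iris.target_names[iris.target]]
    return dataset

dataset = load_dataset()
print(dataset.shape)  # (150, 5)
```

In your notebook, replace the stand-in with read_csv on the URL given for the exercise.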
Summarizing the dataset:
Create a function summarize_dataset(dataset); it will print (in this order):
- its shape
- its first ten lines
- its statistical summary
- its class distribution
Example:
Dataset dimension: (37, 5)
First 10 rows of dataset:
   sepal-length  sepal-width  petal-length  petal-width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
5           5.4          3.9           1.7          0.4  Iris-setosa
6           4.6          3.4           1.4          0.3  Iris-setosa
7           5.0          3.4           1.5          0.2  Iris-setosa
8           4.4          2.9           1.4          0.2  Iris-setosa
9           4.9          3.1           1.5          0.1  Iris-setosa
Statistical summary:
       sepal-length  sepal-width  petal-length  petal-width
count     12.000000    12.000000     12.000000    12.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
Class Distribution:
class
Iris-setosa        12
Iris-versicolor    12
Iris-virginica     13
dtype: int64
Create two functions print_plot_univariate(dataset) and print_plot_multivariate(dataset). Each function will set up and show its corresponding plot.
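A sketch of the two plotting functions, run here on a headless (Agg) backend so it works without a display; the column names are the ones used throughout this exercise.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd
from matplotlib import pyplot
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

def print_plot_univariate(dataset):
    # One histogram per numeric attribute.
    dataset.hist()
    pyplot.show()

def print_plot_multivariate(dataset):
    # Pairwise scatter plots between all numeric attributes.
    scatter_matrix(dataset)
    pyplot.show()

iris = load_iris()
df = pd.DataFrame(iris.data, columns=['sepal-length', 'sepal-width',
                                      'petal-length', 'petal-width'])
print_plot_univariate(df)
print_plot_multivariate(df)
```

In Jupyter you won't need the Agg backend; the plots render inline.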
Create a function my_print_and_test_models(dataset); it will train and evaluate (in this order):
DecisionTree, GaussianNB, KNeighbors, LogisticRegression, LinearDiscriminant, and SVM.
Remember to split your dataset in two: train and validation.
Following this format:
# print('%s: %f (%f)' % (model_name, cv_results.mean(), cv_results.std()))
DecisionTree: 0.927191 (0.043263)
GaussianNB: 0.928858 (0.052113)
KNeighbors: 0.937191 (0.056322)
LogisticRegression: 0.920897 (0.043263)
LinearDiscriminant: 0.923974 (0.040110)
SVM: 0.973972 (0.032083)
Gandalf will not accept any pip install XXXX inside your file.