Remember to git add && git commit && git push each exercise!
We will execute your functions with our own test(s), so please DO NOT PROVIDE ANY TEST(S) in your file.
For each exercise, you will create a folder; in this folder, you will add the files that contain your work. The folder name is provided at the beginning of each exercise under "submit directory",
and the specific file name(s) are provided under "submit file(s)".
| My Open The Iris | |
|---|---|
| Submit directory | . |
| Submit file | my_open_the_iris.ipynb |
Open the iris!
A common mistake businesses make is to assume machine learning is magic, and that it is therefore okay to skip thinking about what it means to do the task well.
Time to do an end-to-end project in data science, which means:
- Loading the dataset.
- Summarizing the dataset.
- Visualizing the dataset.
- Evaluating some algorithms.
- Making some predictions.
A must-see example of data science is the iris dataset.
We will predict which class of iris a plant belongs to based on its characteristics.
Iris versicolor - Iris setosa - Iris virginica
Environment. We will use Jupyter.
In Data Science, the winning combo is pandas (and/or numpy), matplotlib, sklearn (and/or keras). In this project, we will use:
- pandas to load the data
- matplotlib to do the visualization
- sklearn to do the prediction
from pandas import read_csv

url = "URL"
dataset = read_csv(url)
A - Printing dataset dimension
print(dataset.shape)
# should print something like: (150, 5)
B - It is also always a good idea to eyeball your data.
print(dataset.head(20))
C - Statistical Summary. The statistical summary includes the count, mean, min and max values, and some percentiles.
print(dataset.describe())
D - Class Distribution. Group by class to see how our data are distributed.
print(dataset.groupby('class').size())
After having a basic idea about our dataset, we need to extend it with some visualizations.
For this dataset, we will focus on two types of plots:
- Univariate plots to better understand each attribute.
- Multivariate plots to better understand the relationships between attributes.
A - Univariate
from pandas import read_csv
from matplotlib import pyplot

dataset.hist()
pyplot.show()
It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.
B - Multivariate
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot

scatter_matrix(dataset)
pyplot.show()
We can note the diagonal grouping of some pairs of attributes. It suggests a high correlation and a predictable relationship. :-)
It is time to create some data models and estimate their accuracy.
Here is what we are going to cover in this step:
Separate a validation dataset.
from sklearn.model_selection import train_test_split

array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)
Build multiple different models from different algorithms.
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
model = DecisionTreeClassifier()
model = GaussianNB()
model = KNeighborsClassifier()
model = LogisticRegression(solver='liblinear', multi_class='ovr')
model = LinearDiscriminantAnalysis()
model = SVC(gamma='auto')
How to run a model? Use k-fold cross-validation on the training set:
from sklearn.model_selection import StratifiedKFold, cross_val_score

kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)  # 10-fold CV
cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
Improving your data and your model is an iterative process, and you will have to loop through this process repeatedly.
Now it's time to do it!
You will create an end-to-end analysis of the dataset.
Create a function load_dataset(). It doesn't take any parameters; it loads the dataset and returns it.
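Since the CSV URL earlier in this page is only a placeholder, here is a minimal sketch of load_dataset() that uses scikit-learn's bundled copy of the iris data as an offline stand-in; the column names are the ones used in this exercise's example output.

```python
import pandas as pd
from sklearn.datasets import load_iris

def load_dataset():
    # Offline stand-in: build the same DataFrame the CSV would give.
    iris = load_iris()
    dataset = pd.DataFrame(iris.data, columns=['sepal-length', 'sepal-width',
                                               'petal-length', 'petal-width'])
    dataset['class'] = ['Iris-' + name for name in iris.target_names[iris.target]]
    return dataset

dataset = load_dataset()
print(dataset.shape)  # (150, 5)
```

In your notebook, replace the stand-in with read_csv on the URL given for the exercise.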
Summarizing the dataset:
Create a function summarize_dataset(dataset); it will print (in this order):
- its shape
- its first ten lines
- its statistical summary
- its class distribution
Example:
Dataset dimension: (37, 5)
First 10 rows of dataset:
   sepal-length  sepal-width  petal-length  petal-width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
5           5.4          3.9           1.7          0.4  Iris-setosa
6           4.6          3.4           1.4          0.3  Iris-setosa
7           5.0          3.4           1.5          0.2  Iris-setosa
8           4.4          2.9           1.4          0.2  Iris-setosa
9           4.9          3.1           1.5          0.1  Iris-setosa
Statistical summary:
       sepal-length  sepal-width  petal-length  petal-width
count     12.000000    12.000000     12.000000    12.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
Class Distribution:
class
Iris-setosa        12
Iris-versicolor    12
Iris-virginica     13
dtype: int64
Create two functions print_plot_univariate(dataset) and print_plot_multivariate(dataset). Each function will set up and show its corresponding plot.
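A sketch of the two plotting functions, run here on a headless (Agg) backend so it works without a display; the column names are the ones used throughout this exercise.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd
from matplotlib import pyplot
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

def print_plot_univariate(dataset):
    # One histogram per numeric attribute.
    dataset.hist()
    pyplot.show()

def print_plot_multivariate(dataset):
    # Pairwise scatter plots between all numeric attributes.
    scatter_matrix(dataset)
    pyplot.show()

iris = load_iris()
df = pd.DataFrame(iris.data, columns=['sepal-length', 'sepal-width',
                                      'petal-length', 'petal-width'])
print_plot_univariate(df)
print_plot_multivariate(df)
```

In Jupyter you won't need the Agg backend; the plots render inline.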
Create a function my_print_and_test_models(dataset); it will train and evaluate (in this order):
DecisionTree, GaussianNB, KNeighbors, LogisticRegression, LinearDiscriminant, and SVM.
Remember to split your dataset in two: train and validation.
Following this format:
# print('%s: %f (%f)' % (model_name, cv_results.mean(), cv_results.std()))
DecisionTree: 0.927191 (0.043263)
GaussianNB: 0.928858 (0.052113)
KNeighbors: 0.937191 (0.056322)
LogisticRegression: 0.920897 (0.043263)
LinearDiscriminant: 0.923974 (0.040110)
SVM: 0.973972 (0.032083)
Gandalf will not accept any pip install XXXX inside your file.