twhipple / Mushroom_Classifier_Model

Using a classification model on categorical data in order to identify poisonous versus edible mushrooms.

Mushroom Classification

Predicting the class of mushrooms through classification - watch out for those deadly Fly Agarics. Source: 'Walkman200', freeimages.com

Intro

In this repo I use classification models to identify whether mushrooms are edible or poisonous. The data is an old UCI dataset from about 30 years ago, but it is still in good shape today and can be found on Kaggle!

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one.

A very poisonous Destroying Angel mushroom. Source: maarten_kruit, freeimages.com

README Outline

  • Introduction
  • Project Summary
  • Repo Contents
  • Prerequisites
  • Feature and Definitions
  • Results
  • Future Work
  • Built With, Contributors, Authors, Acknowledgments

A fun color-coded bar plot of dataset mushrooms based on color.

Repo Contents

This repo contains the following:

  • README.md - this is where you are now!
  • Jupyter_Notebook.ipynb - the Jupyter Notebook containing the finalized code for this project.
  • PCA_and_Label_Encoding_Notebook.ipynb - a notebook where I tried some different approaches to issues.
  • LICENSE.md - the required license information.
  • mushrooms.csv - the file containing the dataset in csv.
  • CONTRIBUTING.md
  • Images

Libraries & Prerequisites

These are the libraries that I used in this project.

  • numpy as np
  • pandas as pd
  • matplotlib.pyplot as plt
  • %matplotlib inline
  • seaborn as sns
  • folium
  • datetime as dt
  • from sklearn.model_selection import train_test_split
  • from sklearn.model_selection import cross_val_score
  • from sklearn.model_selection import GridSearchCV
  • from sklearn.metrics import accuracy_score
  • from sklearn.metrics import f1_score
  • from sklearn.metrics import confusion_matrix
  • from sklearn.metrics import classification_report
  • from sklearn.linear_model import LogisticRegression
  • from sklearn.ensemble import RandomForestClassifier
  • import xgboost as xgb

Features

  • Attribute Information: (classes: edible=e, poisonous=p)
  • cap-shape: bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s
  • cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
  • cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
  • bruises: bruises=t,no=f
  • odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
  • gill-attachment: attached=a,descending=d,free=f,notched=n
  • gill-spacing: close=c,crowded=w,distant=d
  • gill-size: broad=b,narrow=n
  • gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
  • stalk-shape: enlarging=e,tapering=t
  • stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
  • stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
  • stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
  • stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
  • stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
  • veil-type: partial=p,universal=u
  • veil-color: brown=n,orange=o,white=w,yellow=y
  • ring-number: none=n,one=o,two=t
  • ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z
  • spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
  • population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
  • habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d
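Since every column is stored as a single-letter code, it can help to translate the codes back into readable labels before plotting or exploring. The following is a minimal sketch, not the notebook's actual code: the `CODES` dictionary covers only three columns from the list above, the `decode` helper is a hypothetical name, and the toy rows are invented for illustration (the real rows would come from mushrooms.csv).

```python
import pandas as pd

# Code-to-label lookups copied from the attribute list above (three columns only).
CODES = {
    "class": {"e": "edible", "p": "poisonous"},
    "odor": {"a": "almond", "l": "anise", "c": "creosote", "y": "fishy", "f": "foul",
             "m": "musty", "n": "none", "p": "pungent", "s": "spicy"},
    "stalk-root": {"b": "bulbous", "c": "club", "u": "cup", "e": "equal",
                   "z": "rhizomorphs", "r": "rooted", "?": "missing"},
}

# Toy rows invented for illustration; they are not taken from the dataset.
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["f", "a", "n", "p"],
    "stalk-root": ["b", "c", "?", "e"],
})


def decode(df, codes=CODES):
    """Return a copy of df with single-letter codes replaced by readable labels."""
    out = df.copy()
    for col, mapping in codes.items():
        if col in out.columns:
            out[col] = out[col].map(mapping)
    return out


decoded = decode(toy)
print(decoded)
```

Note that stalk-root uses `?` for missing values, so it may be worth deciding early whether to keep "missing" as its own category or drop the column.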

Mushroom bar plot by class and color.

Models

These are the models that I tried in this project:

  • LogisticRegression
  • Random Forest
  • XGBoost
  • K-Nearest Neighbors (KNN)
  • Support Vector Classifier (SVC)
  • Stochastic Gradient Descent
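As a rough sketch of how several of these models can be compared on one-hot encoded categorical data, the example below fits five of them with default settings and records test accuracy. It is an illustration under stated assumptions, not the notebook's code: the toy data and its labelling rule (foul or pungent odor means poisonous) are invented, the hyperparameters are scikit-learn defaults, and XGBoost is left out to keep the example free of an extra dependency.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Invented toy data: two categorical columns using codes from the feature list.
rng = np.random.default_rng(42)
n = 400
toy = pd.DataFrame({
    "odor": rng.choice(list("afnp"), size=n),
    "habitat": rng.choice(list("gdu"), size=n),
})
# Invented labelling rule for illustration only: foul/pungent -> poisonous.
y = np.where(toy["odor"].isin(["f", "p"]), "p", "e")

# One-hot encode the categorical columns, then hold out a test set.
X = pd.get_dummies(toy, dtype=float)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "SVC": SVC(),
    "SGD": SGDClassifier(random_state=42),
}

# Fit each model and record its accuracy on the held-out rows.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Because the toy label depends on a single column, near-perfect scores here say nothing about the real dataset; the loop is only meant to show the comparison pattern.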

Bar plot looking at different mushroom habitats and their class.

Conclusions

All of my models were above 90% accuracy. The Random Forest Model had the best score with KNN and SVC pretty close.

Originally all of my models were over-fitting and getting too close to 100% accuracy. I tried using PCA components to reduce the number of features but still found that the models needed fewer columns. I tried using Feature Selection as well but found that it wasn't working with all the similar column names from the one hot encoding.
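For reference, reducing one-hot encoded columns with PCA might look like the sketch below. It assumes invented toy data and an arbitrary choice of three components; the notebook's actual column set and component count may differ.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Invented toy data using codes from the feature list.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "odor": rng.choice(list("afnp"), size=n),
    "cap-color": rng.choice(list("nwy"), size=n),
    "habitat": rng.choice(list("gdu"), size=n),
})

# One-hot encoding turns 3 categorical columns into one column per category.
X = pd.get_dummies(toy, dtype=float)

# Project the one-hot columns onto 3 principal components (an arbitrary choice).
pca = PCA(n_components=3, random_state=0)
X_reduced = pca.fit_transform(X)

# Share of the total variance that the kept components retain.
explained = pca.explained_variance_ratio_
print(X.shape, "->", X_reduced.shape)
print("cumulative explained variance:", explained.cumsum())
```

The cumulative explained variance gives a rough sense of how much information is kept, though the components themselves no longer map to a single readable mushroom trait.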

This has been an interesting dataset with a few aspects that really challenged me, in particular the many purely categorical columns and the feature selection. I looked at a few Kaggle examples for this dataset and noticed that many people used Label Encoding rather than One Hot Encoding. Plus, PCA and Feature Selection both added a lot of time to my project, and I needed to review a lot of data science and statistics material.
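To make the difference between the two encodings concrete, here is a small sketch on invented rows. It assumes that "Label Encoding" means applying scikit-learn's `LabelEncoder` to each column in turn, which is how it is commonly done in Kaggle notebooks but may not match every example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented rows using codes from the feature list.
toy = pd.DataFrame({
    "cap-color": ["n", "y", "w", "n"],
    "odor": ["a", "f", "n", "f"],
})

# Label encoding: one integer column per feature (codes follow sorted order).
label_encoded = toy.apply(lambda col: LabelEncoder().fit_transform(col))

# One-hot encoding: one 0/1 column per (feature, category) pair.
one_hot = pd.get_dummies(toy, dtype=int)

print(label_encoded)
print(one_hot.columns.tolist())
```

Label encoding keeps the column count small but implies an ordering between categories (for example, that "w" sits between "n" and "y"), which tree-based models tolerate better than linear ones. One-hot encoding avoids the false ordering at the cost of many more columns.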

Looking at whether mushrooms are edible or not based on odor.

Future Work

I would love to go find some mushrooms around my house and categorize them in order to determine if they are poisonous or not! For now I just need to work on fixing my models. I want to do more work with feature selection, especially using the mushroom identifiers that I am most familiar with (such as size, cap-color, and location). Perhaps my model would fare better (though with lower accuracy) if I took out more of the features.
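One possible way to approach feature selection despite the similar one-hot column names is to sum a model's feature importances back to the original feature. The sketch below is one option under stated assumptions, not something taken from the notebook: the toy data and labelling rule are invented, and it relies on `pd.get_dummies` naming columns as `<feature>_<code>` so that the prefix before the last underscore identifies the source feature.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Invented toy data; the label depends only on odor (illustrative rule).
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "odor": rng.choice(list("afnp"), size=n),
    "cap-color": rng.choice(list("nwy"), size=n),
    "habitat": rng.choice(list("gdu"), size=n),
})
y = toy["odor"].isin(["f", "p"]).astype(int)

# One-hot columns are named "<feature>_<code>", e.g. "cap-color_n".
X = pd.get_dummies(toy, dtype=float)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Sum the per-column importances back to their source feature.
importances = pd.Series(forest.feature_importances_, index=X.columns)
by_feature = (
    importances.groupby(lambda name: name.rsplit("_", 1)[0])
    .sum()
    .sort_values(ascending=False)
)
print(by_feature)
```

Ranking whole features this way makes it easier to decide which original columns to drop before re-encoding, rather than removing individual one-hot columns.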

Not sure I want to go shrooming any time soon after working with this data. Source: mm ramos, freeimages.com

Built With:

  • Jupyter Notebook
  • Python 3.0
  • scikit-learn

Contributing

Please read CONTRIBUTING.md for details

Authors

Thomas Whipple

License

Please read LICENSE.md for details

Acknowledgments

Thanks to Kaggle. The dataset was donated to the UCI Machine Learning Repository on 27 April 1987.
