enyalien / plant-leaves-classification

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Python Plant Leaves Classifier - Problem Description

This python program classifies plant leaves according to several attributes. The goal is to locate the plant leaf in a predefined species, according to the input parameters.

It uses the K-Nearest Neighbors algorithm from python's scikit-learn library. At the end of the execution, the user will get the classification of the leaf as well as an image of a similar leaf, according to the species.

In the actual context, we may find several applications for classification problems using Machine Learning techniques. From medicine applications using image recognition to biological classifications like this example. The wide range of useful scenarios where the use of cutting edge technology is more than justified makes this area of Computer Science one of the promises for the future.

Requisites

You will need to install Python along with the following libraries:

  • Python
  • Pillow
  • Matplotlib
  • Numpy
  • Scikit-learn
  • Scipy

Steps

  • Clone the repository
  • Open a terminal in the repository's location in your computer
  • Run the program with the following command: python main.py

Why K-Nearest Neighbors?

The nature of the problem solved in this context is a classification problem. Hence, with a proper training data, good results can be obtained though the use of the K-Nearest Neighbors algorithm.

The negative aspects in this case are: there will be no learning process and the computing processing needed can be a little high. However, for this real life example, we have all the resources needed.

The data

The data set to be used within this solution has a total of 36 different species of plant leaves classified according to several parameters. In total, there are 340 records of different specimens.

In the following figure, we will take a look of how the data looks like, in the first 5 records.

alt text

In total, there are 16 parameters for each record, which describe both shape and texture for every leaf:

  • A : Class (Species)
  • B: Specimen number
  • C: Eccentricity
  • D: Aspect ratio
  • E: Elongation
  • F: Solidity
  • G: Stochastic Convexity
  • H: Isoperimetric Factor
  • I: Maximal Indentation Depth
  • J: Lobedness
  • K: Average Intensity
  • L: Average Contrast
  • M: Smoothness
  • N: Third moment
  • O: Uniformity
  • P: Entropy

Also, the possible species for classification are:

  • Quercus suber
  • Salix atrocinera
  • Populus nigra
  • Alnus sp.
  • Quercus robur
  • Crataegus monogyna
  • Ilex aquifolium
  • Nerium oleander
  • Betula pubescens
  • Tilia tomentosa
  • Acer palmatum
  • Celtis sp.
  • Corylus avellana
  • Castanea sativa
  • Populus alba
  • Acer negundo
  • Taxus bacatta
  • Papaver sp.
  • Polypolium vulgare
  • Pinus sp.
  • Fraxinius sp.
  • Primula vulgaris
  • Erodium sp.
  • Bougainvillea sp.
  • Arisarum vulgare
  • Euonymus japonicus
  • Ilex perado ssp. azorica
  • Magnolia soulangeana
  • Buxus sempervirens
  • Urtica dioica
  • Podocarpus sp.
  • Acca sellowiana
  • Hydrangea sp.
  • Pseudosasa japonica
  • Magnolia grandiflora
  • Geranium sp.

The main idea is to input a complete vector of characteristics and to get in the output the correspondent classification, as well as a picture of one of the specimens of the species.

Why K value = 20?

Several discussions talk about how to choose a proper k value for the K-Nearest Neighbors algorithm. For this case, an iteration was run to determine the biggest change in the chart. From the range from 1 to 40, here are the results obtained.

alt text

From the first k values, we see a tendency to get a lower accuracy, however, low K values might tend to be disguising for the algorithm because the scope might not be the most adequate. The best "elbow effect" we can get is around k = 20. In this case, although the accuracy is poorer, we have a bigger spectrum of items to be compared with. Of course we are sacrificing more resources for a more accurate output.

K-Nearest Neighbors parameters explanation

Making use of the scikit-learn's K-Nearest Neighbors classifier simplifies a lot of job, however, it needs to be very carefully parameterized.

Neighbors choose algorithm

This is the way the algorithm searches for the neighbors of a specific record. With scikit-learn, we have the ball_tree, kd_tree, brute and auto. They are better or worse depending on the data set naturalness, however, the auto option tries to choose the best approach for the specific case. Hence, we specified algorithm = 'auto'

K value

Previously justified. This is the parameter that specifies the number of neighbors for the input to be compared with.

Distance Metric

This is the way that the algorithm measures the distances with the neighbors. After trying to find the optimal, the BrayCurtis metric was choosen. Given the naturalness of the data, better model accuracies were obtained.

References

  • Dudani, S. A. (1976). The distance-weighted k-nearest-neighbor rule. IEEE Transactions on Systems, Man, and Cybernetics, (4), 325-327.
  • Hu, L., Huang, M., Ke, S. et al. The distance function effect on k-nearest neighbor classification for medical datasets.
  • Enas, G. G., & Choi, S. C. (1986). Choice of the smoothing parameter and efficiency of k-nearest neighbor classification. In Statistical Methods of Discrimination and Classification (pp. 235-244). Pergamon.

About


Languages

Language:Python 100.0%