Project: Ear Recognition

Yifei Feng

Zehui Jiang

Guangxing Ren


The problem of reliably recognizing people by their ears has received considerable attention in recent years. It is an important step in biometric research, especially as a complement to face recognition systems, which have difficulty in real-world conditions. These difficulties stem from the great variation in shape, variable lighting conditions, and the fact that a profile image is only a planar representation of a complex object. We present an ear recognition system that uses a convolutional neural network (CNN) to identify a person from an input image.

Identify your problem

In this project, we are given a set of left-ear images from people with different identities. For each ear (that is, each person), four images are given: two of the ear “in the wild” (with no constraint on how the photo was taken) and two taken with a “donut device” that serves as a background and somewhat controls the lighting. The dataset contains 4 × 195 = 780 images of 195 individual ears.
(Data source:

The term ear biometrics refers to automatic human identification on the basis of the ear physiological (anatomical) features. The identification is performed on the basis of the features which are usually calculated from captured 2D or 3D ear images (using pattern recognition and image processing techniques).

Being able to identify human ears automatically is of practical value. It can be useful in several domains, such as:

  • 1. Criminal identification
  • 2. Infant identification
  • 3. Medical research

    This is a classification problem with 195 categories, each containing 4 sample images. Our goal is, given any ear image from this dataset, to identify which person the ear belongs to and to display all the other ear images that belong to that person.

    This is not a trivial problem; it involves several steps:

  • Choose a suitable method after researching and comparing the possible methods
  • Build a model with the chosen method
  • Load all the data
  • Train a model that can identify all images with reasonably high accuracy, preferably above 0.8

    Background Survey

    Possible Solutions:

  • 1. Neural Network
  • 2. Hidden Markov Model
  • 3. Edge Detection

    Choice of Method

    Our final choice is a neural network, specifically a Convolutional Neural Network (CNN), for the following reasons:
  • An HMM is a generative, probabilistic model: it tries to model the process that generates the training sequences or, more precisely, the distribution over the sequences of observations. In our case, we are trying to categorize hundreds of images, and a CNN, as a discriminative model, is a better fit.
  • Edge detection can be used to extract the shape of the ear, but on its own it does not provide enough feature information to identify and categorize ears.
  • CNNs are widely used in image-related problems such as image classification, object detection and image generation.

    About CNN

    How CNN works:

    CNNs have been widely used for image recognition for a long time, and there are many well-developed network structures and frameworks to build on. This is the main reason why we chose a CNN as our solution. A convolutional neural network mainly consists of convolutional layers, pooling layers and fully connected layers. A convolutional layer slides a set of filters over the image, and each filter picks up a different signal (a local pattern in an area of the picture), for example horizontal, vertical or diagonal structures.
    The aim of each filter is to create a map of where its feature occurs in the image, so a convolutional network performs a sort of search. When a match is found, it is recorded in a feature map at the location of the match. By repeating these steps for every filter, the convolutional layer records features of the image in different directions. The output of a convolutional layer is usually passed through a nonlinear transform such as ReLU, which sets negative values to zero, or tanh, which compresses its input into the range between -1 and 1.
    There are various kinds of pooling layers, such as max pooling and average pooling, all of which downsample the feature maps. A pooling layer processes a map one slice/patch at a time. For example, max pooling keeps the largest value in each slice/patch and discards the rest. This compresses the maps to smaller dimensions while preserving the key features. Finally, the fully connected layer classifies the input by combining all the extracted features using learned weights.
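    The three operations described above can be sketched in a few lines of plain Python. This is only an illustration on a toy 5x5 image, not the Keras implementation; the `vertical_edge` filter and the helper names are made up for the example.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Set negative responses to zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size, stride):
    """Keep only the largest value of each size x size patch."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(feature_map[r * stride + i][c * stride + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy image: dark on the left, bright on the right (one vertical edge).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical 2x2 filter that responds to a dark-to-bright step.
vertical_edge = [[-1, 1], [-1, 1]]

feature_map = relu(conv2d(image, vertical_edge))  # 4x4 map, peak at the edge
pooled = max_pool(feature_map, size=2, stride=2)  # 2x2 summary of the map
print(feature_map[0], pooled)
```

    The feature map is non-zero only in the column where the edge sits, and pooling keeps that response while halving each dimension.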

    Why we choose Keras:

    Keras was chosen as the development framework for this project. Keras is well developed and user-friendly for beginners, and it provides a wide range of APIs. As a result, we can focus on improving our network at a high level rather than on debugging the network structure. Keras is also readable and concise, allowing us to build our first end-to-end deep learning models faster while skipping the implementation details. Keras is built on top of TensorFlow, which is widely used for deep learning.

    Reproducing the baseline:

    We found an existing face recognition project based on Keras, which is similar to our project in some ways. However, there are still big differences between it and our work, so we used its network structure as a reference and built our own convolutional network with 2 convolutional layers, 2 pooling layers and 1 fully connected layer. We also changed the final activation function to softmax because it performs better at categorization.
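    For readers unfamiliar with softmax, the following minimal sketch shows what the final activation does: it turns the raw scores of the last layer into probabilities that sum to 1, and the predicted class is the one with the highest probability. The three scores are invented for illustration; the real network produces 195 of them.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    # Subtracting the maximum avoids overflow without changing the result.
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for 3 classes
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```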

    Initial Implementation

    from skimage.io import imread_collection
    import cv2
    import numpy as np
    from keras.utils import np_utils 
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    # load the dataset from a directory and resize every picture to size x size pixels
    def load_data(path, size):
        # create a collection with the available images
        image = imread_collection(path)
        image_set = []
        for n in image:
            n = cv2.cvtColor(n, cv2.COLOR_RGB2GRAY)  # convert to grayscale
            n = cv2.resize(n, (size, size))
            n = n / 255  # scale pixel values to [0, 1]
            image_set.append(n)  # keep the preprocessed image
        return image_set
    def set_init(dataset, train_set_ratio, valid_set_ratio, test_set_ratio):
        #creating label set for all images
        label = np.empty(195*4)
        for i in range(195):
            label[i*4:i*4+4] = i
        label = label.astype(int)  # integer class indices for to_categorical (dtype assumed)
        label = np_utils.to_categorical(label, 195)#transfer to one-hot matrix
        train_num = 780*train_set_ratio
        train_num = int(train_num)
        valid_num = 780*valid_set_ratio
        valid_num = int(valid_num)
        test_num = 780*test_set_ratio
        test_num = int(test_num) 
        train_data = np.empty((train_num,200,200))  #creating numpy array for different datasets
        train_label = np.empty((train_num,195))   
        valid_data = np.empty((valid_num, 200,200))   
        valid_label = np.empty((valid_num,195))   
        test_data = np.empty((test_num,200,200))  
        test_label = np.empty((test_num,195)) 
        x_test_tot = np.empty((valid_num + test_num,200,200))
        y_test_tot = np.empty((valid_num + test_num,195))
        #split into train set and validation-test set
        train_data, x_test_tot, train_label, y_test_tot = train_test_split(dataset, label, test_size = 1-train_set_ratio)
        #split validation-test set into validation and test set
        valid_data, test_data, valid_label, test_label = train_test_split(x_test_tot, y_test_tot, test_size = test_set_ratio/(valid_set_ratio + test_set_ratio))
        train_data = np.asarray(train_data)
        x_test_tot = np.asarray(x_test_tot)
        valid_data = np.asarray(valid_data)
        test_data = np.asarray(test_data)
        result = [(train_data, train_label), (valid_data, valid_label),(test_data, test_label)]
        return result
    data_set = load_data('original/*.jpg',200)
    data = set_init(data_set, 0.8, 0.1, 0.1)
    from keras.models import Sequential
    from keras.layers import Dense, Activation, Flatten
    from keras.layers import Conv2D, MaxPooling2D,AveragePooling2D
    from PIL import Image
    def train(data, batch_size, epochs, nb_filters, pool_size, kernel_size):
        np.random.seed(1337)  # for reproducibility
        img_rows, img_cols = 200, 200  # width and height of pictures
        nb_classes = 195  # number of classes
        input_shape = (img_rows, img_cols,1)  # dimenstion
        [(X_train, Y_train), (X_valid, Y_valid),(X_test, Y_test)] = data
        # add a channel dimension: Keras expects 4 dimensions (samples, rows, cols, channels)
        X_train = X_train[:,:,:,np.newaxis]
        X_valid = X_valid[:,:,:,np.newaxis]
        X_test = X_test[:,:,:,np.newaxis]
        print('dimension of train set:', X_train.shape,Y_train.shape)
        print('dimension of test set:', X_test.shape,Y_test.shape)
        model = Sequential()
        model.add(Conv2D(6,kernel_size,input_shape=input_shape,strides=1))  # convolution layer 1
        model.add(AveragePooling2D(pool_size,strides=2))  # pooling layer
        model.add(Conv2D(12,kernel_size,strides=1))  # convolution layer 2
        model.add(AveragePooling2D(pool_size,strides=2))  # pooling layer
        model.add(Flatten())  # 1 denmension
        model.add(Dense(nb_classes))  # fully connected layer
        model.add(Activation('softmax'))  # 
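        # compile (the optimizer is an assumption; categorical cross-entropy matches the softmax output and one-hot labels)
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])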
        # fit
        model.fit(X_train, Y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_data=(X_valid, Y_valid))
        # evaluate
        score = model.evaluate(X_test, Y_test, verbose=0)
        print('Test score:', score[0])
        print('Test accuracy:', score[1])
        y_pred = model.predict(X_test)
        y_pred = y_pred.argmax(axis=1)   # check which class is predicted 
        for i in range(len(y_pred)):
        #     oneimg = X_test[i,:,:,0]*256
        #     im = Image.fromarray(oneimg)
            print('%d person is predicted as %dperson'%(Y_test.argmax(axis=1).item(i),y_pred[i]))
    train(data, 100, 25, 64, 4, 10)
    dimension of train set: (624, 200, 200, 1) (624, 195)
    dimension of test set: (78, 200, 200, 1) (78, 195)
    Train on 624 samples, validate on 78 samples
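    To get a feeling for the size of this network, the arithmetic below follows the layer sizes through the model for the call `train(data, 100, 25, 64, 4, 10)`, that is, a kernel size of 10 and a pool size of 4. It assumes the Keras default of no padding ("valid") for both the convolution and pooling layers; the helper name `out_size` is made up for the example.

```python
def out_size(size, window, stride=1):
    """Output width of a convolution or pooling layer without padding."""
    return (size - window) // stride + 1


kernel_size, pool_size, n_classes = 10, 4, 195

size = 200
size = out_size(size, kernel_size)          # convolution 1 -> 191
size = out_size(size, pool_size, stride=2)  # pooling 1     -> 94
size = out_size(size, kernel_size)          # convolution 2 -> 85
size = out_size(size, pool_size, stride=2)  # pooling 2     -> 41

flat = size * size * 12                     # 12 feature maps after convolution 2

# Trainable parameters: weights plus one bias per filter or output unit.
conv1_params = kernel_size * kernel_size * 1 * 6 + 6
conv2_params = kernel_size * kernel_size * 6 * 12 + 12
dense_params = flat * n_classes + n_classes
total_params = conv1_params + conv2_params + dense_params

print(size, flat, total_params)  # 41 20172 3941553
```

    Almost all of the roughly 3.9 million parameters sit in the fully connected layer, while only 624 images are available for training, which is consistent with the overfitting discussed below.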
    1. Crop

    2. Regularization and dropout

    3. K-fold

    4. Augmentation

    5. Final Solution

    1. Image Cropping

    On closer inspection of the images, we found that a large portion of them contain a lot of noise. So we cropped each picture so that what is left contains more of the ear and less noise such as hair, neck and the donut device.


    This improvement is applied in all the parts below.
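    The report does not record how the crop region was chosen. As one possible approach, the sketch below keeps a fixed central portion of each image, assuming the ear is roughly centred; `center_crop` and `keep_fraction` are hypothetical names, and the 8x8 image is a stand-in for a real photo.

```python
def center_crop(image, keep_fraction):
    """Keep the central keep_fraction of the rows and columns of a 2D image."""
    height, width = len(image), len(image[0])
    crop_h = max(1, round(height * keep_fraction))
    crop_w = max(1, round(width * keep_fraction))
    top = (height - crop_h) // 2
    left = (width - crop_w) // 2
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


# Stand-in 8x8 image whose pixel value encodes its position (row * 10 + col).
image = [[r * 10 + c for c in range(8)] for r in range(8)]
cropped = center_crop(image, keep_fraction=0.5)
print(len(cropped), len(cropped[0]), cropped[0][0])
```

    With NumPy arrays, as used in the code above, the same crop is a single slice: `image[top:top + crop_h, left:left + crop_w]`.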

    2. Regularization and dropout

    L2 regularization is the technique we discuss in more detail. Simply put, it adds a penalty term to the objective function that grows with the squared magnitude of the weights. Hence, it pushes the coefficients of many variables towards zero in order to reduce the penalty term.

    Dropout is implemented per-layer in a neural network. It can be used with most types of layers, such as dense fully connected layers, convolutional layers, and recurrent layers such as the long short-term memory network layer. Dropout may be implemented on any or all hidden layers in the network as well as the visible or input layer. It is not used on the output layer.
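    Both techniques are simple to state in code. The sketch below is a plain-Python illustration, not the Keras implementation: `l2_penalty` adds the squared weights to the loss, and `dropout` uses the common "inverted" formulation, which zeroes a random subset of activations during training and rescales the rest. The weights, the data loss and the rates are invented values.

```python
import random


def l2_penalty(weights, strength):
    """Penalty that grows with the squared magnitude of the weights."""
    return strength * sum(w * w for w in weights)


def dropout(activations, rate, rng):
    """Inverted dropout: drop each unit with probability `rate`,
    and scale the survivors so the expected activation is unchanged."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


weights = [0.5, -1.0, 2.0]   # hypothetical layer weights
data_loss = 0.30             # hypothetical classification loss
total_loss = data_loss + l2_penalty(weights, strength=0.01)

rng = random.Random(0)       # fixed seed so the run is repeatable
dropped = dropout([1.0] * 10, rate=0.5, rng=rng)
print(total_loss, dropped)
```

    Large weights raise `total_loss`, so the optimizer is encouraged to keep them small; dropout is only applied during training and is switched off at prediction time.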


    Result for Regularization


    L2 regularization works well for ordinary fully connected neural networks, but it was not effective enough for our CNN. We believe this is because of the nature of CNNs. In an ordinary neural network, every layer is fully connected and back propagation applies the chain rule through the layers in a uniform fashion. In a CNN, the layers are not fully connected until the final fully connected layer, so the effect of the penalty on gradient descent is not as strong.

    Dropout is generally less effective at regularizing convolutional layers. Since convolutional layers have few parameters, they need less regularization to begin with. Furthermore, because of the spatial relationships encoded in feature maps, activations can become highly correlated. This renders dropout ineffective.

    3. K-fold

    Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

    How to Implement:

  • Shuffle the dataset randomly.
  • Split the dataset into k groups
  • For each unique group: Take the group as a hold out or test data set; Take the remaining groups as a training data set; Fit a model on the training set and evaluate it on the test set; Retain the evaluation score and discard the model;
  • Summarize the skill of the model using the sample of model evaluation scores
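    The procedure above can be sketched in plain Python as follows. The function names are hypothetical, and `dummy_score` is only a placeholder for fitting and evaluating a real model on each split.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices, split them into k groups and
    yield (train, test) index lists, holding out each group once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i
                     for idx in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, k, fit_and_score):
    """Evaluate once per fold and summarize the scores by their mean."""
    scores = [fit_and_score(train, test)
              for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores), scores


def dummy_score(train_idx, test_idx):
    # Placeholder: a real version would train a fresh model on train_idx
    # and return its accuracy on test_idx.
    return len(test_idx) / (len(train_idx) + len(test_idx))


mean_score, scores = cross_validate(780, 2, dummy_score)
print(mean_score, scores)
```

    Note that a fresh model is trained for every fold, so only the scores are kept, never the models.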

    For k-fold, we use 2 folds. When splitting the data into training, validation and test sets, we decided to put 2 pictures of each person in training, 1 picture in validation and 1 picture in test, chosen in random order. Otherwise an error is raised, because we have to make sure that every class has at least one picture in the test set.
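    A per-person split like the one described can be sketched as follows. It assumes, as the label construction in `set_init` implies, that the four images of person i sit at positions 4*i to 4*i+3; `split_per_class` is a hypothetical helper, not the code used for the reported results.

```python
import random


def split_per_class(n_classes, images_per_class=4, seed=0):
    """For every person, randomly assign 2 images to training,
    1 to validation and 1 to test."""
    rng = random.Random(seed)
    train, valid, test = [], [], []
    for person in range(n_classes):
        start = person * images_per_class
        idx = list(range(start, start + images_per_class))
        rng.shuffle(idx)
        train.extend(idx[:2])
        valid.append(idx[2])
        test.append(idx[3])
    return train, valid, test


train_idx, valid_idx, test_idx = split_per_class(195)
print(len(train_idx), len(valid_idx), len(test_idx))
```

    Splitting inside each class guarantees that every person appears in all three sets, which a purely random split over the 780 images does not.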

    Result for fold No.1

    Result for fold No.2


    K-fold does not improve our result, because cross-validation mainly reduces the bias within the dataset. In our case, as the number of images is very limited, K-fold does not have much impact.

    4. Augmentation

    The main problem of the baseline is overfitting. As a result, we tried to use data augmentation to provide more training data, using the image generator provided in Keras. We decided to use zoom and rotation. The other parameters were either not effective on our input dataset, or our memory (16 GB) was not enough to support them.
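    To show what zoom and rotation do to an image, here is a small plain-Python sketch. It is not the Keras image generator used in the project, only an illustration of the two transforms with nearest-neighbour sampling; the function name and the angle and zoom ranges are assumptions.

```python
import math
import random


def rotate_zoom(image, angle_deg, zoom, fill=0):
    """Rotate a 2D image about its centre and zoom it (zoom > 1 magnifies),
    looking up each output pixel in the source image."""
    height, width = len(image), len(image[0])
    cy, cx = (height - 1) / 2, (width - 1) / 2
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    out = []
    for r in range(height):
        row = []
        for c in range(width):
            y, x = r - cy, c - cx
            # Inverse transform: where does this output pixel come from?
            src_x = round((cos_a * x + sin_a * y) / zoom + cx)
            src_y = round((-sin_a * x + cos_a * y) / zoom + cy)
            inside = 0 <= src_y < height and 0 <= src_x < width
            row.append(image[src_y][src_x] if inside else fill)
        out.append(row)
    return out


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
unchanged = rotate_zoom(image, angle_deg=0, zoom=1.0)
half_turn = rotate_zoom(image, angle_deg=180, zoom=1.0)

# Several randomly perturbed copies of one image (ranges are hypothetical).
rng = random.Random(0)
augmented = [rotate_zoom(image, rng.uniform(-15, 15), rng.uniform(0.9, 1.1))
             for _ in range(4)]
```

    Each augmented copy keeps the label of the original image, so the training set grows without any new photos being taken.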



    As we can see, there is some improvement in test accuracy. Still, it is far from a good test result.

    5. Final Solution

    Even with all the improvements we implemented, the problem still exists. However, we are rather satisfied that our network works very well on the training set. So, in order to achieve our goal, we came up with a compromise solution: we use the whole dataset as the training set. In this case, all pictures can be identified and categorized. What's more, given a random image from this image set, we are able to display all the other ear images that belong to the same person.
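    Once the network has predicted a person, finding that person's other images is a matter of index arithmetic. The sketch below assumes the dataset ordering implied by the label construction in `set_init`, where person i owns images 4*i to 4*i+3; the function name is hypothetical.

```python
IMAGES_PER_PERSON = 4


def other_images_of_person(predicted_person, query_index):
    """Indices of the predicted person's images, excluding the query image."""
    start = predicted_person * IMAGES_PER_PERSON
    return [i for i in range(start, start + IMAGES_PER_PERSON)
            if i != query_index]


# Example: image 5 is classified as person 1, who owns images 4 to 7.
matches = other_images_of_person(predicted_person=1, query_index=5)
print(matches)  # [4, 6, 7]
```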



    Limited by the time available and by our knowledge and experience in machine learning, and although some of the improvements above can enlarge the training dataset, there is one problem we could not fix: the validation accuracy decreases while the training accuracy increases.

    This is mainly because the dataset is so small that each class has only 4 pictures, and these 4 pictures have to be divided among the training, validation and test sets. This causes very severe overfitting, which shows up in our validation and test results.

    Besides, although Keras is a user-friendly and straightforward framework, it has the drawback that every layer is tightly encapsulated. The verbose mode does not provide enough detail for us to know what features the CNN has learned.

    If more time were allowed, we would have more options to improve our implementation.

    Here are several possible options we considered:

    1. CNN-HMM Hybrid model

    2. Using edge detection to extract the ear from the original image to reduce noise


