Locality-constrained-Linear-Coding-for-Image-Classification-2010-Reprodue

Project description

Our group worked on reproducing Locality-constrained Linear Coding for Image Classification (2010). The project went through three stages and met the targets we set in the proposal. In the first stage (Feb. 15 to Feb. 22), we studied the mathematical and technical background of the proposed algorithm. In the second stage (Feb. 22 to Mar. 7), we implemented the algorithm in Python and focused on reproducing the experimental results of the paper. Finally, in the last week (Mar. 11 to Mar. 18), we made some final adjustments to our code and slightly improved its efficiency.

Motivations

The paper introduced an innovative coding method for image classification called Locality-constrained Linear Coding (LLC). Image classification is perhaps the most important part of digital image analysis, by which a machine could technically “see” the world. Image classification helps us better analyze and organize digital graphic information and extract useful intelligence that could benefit people’s lives. Therefore, any improvement in the accuracy or efficiency of an image classification algorithm is valuable. The traditional SPM approach based on bag-of-features requires a computationally expensive non-linear classifier to achieve good performance. Motivated by the desire for higher efficiency without a large loss of accuracy, the paper’s authors replaced the VQ coding with LLC and successfully improved the classification algorithm.

Algorithm Introduction

A typical SPM image classification algorithm takes three steps. First, we extract graphic information by translating each feature point of an image into a feature vector, called a descriptor, via SIFT or a similar algorithm. Intuitively, descriptors are a digital representation of the image’s graphic features. Some elementary image classification methods compare descriptors directly, which turns out to be inaccurate. In a more advanced SPM algorithm, the descriptor matrix goes through a coding process in which each descriptor is encoded into a code vector. The coding process usually relies on a codebook, which can be thought of as a “book” of general features. We approximate each descriptor with a linear combination of the column vectors of the codebook and use the vector of coefficients (usually weight coefficients) as the code for the descriptor. Intuitively, this process represents an original feature of the image with general features from the codebook and records which general features were used.

There are three main coding methods: VQ, Sparse Coding, and LLC. In VQ, the strict constraints on the norm of the code Ci dictate that only one codebook vector, the one nearest to the descriptor, represents the original feature, which makes the code inaccurate. In Sparse Coding, multiple codebook vectors may be used, and a sparsity regularization term reduces the involvement of irrelevant features and thus guarantees a sparse code. However, because a codebook offers so many possible selections of features, highly correlated input descriptors might end up with codes of low correlation. Compared with these two methods, LLC is more accurate and computationally inexpensive. In LLC, a locality adaptor gives the largest weights to the codebook vectors that lie near the original descriptor, which yields good accuracy and preserves the correlations between input descriptors.
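
For reference, the three coding schemes compared above can be written as optimization problems. The notation below follows our reading of the LLC paper and is not defined elsewhere in this README: x_i is a descriptor, B is the codebook, c_i is the code of x_i, λ and σ are tuning parameters, and d_i is the locality adaptor.

```latex
% VQ: exactly one non-zero, non-negative coefficient equal to 1
\min_{C} \sum_{i=1}^{N} \| x_i - B c_i \|^2
\quad \text{s.t. } \| c_i \|_{\ell^0} = 1,\ \| c_i \|_{\ell^1} = 1,\ c_i \succeq 0

% Sparse Coding: the l1 penalty keeps the code sparse
\min_{C} \sum_{i=1}^{N} \| x_i - B c_i \|^2 + \lambda \| c_i \|_{\ell^1}

% LLC: the locality adaptor d_i penalizes codewords far from x_i
\min_{C} \sum_{i=1}^{N} \| x_i - B c_i \|^2 + \lambda \| d_i \odot c_i \|^2
\quad \text{s.t. } \mathbf{1}^{\top} c_i = 1,
\qquad d_i = \exp\!\left( \frac{\operatorname{dist}(x_i, B)}{\sigma} \right)
```

Here dist(x_i, B) is the vector of Euclidean distances between x_i and every codebook vector, so a distant codeword receives a large penalty and therefore a small weight.
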
Furthermore, if we use only the k nearest codebook vectors to represent the descriptor, the approximation reduces to a simple least-squares problem and the computational complexity decreases significantly. The second step is pooling, in which the information in a code matrix is condensed into a single column vector. Max pooling is used to record the most prominent appearance of each general feature, and length normalization then rescales the vector into the form of a histogram. Once we have this vector for the whole image, SPM generates 20 more such vectors by dividing the image into 4 and 16 sub-regions and repeating the same work on each sub-region. The final step is to concatenate these 21 vectors into one extremely long column vector, which is the final representation of the image. The spatial information is recorded in the concatenation process.
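
The k-nearest-neighbor coding and the pyramid max pooling described above can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions, not the code in LLC_coding_appr.py or LLC_pooling.py: the function names, the row-wise codebook layout (one codeword per row), the regularization constant `beta`, and the use of L2 normalization are all choices made for this example.

```python
import numpy as np

def llc_code_approx(X, B, k=5, beta=1e-4):
    """Approximated LLC: code each descriptor with its k nearest codewords.

    X: (N, D) descriptors, B: (M, D) codebook (assumed row layout).
    Returns an (N, M) code matrix with at most k non-zeros per row.
    """
    N, M = X.shape[0], B.shape[0]
    # Squared Euclidean distance between every descriptor and every codeword.
    d2 = (X ** 2).sum(1)[:, None] - 2 * X @ B.T + (B ** 2).sum(1)[None, :]
    knn = np.argsort(d2, axis=1)[:, :k]
    codes = np.zeros((N, M))
    for i in range(N):
        z = B[knn[i]] - X[i]                  # neighbors shifted to the descriptor
        C = z @ z.T                           # local covariance, shape (k, k)
        C += beta * np.trace(C) * np.eye(k)   # small ridge term for stability
        w = np.linalg.solve(C, np.ones(k))    # least-squares solution
        codes[i, knn[i]] = w / w.sum()        # enforce the sum-to-one constraint
    return codes

def spm_max_pool(codes, xy, width, height, levels=(1, 2, 4)):
    """Max-pool codes over 1x1, 2x2 and 4x4 grids, concatenate, normalize.

    xy: (N, 2) pixel positions (x, y) of the descriptors.
    """
    M = codes.shape[1]
    pooled = []
    for g in levels:
        # Grid cell that each descriptor falls into at this pyramid level.
        cx = np.minimum((xy[:, 0] * g / width).astype(int), g - 1)
        cy = np.minimum((xy[:, 1] * g / height).astype(int), g - 1)
        cell = cy * g + cx
        for b in range(g * g):
            mask = cell == b
            # An empty cell contributes a zero vector.
            pooled.append(codes[mask].max(axis=0) if mask.any() else np.zeros(M))
    f = np.concatenate(pooled)
    return f / (np.linalg.norm(f) + 1e-12)
```

With the three levels above, an image yields 1 + 4 + 16 = 21 pooled vectors, so the final feature has 21 times as many entries as the codebook has codewords.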

The classification process is relatively simple. Since the coding and pooling steps already provide decent accuracy, we can use a linear classifier to pursue higher efficiency in this step. In this project we used a linear SVM. We feed the classifier a training set of 870 labeled images to generate a classification model, and then input the testing set of 2126 images for classification. By comparing the original labels with the labels assigned by the model, we obtain the accuracy of the algorithm.

Reproduced Experimental Results

We reproduced the given MATLAB code in Python with the packages numpy, math, os.path, scipy.io, pickle and sklearn. Just like the MATLAB code, we separated our Python code into three files, where LLC_test_appr.py is the main function, LLC_coding_appr.py is the coding function and LLC_pooling.py is the pooling function. We used the first 29 classes of the Caltech101 dataset, for a total of 2996 images. We chose three levels of spatial block structure: the pyramid uses grids of size one, two and four, generating one, four and sixteen bins respectively. For local coding, we chose five as the number of neighbors. For the linear SVM, we chose ten as the regularization parameter. The number of random tests on the dataset is ten.

We first used the given MATLAB function “extr_sift.m” to extract SIFT descriptors for the 2996 images and stored them in the “data” folder. Then we used the Python function scipy.io.loadmat to load the descriptors and the given codebook. Next, we used LLC_coding and LLC_pooling to compute the pooled feature of each image and stored the features and their labels in pickle files. We then randomly picked 870 images as the training set and used the rest as the testing set. We trained our model on the training set using svm.LinearSVC(). Then we predicted the label of each image in the testing set with the model, calculated the accuracy for each class, and averaged the accuracy over all classes. We used a for-loop, starting from the random separation of training and testing sets, to obtain the average overall accuracy over ten runs. We ran the program with six different numbers of training samples per class, namely 5, 10, 15, 20, 25 and 30, and compared our average overall accuracy with the paper.
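
The evaluation protocol above can be sketched as follows. This is an illustrative outline rather than the contents of LLC_test_appr.py: the helper names (`split_per_class`, `mean_class_accuracy`, `evaluate`) are hypothetical, and we assume the training images are drawn per class (30 per class for 29 classes gives the 870 training images).

```python
import numpy as np
from sklearn.svm import LinearSVC

def split_per_class(labels, n_train, rng):
    """Randomly pick n_train samples of every class for training."""
    train = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        train.extend(rng.choice(idx, size=n_train, replace=False))
    train = np.array(train)
    test = np.setdiff1d(np.arange(len(labels)), train)
    return train, test

def mean_class_accuracy(y_true, y_pred):
    """Average the per-class accuracies so that every class counts equally."""
    classes = np.unique(y_true)
    return float(np.mean([(y_pred[y_true == c] == c).mean() for c in classes]))

def evaluate(features, labels, n_train=30, n_rounds=10, C=10.0, seed=0):
    """Repeat the random split, train a linear SVM, and average the scores."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_rounds):
        tr, te = split_per_class(labels, n_train, rng)
        clf = LinearSVC(C=C).fit(features[tr], labels[tr])
        scores.append(mean_class_accuracy(labels[te], clf.predict(features[te])))
    return float(np.mean(scores))
```

Calling `evaluate(features, labels, n_train=n)` for n in 5, 10, 15, 20, 25 and 30 would produce one averaged accuracy per training-set size, where `features` is the matrix of pooled image features and `labels` the matching class indices.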

Our experimental results are very close to the results from the paper. As the number of training samples increased, our accuracy became closer to the accuracy reported in the paper. With 30 training samples, our accuracy was 71.01% while the paper’s accuracy was 73.44%. We believe that part of the difference between our result and the paper’s result is due to the difference in the size of the codebook. The paper uses a codebook of size 128×2048, while the codebook that was given to us was 128×1024. This means that the paper has more possible elements to pick from to represent an image.

vergil9312 / Locality-constrained-Linear-Coding-for-Image-Classification-2010-Reprodue
