
Exploration of a Methodology for Hyperspectral Band Selection using XGBoost and PCA


Idea of this project

The classification of hyperspectral images involves utilizing multiple features to identify and classify each pixel's content accurately. PCA is a mathematical technique that reduces the dimensionality of hyperspectral images by identifying the most significant patterns of variability.

This paper aims to evaluate the effectiveness of Principal Component Analysis (PCA) based band selection for hyperspectral image classification using XGBoost feature importance scores.

To ensure that PCA uses the best set of bands to explain the variation, bands are proactively removed in the feature selection chain before dimensionality reduction.

The paper details the techniques employed, including XGBoost feature importance scoring. Additionally, a flowchart is included to illustrate the process in its entirety.

The results demonstrate that the algorithm effectively reduces the number of bands while preserving crucial spectral data. This results in a slight decrease in the number of principal components, indicating the algorithm's effectiveness in locating and removing unwanted bands. The performance comparison across all four datasets shows a slight gain in accuracy, with some datasets having a more pronounced improvement. The suggested methodology generates a list of bands to be removed before PCA to improve classification accuracy.

Overall, this method shows promise in improving the accuracy of HS image classification and offers opportunities for future research, such as investigating alternative feature importance analysis techniques and extending the approach to other tree-based machine learning algorithms.

Objectives

1. To investigate the effectiveness of incorporating preemptive band removal into Principal Component Analysis (PCA), with the main focus on optimizing accuracy over the reliability of explained variance.

2. To employ XGBoost to generate scores for each band and to utilize these scores to determine the optimal sequence for band removal before PCA.

3. To comprehensively compare the proposed algorithm with a conventional PCA application, providing a detailed, step-by-step description of the feature selection process for potential future research and analysis.

Project Structure

Methodology_for_Hyperspectral_Band_Selection
  data
    logs
    processed_images
    raw
  notebooks
  utils

Software implementation

All source code used to generate the results and figures in the paper is in the "notebooks" folder.

The data used in this study is provided in the "raw" folder (a folder inside the "data" folder).

Results generated by the code are saved in the "logs" folder (a folder inside the "data" folder).

The images generated using the "" file are saved inside the "processed_images" folder (a folder inside the "data" folder).

Dependencies and libraries imported

You'll need a working Python environment to run the code; the Python version used is 3.8.

The required dependencies are specified in the file requirenments.txt.

  • NumPy: A powerful Python library for numerical computing.
  • SciPy: A collection of mathematical algorithms and functions built on NumPy.
  • XGBClassifier: A gradient boosting machine learning algorithm provided by the XGBoost library.
  • Pandas: A versatile library for data manipulation and analysis.
  • Scikit-learn (sklearn): A comprehensive library of machine learning algorithms and tools for Python.

Reproducing the results

Clone this repository to get the exact same structure: https://github.com/RealXun/Methodology_for_Hyperspectral_Band_Selection.git.

**Make sure all datasets described in the About Datasets section are downloaded and saved in the "raw" folder (a folder inside the "data" folder).**

There are two easy ways to run the code and see the results:

- For Microsoft Windows Systems:

Use Visual Studio Code and run the Jupyter notebook (recommended). The notebook is called "PCA_band_removal.ipynb" and is located inside the "notebooks" folder. It is divided into cells (some contain text, others code), and each cell can be executed with Shift + Enter.

Executing a text cell does nothing; executing a code cell runs the code and produces its output.

To execute the whole notebook, run all cells in order.

When you execute the whole notebook, it will first run the Pavia University dataset, then the Pavia Centre dataset, then the Salinas dataset, and finally the Indian Pines dataset.

Alternatively, you can run just the first six cells (which import the libraries, load the datasets, and define the main function) and then run any one of the datasets independently.

- For Linux Systems:

Run the Python file named "hsi_pca(Linux_Only).py". It shows a short menu where you can choose which dataset to use.

About Datasets

A. Salinas Dataset: The Salinas dataset was collected by the 224-band AVIRIS sensor over the Salinas Valley in California. It has a high spatial resolution of 3.7-meter pixels. The scene, 512 lines by 217 samples, depicts vegetables, bare soils, and vineyard fields. The dataset has 224 bands and 16 ground-object categories.

Salinas (26.3 MB) | Salinas groundtruth (4.2 KB)

B. Indian Pines Dataset: This AVIRIS dataset was captured over the Indian Pines region in northwestern Indiana, USA, and is available from the same website as the Salinas dataset. The image is 145 × 145 pixels with a spatial resolution of 20 m and comprises 220 bands and 16 ground-object categories.

Indian Pines (6.0 MB) | Indian Pines groundtruth (1.1 KB)

C. Pavia Centre Dataset: This dataset was captured by the Reflective Optics System Imaging Spectrometer (ROSIS-3) over the city of Pavia, northern Italy, and is also available on the same website as the Salinas dataset. The image is 1096 × 1096 pixels with a spatial resolution of 1.3 m and 102 bands in the range 0.43–0.86 μm. The ground truth comprises nine classes.

Pavia Centre (123.6 MB) | Pavia Centre groundtruth (34.1 KB)

D. Pavia University Dataset: This dataset was captured by the same sensor as the Pavia Centre dataset and is also available on the same website as the Salinas dataset. The image is 610 × 340 pixels with a spatial resolution of 1.3 m and 103 bands in the range 0.43–0.86 μm. The ground truth comprises nine classes.

Pavia University (33.2 MB) | Pavia University groundtruth (10.7 KB)

These datasets are from the official website of the Computational Intelligence Group of the University of the Basque Country (UPV/EHU).
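The scenes are distributed as MATLAB .mat files, which can be read with scipy.io. Below is a minimal, self-contained sketch of the loading pattern; a tiny fake cube is written first so the snippet runs without any downloads, and the key names ("scene", "scene_gt") are placeholders for the per-dataset keys in the real files.

```python
# Self-contained sketch of the .mat layout used by the scenes above:
# a (rows, cols, bands) cube plus a (rows, cols) ground-truth map.
# A toy cube is written first so the example runs without downloads;
# with the real data you would loadmat the file from data/raw instead.
import numpy as np
from scipy.io import loadmat, savemat

savemat("toy_scene.mat", {"scene": np.zeros((4, 5, 3)),
                          "scene_gt": np.ones((4, 5), dtype=np.uint8)})

mat = loadmat("toy_scene.mat")
cube = mat["scene"]       # (rows, cols, bands)
gt = mat["scene_gt"]      # (rows, cols) class labels

# Flatten to (pixels, bands) for the per-pixel classifiers.
X = cube.reshape(-1, cube.shape[-1])
y = gt.reshape(-1)
print(X.shape, y.shape)
```
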

Results

  • The total number of bands for each dataset after removal is compared with the original number of bands in the following table.

| Dataset | Nº of bands before removal | Nº of bands after removal |
|---|---|---|
| Salinas | 224 | 182 |
| Indian Pines | 220 | 200 |
| Pavia Centre | 102 | 81 |
| Pavia University | 103 | 62 |
  • Comparison between the number of principal components obtained without the algorithm and the number obtained with the proposed methodology.

| Dataset | Nº of principal components | Nº of principal components with proposed methodology |
|---|---|---|
| Salinas | 6 | 6 |
| Indian Pines | 69 | 64 |
| Pavia Centre | 14 | 9 |
| Pavia University | 16 | 11 |
  • Accuracy obtained without the algorithm compared with the accuracy obtained using it.

| Dataset | Accuracy without algorithm | Accuracy with algorithm |
|---|---|---|
| Salinas | 0.889543 | 0.895174 |
| Indian Pines | 0.720710 | 0.729007 |
| Pavia Centre | 0.892378 | 0.898099 |
| Pavia University | 0.837373 | 0.856461 |
  • The index of each removed band for all four datasets.

| Dataset | Removed bands |
|---|---|
| Salinas | [147, 146, 221, 157, 102, 220, 36, 170, 38, 5, 134, 185, 154, 145, 107, 0, 125, 222, 211, 223, 204, 44, 175, 68, 162, 156, 177, 133, 158, 111, 63, 174, 219, 215, 150, 106, 164, 193, 109, 148, 206, 149] |
| Indian Pines | [192, 159, 12, 180, 103, 108, 122, 75, 165, 127, 109, 71, 124, 89, 30, 182, 67, 63, 173, 163] |
| Pavia Centre | [1, 0, 2, 3, 8, 6, 4, 89, 92, 5, 38, 57, 42, 87, 81, 78, 9, 80, 75, 88, 26] |
| Pavia University | [1, 0, 2, 3, 8, 4, 29, 52, 9, 49, 5, 57, 6, 43, 38, 73, 14, 26, 11, 102, 55, 22, 93, 94, 45, 98, 100, 80, 89, 46, 19, 18, 79, 51, 56, 50, 37, 59, 95, 36, 16] |
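These index lists can be applied directly with NumPy before running PCA. A minimal sketch using the Indian Pines list reported above (the zero-filled array is just a stand-in for the flattened scene):

```python
# Apply a removed-band list before PCA: drop the listed columns from the
# (pixels, bands) matrix. The index list is the Indian Pines one above;
# the zero array is a toy stand-in for the real flattened scene.
import numpy as np

removed = [192, 159, 12, 180, 103, 108, 122, 75, 165, 127,
           109, 71, 124, 89, 30, 182, 67, 63, 173, 163]

X = np.zeros((145 * 145, 220))        # 145 x 145 pixels, 220 bands
X_kept = np.delete(X, removed, axis=1)
print(X_kept.shape)                   # 220 - 20 = 200 bands remain
```
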

Conclusion and future works

This study proposes a band selection approach for HS image classification that combines XGBoost feature importance analysis with PCA to improve accuracy while reducing the computational cost. By removing the least important bands and applying PCA to the remaining ones, the proposed methodology is able to retain critical information while also reducing the dimensionality of the dataset. This approach can have a significant impact in various fields where HS imaging is commonly used, such as agriculture and medical imaging. It essentially helps to improve classification during the previous feature selection and dimensionality reduction phases, providing a flat increase in accuracy before hyperparameter tuning and model training, which in turn can lead to improved decision-making and overall better results.

The use of XGBoost with the histogram algorithm is an important step in the methodology, as it reduces the computational cost by approximating the gradients using histograms. This is significant because the process trains multiple models, so any reduction in training cost has a large effect.

Throughout this research work, numerous challenges have been addressed. During experimentation, it has been necessary to determine an appropriate amount of variance to retain in the PCA transformation. Since HS images often contain a large number of highly correlated spectral bands, it is important to retain enough variance to capture the important information. This has been solved by observing different variance thresholds and selecting the one that produces a high accuracy while still requiring a manageable number of components. Finding the absolute optimal variance threshold is not critical for the experimentation since the aim is to use the algorithm to improve accuracy relative to applying PCA and XGBoost in a conventional way with the same parameters. Another challenge has been reproducibility and coherence through iterations since multiple XGBoost models and splits are used during the process. This has been solved by using a fixed seed that remains constant across the algorithm loops.

The proposed methodology offers several future opportunities for research. One possibility is to explore alternative feature importance analysis techniques to identify the most important bands before the removal process. Another avenue for research is to investigate the impact of different variance thresholds. The list of removed bands is also worthy of further research since looking at the reasons why removing them increases accuracy can help understand the underlying difficulties that those bands cause to PCA during dimensionality reduction, which ultimately leads to a reduction in accuracy. Additionally, this approach could be extended to other machine learning algorithms that are tree-based, namely decision trees or random forests.

References used

  • A. Martín-Pérez, M. Villa, G. Vazquez, J. Sancho, G. Rosa, P. Sutradhar, M. Chavarrías, A. Lagares, E. Juarez, and C. Sanz, “Hyperparameter optimization for brain tumor classification with hyperspectral images,” in 2022 25th Euromicro Conference on Digital System Design (DSD), pp. 835–842, 2022.
  • P. Ghamisi, J. Plaza, Y. Chen, J. Li, and A. J. Plaza, “Advanced spectral classifiers for hyperspectral images: A review,” IEEE Geoscience and Remote Sensing Magazine, vol. 5, no. 1, pp. 8–32, 2017.
  • S. Sawant, M. Prabukumar, and S. Samiappan, “Ranking and grouping based feature selection for hyperspectral image classification,” 10 2018.
  • E. Sarhrouni, A. Hammouch, and D. Aboutajdine, “Band selection and classification of hyperspectral images by minimizing normalized mutual information,” 2nd International Conference on Innovative Computing Technology, INTECH 2012, pp. 184–189, 09 2012.
  • B. Martinez-Vega, R. Leon, H. Fabelo, S. Ortega, J. Piñeiro, A. Szolna, M. Hernández, C. Espino, A. J-O'Shanahan, D. Carrera, S. Bisshopp, C. Sosa, M. Marquez, R. Camacho Galán, M. Plaza, J. Morera, and G. Marrero Callico, “Most relevant spectral bands identification for brain cancer detection using hyperspectral imaging,” Sensors (Basel, Switzerland), vol. 19, 12 2019.
  • S. Bajwa, P. Bajcsy, P. Groves, and L. Tian, “Hyperspectral image data mining for band selection in agricultural applications,” Transactions of the ASAE. American Society of Agricultural Engineers, vol. 47, pp. 895–907, 01 2004.
  • K. Zhao, D. Valle, S. Popescu, X. Zhang, and B. Mallick, “Hyperspectral remote sensing of plant biochemistry using Bayesian model averaging with variable and band selection,” Remote Sensing of Environment, vol. 132, pp. 102–119, 2013.
  • Q. Dai, J.-H. Cheng, D.-W. Sun, and X.-A. Zeng, “Advances in feature selection methods for hyperspectral image processing in food industry applications: A review,” Critical reviews in food science and nutrition, vol. 55, 04 2014.
  • E. Burke and G. Kendall, Search Methodologies—Introductory Tutorials in Optimization and Decision Support Techniques, pp. 97–125. Springer US, 01 2005.
  • J. Kennedy and R. Eberhart, “Particle swarm optimization,” in Proceedings of ICNN’95 - International Conference on Neural Networks, vol. 4, pp. 1942–1948 vol.4, 1995.
  • S. Sharma, K. M. Buddhiraju, and G. K. Dashondhi, “Hyperspectral image classification using ant colony optimization algorithm based on joint spectral-spatial parameters,” in 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 3210–3213, 2017.
  • B. Guo, S. Gunn, R. Damper, and J. Nelson, “Band selection for hyperspectral image classification using mutual information,” Geoscience and Remote Sensing Letters, IEEE, vol. 3, pp. 522–526, 11 2006.
  • H. Gao, C. Li, H. Zhou, J. Hong, and L. Chen, “Band selection method of hyperspectral image for classification based on particle swarm optimization,” Journal of Computational and Theoretical Nanoscience, vol. 13, pp. 8823–8828, 11 2016.
  • X. Ren, H. Guo, S. Li, S. Wang, and J. Li, “A novel image classification method with cnn-xgboost model,” in Digital Forensics and Watermarking (C. Kraetzer, Y.-Q. Shi, J. Dittmann, and H. J. Kim, eds.), (Cham), pp. 378–390, Springer International Publishing, 2017.
  • Z. Huiting, J. Yuan, and L. Chen, “Short-term load forecasting using emd-lstm neural networks with a xgboost algorithm for feature importance evaluation,” Energies, vol. 10, p. 1168, 08 2017.
  • T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” CoRR, vol. abs/1603.02754, 2016.
  • S. S. Dhaliwal, A. A. Nahid, and R. Abbas, “Effective intrusion detection system using xgboost,” Inf., vol. 9, p. 149, 2018.
  • G. Licciardi, P. R. Marpu, J. Chanussot, and J. A. Benediktsson, “Linear versus nonlinear pca for the classification of hyperspectral data based on the extended morphological profiles,” IEEE Geoscience and Remote Sensing Letters, vol. 9, no. 3, pp. 447–451, 2012.
  • X. Kang, X. Xiang, S. Li, and J. A. Benediktsson, “Pca-based edgepreserving features for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 12, pp. 7140–7151, 2017.
  • M. Graña, M. Veganzons, and B. Ayerdi, “Hyperspectral remote sensing scenes.” https://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes. Accessed: May 8, 2023.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • XGBoost Developers, “XGBoost Documentation.” https://xgboost.readthedocs.io/en/stable/. Accessed: May 8, 2023.

License

This project is available under the MIT License.
