PCA & Imagecube before train test split cause data leak.

Question

YiwanChen opened this issue 5 years ago

Hi, thank you for sharing the code. It helps a lot.
I applied the code to my own hyperspectral image, and the test accuracy is very good, almost 100%. However, the model doesn't predict well on other, similar images. One possible cause is that PCA is fitted on the whole image, so different images produce different PCA loadings. However, when I tried training the model without PCA, i.e. feeding in all spectral bands, the test accuracy was below 40% (I haven't figured out why). Another thing that compromises the model's integrity is creating the image cubes before the train/test split. That mixes training and testing pixels together: with a random pixel-wise split, any 25x25 image cube will almost certainly contain both training and testing spectra.
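To put a number on the overlap concern, here is a minimal sketch that counts how many test pixels have at least one training pixel inside their window. It assumes a purely random pixel-wise split on a synthetic grid; the function name `leaked_fraction` and all sizes are hypothetical and not taken from the repository.

```python
import numpy as np

def leaked_fraction(height, width, window, test_ratio, seed=0):
    """Fraction of test pixels whose window x window neighbourhood
    contains at least one training pixel (random pixel-wise split)."""
    assert window % 2 == 1, "window is assumed to be odd"
    rng = np.random.default_rng(seed)
    is_train = rng.random((height, width)) >= test_ratio  # True = training pixel
    half = window // 2

    # Zero-pad so that windows at the border only count in-image pixels.
    padded = np.pad(is_train.astype(np.int64), half)
    # Integral image (summed-area table) with a leading row/column of zeros.
    integral = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    # Number of training pixels inside the window centred on each pixel.
    counts = (integral[window:, window:] - integral[:-window, window:]
              - integral[window:, :-window] + integral[:-window, :-window])

    # A test centre is never a training pixel itself, so any positive count
    # means training spectra sit inside that test cube.
    return float((counts[~is_train] > 0).mean())

if __name__ == "__main__":
    print(leaked_fraction(height=64, width=64, window=25, test_ratio=0.7))
```

With a 25x25 window and 30% training pixels, essentially every test cube overlaps training data, which would be consistent with a near-perfect test score that does not carry over to other images.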

Gopal Krishna · Answer 1 · Fri Dec 20 2019 17:14:16 GMT+0800 (China Standard Time)

Hey, thanks for the feedback and for reaching out.

Can you specify what kind of other images you used?
Also, yes, the authors are aware that the proposed model fails to converge without PCA. This is attributed to the fact that hyperspectral images tend to contain a lot of redundant data across their large number of spectral bands. Approaches other than PCA have been explored and shown to work in other papers, but for our proposed model, which is fairly simple in terms of size and complexity, PCA is one of the approaches that helps the model deal with this redundancy: it transforms the hyperspectral volume and keeps only the top-k most informative components, which are then sent to the network. I hope this helps to clarify your query.
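As an illustration of the reduction described above, here is a rough NumPy sketch in which the projection is fitted on training pixels only and then reused unchanged on a full cube or on another image. This is not the repository's preprocessing code; `fit_pca`, `apply_pca`, the synthetic data, and the choice of `k` are assumptions made for the example.

```python
import numpy as np

def fit_pca(train_spectra, k):
    """Fit PCA on training spectra of shape (n_pixels, n_bands).
    Returns the band-wise mean and the top-k principal axes, shape (k, n_bands)."""
    mean = train_spectra.mean(axis=0)
    # Rows of vt are principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(train_spectra - mean, full_matrices=False)
    return mean, vt[:k]

def apply_pca(cube, mean, components):
    """Project a cube of shape (H, W, n_bands) onto previously fitted axes,
    giving shape (H, W, k). Nothing is re-fitted here."""
    h, w, bands = cube.shape
    flat = cube.reshape(-1, bands)
    return ((flat - mean) @ components.T).reshape(h, w, -1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cube = rng.normal(size=(40, 50, 100))           # synthetic stand-in for an image
    train_mask = np.zeros((40, 50), dtype=bool)
    train_mask[:, :25] = True                       # hypothetical training region
    mean, comps = fit_pca(cube[train_mask], k=15)   # fit on training pixels only
    reduced = apply_pca(cube, mean, comps)          # same loadings for every pixel
    print(reduced.shape)
```

Reusing the same `mean` and `components` on a second image keeps the loadings fixed, at the cost of assuming both images share similar spectral statistics.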

About the data leak: creating the image cubes before or after the train-test split wouldn't make a difference, as the cubes will in the end be derived from the same original image volume. The only way to alleviate this is to sample cubes from discrete pixels in the volume such that there is no data leak and no large class imbalance. The authors followed the sampling methods used in previous works, but you are absolutely right to point out that this sampling method is prone to data leak. Alternative sampling methods have been proposed recently to deal with this issue.
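One possible way to sample cubes without overlap, sketched here under assumptions, is to assign whole spatial blocks to either the training or the test set and keep only cube centres whose window stays inside a single block. This is not one of the published sampling methods referred to above; the function name, block size, and ratio are hypothetical.

```python
import numpy as np

def block_split_centres(height, width, block, window, test_ratio, seed=0):
    """Assign non-overlapping block x block tiles to train or test and return
    the cube centres (row, col) whose window lies fully inside one tile."""
    half = window // 2
    assert window % 2 == 1 and block >= window, "block must hold a full window"
    rng = np.random.default_rng(seed)
    n_by, n_bx = height // block, width // block

    # Pick a fixed number of test tiles so the split ratio is reproducible.
    order = rng.permutation(n_by * n_bx)
    n_test = int(round(test_ratio * order.size))
    test_blocks = set(order[:n_test].tolist())

    train, test = [], []
    for by in range(n_by):
        for bx in range(n_bx):
            # Centres whose full window stays inside this tile.
            rows = range(by * block + half, (by + 1) * block - half)
            cols = range(bx * block + half, (bx + 1) * block - half)
            target = test if (by * n_bx + bx) in test_blocks else train
            target.extend((r, c) for r in rows for c in cols)
    return train, test

if __name__ == "__main__":
    train, test = block_split_centres(128, 128, block=32, window=25, test_ratio=0.25)
    print(len(train), len(test))
```

Because tiles are assigned as a whole, the class distribution of each side should still be checked against the label map, since a rare class may fall entirely into one tile.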

Yiwei · Answer 2 · Sat Dec 21 2019 08:20:58 GMT+0800 (China Standard Time)

Thank you for your reply.
I use the model on hyperspectral images of grass, with the camera scanning different types of grass from a distance of about 1 meter.
Can you advise which approaches other than PCA could be applied to hyperspectral images?
I will try, for example, using one half of the image for training and the other half for testing when building the image cubes.
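A minimal sketch of that half-and-half idea, assuming a vertical split down the middle and a guard band of half a window on each side so that no training cube reaches into the test half. The helper names and image sizes are made up for illustration.

```python
import numpy as np

def half_split_centres(height, width, window):
    """Return (train_centres, test_centres) as arrays of (row, col).
    Training windows stay left of the middle column, test windows right of it."""
    assert window % 2 == 1, "window is assumed to be odd"
    half = window // 2
    mid = width // 2
    rows = np.arange(half, height - half)
    train_cols = np.arange(half, mid - half)          # windows end at column mid - 1
    test_cols = np.arange(mid + half, width - half)   # windows start at column mid

    def grid(cols):
        return np.stack(np.meshgrid(rows, cols, indexing="ij"), axis=-1).reshape(-1, 2)

    return grid(train_cols), grid(test_cols)

def extract_cube(image, row, col, window):
    """Cut a window x window x bands cube centred on (row, col)."""
    half = window // 2
    return image[row - half:row + half + 1, col - half:col + half + 1, :]

if __name__ == "__main__":
    image = np.zeros((100, 120, 15))                  # synthetic (H, W, bands) volume
    train_c, test_c = half_split_centres(100, 120, window=25)
    cube = extract_cube(image, *train_c[0], window=25)
    print(len(train_c), len(test_c), cube.shape)
```

The guard band discards centres near the dividing line, so fewer cubes are available, but every remaining test cube is built only from pixels the model never saw during training.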
I like the concept of this model, and I will look into other models for comparisons.
Thanks.