HAM10K

HAM10K Skin Cancer Image Classification

Literature Review

Dermatologist-level classification of skin cancer with deep neural networks
Poster - A Benchmark for Automatic Visual Classification of Clinical Skin Disease Images
Paper - A Benchmark for Automatic Visual Classification of Clinical Skin Disease Images

Objective

Top-3 accuracy exceeding 70%, with each result returned in less than 3 seconds.
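To make the accuracy target concrete, here is a minimal sketch of how top-3 accuracy can be computed from per-class scores. The function name `top_k_accuracy` and the toy scores are illustrative assumptions, not taken from the notebooks.

```python
def top_k_accuracy(scores, labels, k=3):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores in this row.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy scores for 4 images over 7 classes (illustrative values only).
scores = [
    [0.50, 0.20, 0.10, 0.05, 0.05, 0.05, 0.05],
    [0.10, 0.10, 0.40, 0.30, 0.05, 0.03, 0.02],
    [0.05, 0.05, 0.05, 0.05, 0.10, 0.30, 0.40],
    [0.30, 0.25, 0.20, 0.10, 0.05, 0.05, 0.05],
]
labels = [1, 3, 0, 6]  # true class index per image

top3 = top_k_accuracy(scores, labels, k=3)
top1 = top_k_accuracy(scores, labels, k=1)
```

On this toy data the top-3 score is higher than the top-1 score, which is why a 70% top-3 target is easier to meet than a 70% top-1 target.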

Challenge

Unbalanced dataset
Small data volume

Method

Pretrained model (transfer learning), then fine-tuning on our own dataset.
Data augmentation to mitigate the class imbalance and the small amount of data.
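One way to plan the augmentation is to work out how many extra images each class needs to match the largest class. The sketch below assumes that target; the helper name `augmentation_plan` and the class counts are hypothetical, not the real HAM10K distribution.

```python
import math


def augmentation_plan(class_counts, target=None):
    """Return, per class, how many augmented images are needed to reach `target`."""
    if target is None:
        # Assumption: balance every class up to the size of the largest one.
        target = max(class_counts.values())
    plan = {}
    for name, count in class_counts.items():
        missing = max(target - count, 0)
        plan[name] = {
            "extra_images": missing,
            # Augmented variants needed per original image, rounded up.
            "variants_per_image": math.ceil(missing / count) if count else 0,
        }
    return plan


# Hypothetical class counts, for illustration only.
counts = {"majority": 600, "medium": 150, "minority": 40}
plan = augmentation_plan(counts)
```

A high `variants_per_image` value signals that a class will consist mostly of near-copies of a few originals, which limits how much variety augmentation can add.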

Model Selection

Many architectures were available at the outset, such as YOLOv3, SSD, and Mask R-CNN. However, this challenge is a pure multi-class classification task, so object detection machinery is unnecessary: bounding box regression and IoU computation add significant computation time, and the HAM10K dataset does not provide bounding boxes anyway. Based on the solutions in the literature review and the size of the dataset, classical networks such as VGG19 and ResNet50 are shortlisted as potential selections.
Candidates, estimated by matching model complexity to data volume:
1. GoogLeNet Inception v3 is used in literature review 1 with 129,450 clinical images (299 X 299). That dataset is more than 10 times larger than HAM10K, so Inception v3 is the upper limit in model selection.
2. VGG16 is used in literature review 2 and 3. Based on the benchmark comparison, however, it is very large without an obvious advantage.
3. MobileNet is taken as the lower limit in model selection.
4. To double-check the model choice, I may also use DenseNet121 or NASNetMobile for verification.
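The bounding argument above can be sketched by ranking candidates on parameter count as a rough complexity proxy. The figures below are approximate values in millions as I recall them from the Keras Applications documentation. Treat them as assumptions and verify them against your installed version.

```python
# Approximate parameter counts in millions (assumed values, not measured here).
PARAMS_M = {
    "MobileNet": 4.3,
    "NASNetMobile": 5.3,
    "DenseNet121": 8.1,
    "InceptionV3": 23.9,
    "ResNet50": 25.6,
    "VGG16": 138.4,
    "VGG19": 143.7,
}

LOWER, UPPER = "MobileNet", "InceptionV3"  # bounds chosen in the text above


def within_bounds(name):
    """True if the model's parameter count lies between the two bounds."""
    return PARAMS_M[LOWER] <= PARAMS_M[name] <= PARAMS_M[UPPER]


# Candidates inside the bounds, ordered from smallest to largest.
shortlist = sorted((n for n in PARAMS_M if within_bounds(n)), key=PARAMS_M.get)
```

By this proxy alone ResNet50 lands just above the Inception v3 bound, so parameter count should be read as a rough guide rather than a hard rule.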

Nice to have features

t-SNE visualization of the last hidden layer's features.
Cross-validation.
Saliency maps.

Data Processing

Seven Classes
10,000 dermatoscopic images (600 X 450)

Dataset

The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions

Experiment 1 and Result Analysis

Gist

My first experiment starts with MobileNet. The code is in Experiment1_[tf_mobilenet_model].ipynb and the saved model is model.h5.

The basic ideas are as follows:

Finding the repeated images in the given dataset and carefully splitting the train and test sets so that no duplicated images end up in the test set.
Using data augmentation to reduce the impact of class imbalance.
Stratified split.
Removing the last 5 layers and adding one new dense layer for seven-category classification.
Freezing all layers except the last 23, which are retrained.
Callbacks: checkpoint, reduce_lr, earlyStopping.
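The duplicate-aware, stratified split from the list above can be sketched in plain Python. This assumes each image carries a lesion identifier that marks repeated shots of the same lesion. The function name, the tuple layout, and the toy records are hypothetical and not taken from the notebook.

```python
import random
from collections import Counter, defaultdict


def split_without_leakage(records, val_fraction=0.2, seed=0):
    """records: iterable of (image_id, lesion_id, label) tuples.

    Returns (train_ids, val_ids). Only lesions with exactly one image may
    enter the validation set, so no lesion appears on both sides.
    """
    records = list(records)
    images_per_lesion = Counter(lesion for _, lesion, _ in records)

    # Stratify: collect validation candidates separately for each class.
    candidates = defaultdict(list)
    for image_id, lesion, label in records:
        if images_per_lesion[lesion] == 1:
            candidates[label].append(image_id)

    rng = random.Random(seed)
    val_ids = set()
    for label in sorted(candidates):
        ids = sorted(candidates[label])
        rng.shuffle(ids)
        # Take the same fraction from every class.
        n_val = round(len(ids) * val_fraction)
        val_ids.update(ids[:n_val])

    train_ids = [image_id for image_id, _, _ in records if image_id not in val_ids]
    return train_ids, sorted(val_ids)


# Toy records: 10 "nv" and 5 "mel" lesions, two of which have a second image.
records = [(f"img{i}", f"les{i}", "nv") for i in range(10)]
records += [(f"img{i}", f"les{i}", "mel") for i in range(10, 15)]
records += [("img15", "les0", "nv"), ("img16", "les10", "mel")]

train_ids, val_ids = split_without_leakage(records, val_fraction=0.2)
```

Augmentation would then be applied to the training side only, so that augmented copies of an image never reach the validation set.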

Outcome

After 30 epochs, the training accuracy is around 90%, while the validation accuracy is around 74%. From the plotted graph, I noticed that the training loss decreases smoothly, but the validation loss fluctuates a lot, and the same pattern appears in accuracy. Possible reasons are an inappropriately large learning rate and a small batch size. At this point, however, I intend to focus on searching for better models instead of hyperparameter tuning.
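The reduce_lr and earlyStopping callbacks both react to a noisy validation loss through a patience counter. The sketch below is a simplified pure-Python imitation of that logic, not the Keras implementation. The function name, default values, and loss sequence are illustrative assumptions.

```python
def run_schedule(val_losses, lr=1e-3, factor=0.5, lr_patience=2, stop_patience=4):
    """Simulate learning-rate reduction and early stopping on a loss sequence.

    Returns the best loss seen and a list of (epoch, learning_rate) pairs.
    """
    best = float("inf")
    since_best = 0  # epochs elapsed without a new best validation loss
    history = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            # Reduce the learning rate after every `lr_patience` stale epochs.
            if since_best % lr_patience == 0:
                lr *= factor
        history.append((epoch, lr))
        # Stop once the loss has not improved for `stop_patience` epochs.
        if since_best >= stop_patience:
            break
    return best, history


# A fluctuating validation loss, similar in shape to the behaviour described above.
val_losses = [1.0, 0.8, 0.9, 0.85, 0.7, 0.75, 0.8, 0.9, 0.95, 0.6]
best, history = run_schedule(val_losses)
```

In this run training stops at epoch 8, before the later improvement to 0.6 is reached, which illustrates how a short patience interacts badly with a fluctuating validation curve.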

Experiment 2 and Result Analysis

Gist

My second experiment also uses MobileNet. The code is in Experiment2_[tf_mobilenet_model].ipynb and the saved model is tf_mobilenet_model.h5. The motivation is to run this model again and observe the change in performance.

Outcome

After 43 epochs, the training accuracy is decent at around 88%, and the validation accuracy is around 73%. So the first trial seems relatively successful. But there are some issues:

Given the gap between training and validation accuracy, overfitting is still observed.
Since this model is resumed from a checkpoint, very little improvement is gained; training may have reached a plateau.
Some class pairs, such as nv & bcc, nv & df, and nv & mel, are relatively difficult to differentiate.
Even with the boost from data augmentation, the minority classes still perform poorly, which may be explained by a lack of variety and by model memorization.
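Hard-to-separate class pairs like those listed above can be found by counting misclassifications per pair of labels. The following is a minimal sketch with made-up predictions. The function name `most_confused_pairs` is hypothetical.

```python
from collections import Counter


def most_confused_pairs(y_true, y_pred, top=3):
    """Count misclassifications per unordered class pair, most frequent first."""
    pairs = Counter()
    for t, p in zip(y_true, y_pred):
        if t != p:
            # Sort so that (nv, mel) and (mel, nv) count as the same pair.
            pairs[tuple(sorted((t, p)))] += 1
    return pairs.most_common(top)


# Made-up labels and predictions, for illustration only.
y_true = ["nv", "nv", "mel", "bcc", "df", "mel", "nv", "bcc"]
y_pred = ["nv", "mel", "nv", "nv", "nv", "mel", "nv", "bcc"]

confused = most_confused_pairs(y_true, y_pred)
```

Treating pairs as unordered hides the direction of the error. A full confusion matrix would show whether minority classes are mostly absorbed into nv or the reverse.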

Experiment 3 and Result Analysis

Based on the effort spent on experiments 1 and 2, in experiment 3 I intend to find low-cost solutions for trying many models quickly.

So I recalled a project I did before, based on the official TensorFlow example on image classifier retraining. With this scaffold, I can try each model from TensorFlow Hub very easily.

First, I tried MobileNet (full scale); the result is very close to experiments 1 and 2.

To double-check, I tried MobileNet at 75% scale; the result decreased accordingly, as expected.

At this point, I want to give Inception v3 a try.

Experiment 4 and Result Analysis

The outcome is not ideal. A possible reason is the augmented dataset: the newly added images for the minority classes may affect overall performance.

On the positive side, with the help of cached bottleneck features I can test a model in a short time (4000 training steps).
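The speed-up from bottlenecks comes from computing the frozen network's features once per image and reusing them at every training step. The sketch below illustrates only that caching idea. The class name and `fake_extractor` are hypothetical stand-ins, not the retraining script's API.

```python
class BottleneckCache:
    """Cache the output of a frozen feature extractor, keyed by image path."""

    def __init__(self, extractor):
        self.extractor = extractor
        self.cache = {}
        self.extractor_calls = 0

    def features(self, image_path):
        # Run the expensive extractor only the first time a path is seen.
        if image_path not in self.cache:
            self.extractor_calls += 1
            self.cache[image_path] = self.extractor(image_path)
        return self.cache[image_path]


def fake_extractor(image_path):
    # Hypothetical stand-in for the frozen network's penultimate-layer output.
    return [float(len(image_path)), 1.0]


cache = BottleneckCache(fake_extractor)
paths = ["a.jpg", "bb.jpg", "ccc.jpg"]
for _ in range(100):  # 100 training steps over the same three images
    batch = [cache.features(p) for p in paths]
```

Only the final classification layer then needs training on the cached vectors, which is what makes it cheap to swap in another model and repeat the experiment.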

Experiment 5 and Result Analysis

After a pause, I searched for other solutions, especially on the model selection part. Deep Learning Notes: Skin Cancer Classification using DenseNets and ResNets gives a good comparison. The validation accuracy reported there for ResNet50 is similar to what I got in experiments 1 and 2 and to the results in literature review 2 and 3, which is consistent with the benchmark above showing that ResNet and MobileNet accuracy are very close.

The author also points out that a better result is obtained with DenseNet, which matches my initial analysis, since the complexity of DenseNet lies between Inception v3 and MobileNet. At this time, however, I need to switch to other important tasks on VLAD and FV.

Experiment 6 and Result Analysis

The same author as in Experiment 5 shows that the best result (95%) is achieved with the help of AutoML. I take this as the upper limit, and I will definitely try it in the near future.

Improvement

Using a GPU-accelerated library to reduce the time spent on data augmentation.
Enriching the dataset with external data.
Assigning larger weights in the loss function to minority classes.
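For the loss-weighting idea, one common heuristic is inverse-frequency weighting, where each class weight equals n_samples / (n_classes * class_count). The sketch below assumes that heuristic. The function name and class counts are illustrative and not the real HAM10K distribution.

```python
def balanced_class_weights(class_counts):
    """Inverse-frequency weights: n_samples / (n_classes * count) per class."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        name: total / (n_classes * count)
        for name, count in class_counts.items()
    }


# Hypothetical class counts, for illustration only.
counts = {"nv": 800, "mel": 150, "df": 50}
weights = balanced_class_weights(counts)
```

Rare classes receive weights above 1 and the dominant class a weight below 1. A dictionary like this, keyed by class index instead of name, is the usual form in which training APIs accept per-class weights.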
