CarND3-P2-FCN-Semantic-Segmentation

Using Fully Convolutional Networks for semantic segmentation of driving scenes

Semantic Segmentation using Fully Convolutional Networks


In this project, we demonstrate one of the approaches to semantic scene understanding in the problem domain of self-driving car perception.

At the moment there are two state-of-the-art approaches, both using convolutional neural networks. One detects bounding boxes around objects of interest (as YOLO and SSD do) and classifies them. The other is semantic segmentation using a Fully Convolutional Network (FCN), where each pixel of an image is classified into one of the relevant classes such as 'driveable road space', 'other vehicles', 'pedestrians', 'buildings' etc.

The bounding box detector approach is faster, but its answers are less useful. For example: how would you draw a bounding box around driveable road space? The FCN approach is slower, but it yields fairly precise regions of interest in the segmented image, which can be used directly in the perception/planning pipelines of an autonomous vehicle.
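To make "each pixel is classified" concrete, here is a minimal NumPy sketch. The score array is random stand-in data (an assumption for illustration, not output of the network in this repo); the point is only that the segmentation is the per-pixel argmax over class scores.

```python
import numpy as np

# Hypothetical class scores for a tiny 2x3 image and 4 classes,
# laid out as (height, width, classes). Real scores would come from the FCN.
rng = np.random.default_rng(0)
scores = rng.normal(size=(2, 3, 4))

# Semantic segmentation assigns one class per pixel:
# pick the best-scoring class along the last axis.
label_map = np.argmax(scores, axis=-1)
```

The result has the same height and width as the input and holds one class index per pixel, which is why regions such as road space need no bounding box.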

In this work we implement the FCN approach described in the paper by Shelhamer, Long and Darrell. Their code can be found here

Data Set: Cityscapes

Good labeled datasets are vital to the success of supervised learning tasks. For the task at hand we chose the Cityscapes dataset, which provides detailed labeled examples of road scene images from 50 German cities, across all seasons, in daytime only and in moderate to good weather conditions. It has fine ground truth labels for 35 classes of objects in the scene that are relevant to autonomous vehicle perception.

The data must be downloaded first. In this work we use the gtFine_trainvaltest.zip (241MB) and leftImg8bit_trainvaltest.zip (11GB) files. We used the provided cityscapesscripts code to pre-process the data. In particular, we changed helpers/labels.py to use all labelled classes (cityscape_labels.py in this repo is the same version of that code), and we ran preparation/createTrainIdLabelImgs.py to generate ground truth images for the updated labels.
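The idea of mapping raw label ids to training ids can be sketched with a lookup table. The mapping below is hypothetical (the real table lives in helpers/labels.py), and createTrainIdLabelImgs.py may build its images differently; this only illustrates the remapping step.

```python
import numpy as np

# Hypothetical id -> trainId pairs, for illustration only.
id_to_train_id = {0: 255, 7: 0, 8: 1, 26: 2}

# 256-entry lookup table; ids that are not listed map to 255 ("ignore").
lut = np.full(256, 255, dtype=np.uint8)
for label_id, train_id in id_to_train_id.items():
    lut[label_id] = train_id

# A tiny stand-in for a *_labelIds.png image.
label_ids = np.array([[7, 7, 8], [26, 0, 3]], dtype=np.uint8)

# Integer-array indexing remaps every pixel in one step.
train_ids = lut[label_ids]
```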

Our final cityscapes data tree looks like this:

├── README.md
├── cityscapesscripts
│   ├── __init__.py
│   ├── annotation
│   │   ├── cityscapesLabelTool.py
│   ├── evaluation
│   │   ├── __init__.py
│   ├── helpers
│   │   ├── __init__.py
│   │   └── labels.py
│   ├── preparation
│   │   ├── __init__.py
│   │   ├── createTrainIdLabelImgs.py
│   │   └── json2labelImg.py
│   └── viewer
│           └── zoom.png
├── data
│   ├── README
│   ├── gtFine
│   │   ├── test
│   │   │   ├── berlin
│   │   │   │   ├── berlin_000000_000019_gtFine_color.png
│   │   │   │   ├── berlin_000000_000019_gtFine_instanceIds.png
│   │   │   │   ├── berlin_000000_000019_gtFine_labelIds.png
│   │   │   │   ├── berlin_000000_000019_gtFine_labelTrainIds.png
│   │           ├── munster_000173_000019_gtFine_color.png
│   │           ├── munster_000173_000019_gtFine_instanceIds.png
│   │           ├── munster_000173_000019_gtFine_labelIds.png
│   │           └── munster_000173_000019_gtFine_polygons.json
│   ├── leftImg8bit
│   │   ├── test
│   │   │   ├── berlin
│   │   │   │   ├── berlin_000000_000019_leftImg8bit.png
│   │           └── munster_000173_000019_leftImg8bit.png
│   └── license.txt

We use 2975 labeled images for training:

$ find cityscapes/data/gtFine/train -type f -name '*gtFine*labelTrainIds.png'  | wc -l
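On systems without find and wc, the same count can be obtained from Python. This is a small sketch using only the standard library; count_label_images is a helper name made up for this example.

```python
from pathlib import Path

def count_label_images(root):
    """Count files matching *gtFine*labelTrainIds.png anywhere under root."""
    pattern = "*gtFine*labelTrainIds.png"
    return sum(1 for path in Path(root).rglob(pattern) if path.is_file())

if __name__ == "__main__":
    # Same folder as in the find command above.
    print(count_label_images("cityscapes/data/gtFine/train"))
```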

Here is an example of an original image and its label image (because the labels use pixel intensities from 0 to 34 out of the 0..255 range, you need to look hard to see them):

original image

labels image
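To make such a label image easier to inspect, the class ids can be stretched over the full intensity range. A minimal sketch, assuming the label image is already loaded as a NumPy array of class ids (stretch_labels is a hypothetical helper, not part of this repo):

```python
import numpy as np

def stretch_labels(label_img, num_classes=35):
    """Map class ids 0..num_classes-1 onto 0..255 for viewing only."""
    scaled = label_img.astype(np.float32) * (255.0 / (num_classes - 1))
    # Clip so that out-of-range ids (e.g. an ignore value) do not wrap around.
    return np.clip(np.round(scaled), 0, 255).astype(np.uint8)

example = np.array([[0, 17, 34]], dtype=np.uint8)
stretched = stretch_labels(example)
```

The stretched image is for visual checks only; training still needs the original ids.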

It is assumed that the cityscapes folder sits next to this repo in the local filesystem and has a structure similar to the one shown above.


We use Python 3 (Anaconda distribution).

The provided requirements.txt lists the packages used.

The implementation is in pure TensorFlow. We recommend building TensorFlow from source to fully utilise your hardware capabilities. In this work we used TensorFlow 1.3.

Implementation Notes

fcn8vgg16.py defines the network architecture (as per the paper above). It uses the VGG16 architecture for the encoder part of the network. We initialise the encoder with pre-trained VGG16 weights provided by Udacity before training. The download happens automatically the first time you run training.
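The FCN-8 decoder described in the paper fuses class score maps taken at strides 32, 16 and 8 of the encoder. The following shape-level sketch uses NumPy with nearest-neighbour upsampling as a stand-in for the learned transposed convolutions. The function names and zero-filled inputs are assumptions for illustration, and fcn8vgg16.py may differ in detail.

```python
import numpy as np

def upsample(x, factor):
    """Nearest-neighbour upsampling; a stand-in for a learned transposed convolution."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def fcn8_decoder(score3, score4, score7):
    """Fuse per-class score maps at strides 8, 16 and 32 into a full-resolution map."""
    fused16 = upsample(score7, 2) + score4  # stride 32 -> 16, add the stride-16 skip
    fused8 = upsample(fused16, 2) + score3  # stride 16 -> 8, add the stride-8 skip
    return upsample(fused8, 8)              # stride 8 -> 1 (input resolution)

height, width, classes = 256, 512, 35
score3 = np.zeros((height // 8, width // 8, classes))
score4 = np.zeros((height // 16, width // 16, classes))
score7 = np.zeros((height // 32, width // 32, classes))
output = fcn8_decoder(score3, score4, score7)
```

The two skip connections are what let the coarse stride-32 prediction recover finer spatial detail before the final 8x upsampling.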

main.py is the driver script. It takes most of the inputs from command line arguments.

How to Run

If run without arguments main.py lists the possible options:

$ python main.py
usage: main.py [-h] [-g GPU] [-gm GPU_MEM] [-x {1,2}] [-ep EPOCHS]
               [-bs BATCH_SIZE] [-lr LEARNING_RATE] [-kp KEEP_PROB]
               [-rd RUNS_DIR] [-cd CKPT_DIR] [-sd SUMMARY_DIR] [-md MODEL_DIR]
               [-fd FROZEN_MODEL_DIR] [-od OPTIMISED_MODEL_DIR]
               [-ip IMAGES_PATHS] [-lp LABELS_PATHS] [-vi VIDEO_FILE_IN]
               [-vo VIDEO_FILE_OUT]

--gpu=1 enables use of GPU for training/inference (0 is for CPU-only run)

--xla=level enables use of XLA

--epochs=10 sets the number of training epochs

--batch_size=5 sets the training mini-batch size. For FCNs every pixel of the image is classified, so empirically the batch size should be relatively small. We experimented with batch sizes between 5 and 10. Also be mindful of the size of the network: you may need 8 GB or more of GPU memory to run the training.

--learning_rate=0.0001 sets the training learning rate
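The batch size also determines how many batches make up one epoch. As a quick check, assuming every one of the 2975 training images is used once per epoch (batches_per_epoch is a name made up for this sketch):

```python
import math

def batches_per_epoch(num_images, batch_size):
    """Number of mini-batches needed to see every image once."""
    return math.ceil(num_images / batch_size)

steps = batches_per_epoch(2975, 5)
```

With a batch size of 5 this gives 595, which matches the total shown by the progress bar in the sample training output.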

The provided nn_xxxxx.sh scripts demonstrate how to call main.py for all possible actions, which we detail below.


To train the network (this includes downloading the pre-trained VGG16 weights), run:

python main.py train --gpu=1 --xla=2 -ep=10 -bs=10 -lr=0.00001

Here is an example of its output:

2017-09-27 09:11:22.365928: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-09-27 09:11:22.366200: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:01:00.0
Total memory: 10.91GiB
Free memory: 10.40GiB
2017-09-27 09:11:22.366213: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-09-27 09:11:22.366217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y
2017-09-27 09:11:22.366223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0)
2017-09-27 09:11:22.413727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0)
2017-09-27 09:11:22.414399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0)
INFO:tensorflow:Restoring parameters from b'pretrained_vgg/vgg/variables/variables'
2017-09-27 09:11:29.055226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0)
INFO:tensorflow:Restoring parameters from ckpt/fcn8vgg16-26810
Train Epoch  1/10 (loss 0.195):   1%|          | 7/595 [00:13<18:20,  1.87s/batches]


INFO:tensorflow:No assets to save.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: b'runs/20170927_091136/model/saved_model.pb'
TensorFlow Version: 1.3.0
Default GPU Device: /gpu:0
restored from checkpoint ckpt/fcn8vgg16-26810
continuing training after 26810 steps done previously
saving trained model to runs/20170927_091136/model

The checkpoints are saved to --ckpt_dir, which defaults to ckpt.

The summaries are saved to --summaries_dir, which defaults to summaries. You can follow training visually by starting TensorBoard:

$ tensorboard --logdir summaries --port 8080

If you then open the TensorBoard address in your web browser, you will see the graph visualisation and training statistics like the following:

training loss/IoU visualisation

Here we trained end-to-end (including the VGG16 encoder) for 90 epochs in 6 runs. The first run was 25 epochs with a learning rate of 0.0001. The second run used the same learning rate, which was too high, as the volatility in convergence shows. For the remaining runs we used a learning rate of 0.00001. The optimizer minimises cross-entropy loss.

We also measured the mean Intersection over Union (IoU) metric. Averaged across all 35 classes, it is about 40%. This average is not weighted, however, and the classes in the Cityscapes dataset are not balanced. For example, there are far fewer traffic sign pixels than road surface or sky pixels. One way to improve accuracy (and training convergence) is to weight both the loss (we use standard mean cross-entropy loss) and the IoU with weights inversely proportional to how often each class occurs. We achieved a mean loss of about 0.2 (for 35 classes). Convergence stops or becomes really slow after that.
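The difference between the unweighted mean IoU and an inverse-frequency weighted one can be sketched in NumPy. This is an illustration on a tiny made-up label map, not the metric code used in this repo, and all function names are hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    idx = y_true.ravel().astype(np.int64) * num_classes + y_pred.ravel().astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def per_class_iou(cm):
    """IoU = TP / (TP + FP + FN); NaN for classes absent from both maps."""
    tp = np.diag(cm).astype(np.float64)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(union > 0, tp / union, np.nan)

def inverse_frequency_weights(y_true, num_classes):
    """Weights proportional to 1 / pixel count per class, normalised to sum to 1."""
    counts = np.bincount(y_true.ravel(), minlength=num_classes).astype(np.float64)
    weights = np.zeros(num_classes)
    present = counts > 0
    weights[present] = 1.0 / counts[present]
    return weights / weights.sum()

# Class 0 dominates the ground truth, classes 1 and 2 are rare.
y_true = np.array([[0, 0, 0], [0, 1, 2]])
y_pred = np.array([[0, 0, 1], [1, 1, 2]])

iou = per_class_iou(confusion_matrix(y_true, y_pred, 3))
weights = inverse_frequency_weights(y_true, 3)
mean_iou = np.nanmean(iou)               # every class counts equally
weighted_iou = np.nansum(weights * iou)  # rare classes count more
```

The same weights could, under the same assumption, scale the per-pixel cross-entropy terms so that rare classes contribute more to the loss.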

TensorBoard also lets you see the input images alongside the visualised class predictions (here we rescale pixel intensities to 0..255 so it is easier to see what is going on):

training input/output visualisation

We can inspect the calculation graph in Tensorboard:

FCN calculation graph

Freezing Variables

We can use the trained network saved in runs/*/model as it is, or we can apply a few optimisations for subsequent inference.

The first optimisation we can apply after training is freezing the network weights by converting Variable nodes to constants:

python main.py freeze --ckpt_dir=ckpt --frozen_model_dir=frozen_model

In our case we have 1861 ops in the input graph and 306 ops in the frozen graph. In total, 38 variables are converted and all nodes related to training are pruned. The saved network size falls from 568 MB to 293 MB.

Optimizing for Inference

We can further optimise the resulting graph using TensorFlow tools. One such transformation is weight quantisation. We run this (and some other transformations) as follows:

python main.py optimise --frozen_model_dir=frozen_model --optimised_model_dir=optimised_model

This increases the number of operations to 369 (to convert between quantised and normal quantities) but decreases the network size to 73 MB.
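The size reduction is consistent with storing each 32-bit float weight as an 8-bit code (293 MB / 4 is roughly 73 MB). The sketch below shows simple min/max linear quantisation in NumPy. It illustrates the general idea under that assumption and is not the exact scheme used by the TensorFlow tools.

```python
import numpy as np

def quantize(weights):
    """Map float weights linearly onto 8-bit codes between their min and max."""
    lo, hi = float(weights.min()), float(weights.max())
    scale = (hi - lo) / 255.0 or 1.0  # avoid division by zero for constant tensors
    codes = np.round((weights - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Recover approximate float weights from the 8-bit codes."""
    return codes.astype(np.float32) * scale + lo

# Random stand-in for one weight tensor of the network.
rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64)).astype(np.float32)

codes, lo, scale = quantize(weights)
restored = dequantize(codes, lo, scale)
```

The extra operations in the optimised graph correspond to the dequantisation step: codes have to be turned back into floats before the usual arithmetic.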

Inference on Images

We can run predictions on test Cityscapes images as follows:

python main.py predict --gpu=1 --xla=2 --model_dir=optimised_model

By default it runs on ../cityscapes/data/leftImg8bit/test/*/*.png and saves the results in a new folder under runs.

At the end of the output you will see:

Predicting on test images ../cityscapes/data/leftImg8bit/test/*/*.png to: runs/20170927_164210
Predicting (last tf call 53 ms, avg tf 51 ms, last img 145 ms, avg 144 ms): 100%|████████████████████| 1525/1525 [08:36<00:00,  2.97images/s]

'last tf call' and 'avg tf' are the times, in milliseconds, to execute session.run and obtain the predictions. 'last img' and 'avg' are the times, in milliseconds, to superimpose the segmentation results on the original image. Loading and saving the images adds further overhead, which is not measured.

Here is an example of an input test image and the resulting segmented output:

original image

results of segmentation superimposed on original image

We see that it correctly identifies pedestrians, road, traffic lights, road signs, bicycles etc.
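Superimposing a segmentation on the original image amounts to colouring each pixel by its class and alpha-blending the result with the photo. A minimal NumPy sketch follows. The palette, the alpha value and the overlay helper are assumptions for illustration, not the code used in main.py.

```python
import numpy as np

def overlay(image, label_map, palette, alpha=0.5):
    """Blend per-class colours over an RGB image.

    image:     (H, W, 3) uint8 RGB image
    label_map: (H, W) integer class ids
    palette:   (num_classes, 3) uint8 colour per class
    """
    colours = palette[label_map]  # (H, W, 3) colour image from class ids
    blended = (1.0 - alpha) * image.astype(np.float32) + alpha * colours.astype(np.float32)
    return np.clip(np.round(blended), 0, 255).astype(np.uint8)

# Tiny made-up example: black image, two classes.
image = np.zeros((2, 2, 3), dtype=np.uint8)
label_map = np.array([[0, 1], [1, 0]])
palette = np.array([[0, 0, 0], [200, 0, 0]], dtype=np.uint8)
result = overlay(image, label_map, palette)
```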

Inference on Video

Finally, we test the results on video sequences. With an image size of 512x256 we achieve about 5 frames per second on GPU.

python main.py video --gpu=1 --xla=2 --model_dir=optimised_model --video_file_in=stuttgart02.mp4 --video_file_out=stuttgart02_segmented.mp4

will show output like this:

Running on video stuttgart02.mp4, output to: stuttgart02_segmented.mp4
[MoviePy] >>>> Building video stuttgart02_segmented.mp4
[MoviePy] Writing video stuttgart02_segmented.mp4
 65%|█████████████████████████████████████████████████████████████████▎                                   | 777/1201 [02:34<01:24,  5.04it/s

This is the result on a video that is part of Cityscapes, i.e. the network was trained on pictures of the same resolution from a similar environment. It works great!

Video result on Cityscapes video

And here is the result on a video from a completely different environment: a highway driving scenario in California. It is also a full-sized video with a much higher resolution than the images we trained our network on. We see that it does not work as well as in the previous example. The processing time here is 1.2 seconds per frame, i.e. much slower.

Video result on random video

Ways to Improve

  • Look at ways to avoid map_fn for image normalisation in the TensorFlow graph. It breaks the optimised graph if we want to use the remove_nodes(op=Identity, op=CheckNumerics) and quantize_nodes optimisations.
  • Use loss and IoU weighted in inverse proportion to the number of examples per class.
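One possible direction for the first item, assuming "image normalisation" means per-image standardisation (zero mean, unit variance per image), is to compute the statistics for the whole batch at once with reductions over the image axes rather than mapping a function over images. The sketch below uses NumPy only to illustrate the arithmetic; standardize_batch is a hypothetical helper and the graph code in this repo may normalise differently.

```python
import numpy as np

def standardize_batch(images):
    """Standardise each image of a (batch, height, width, channels) array at once."""
    x = images.astype(np.float32)
    # Reduce over height, width and channels; keepdims allows broadcasting back.
    mean = x.mean(axis=(1, 2, 3), keepdims=True)
    std = x.std(axis=(1, 2, 3), keepdims=True)
    # Lower bound on std guards against division by zero for uniform images.
    min_std = 1.0 / np.sqrt(float(np.prod(x.shape[1:])))
    return (x - mean) / np.maximum(std, min_std)

rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(2, 4, 4, 3)).astype(np.uint8)
normalised = standardize_batch(batch)
```

Because no per-image loop is involved, the equivalent graph would consist of ordinary reduction and broadcast ops only.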




