This repository contains the implementation of a U-Net model for semantic segmentation, specifically designed for detecting rocks in images. The project includes data loading, model training, evaluation, and inference steps.
## Table of Contents

- Introduction
- Dataset
- Model Architecture
- Dependencies
- Training the Model
- Running Inference
- Results
- Usage
- References

## Introduction

This project implements a U-Net model for semantic segmentation. The primary goal is to detect rocks in images, classifying each pixel as either part of a rock or background. The U-Net architecture is well suited to this task because its encoder-decoder structure captures both local and global features.
## Dataset

The dataset consists of images and their corresponding masks. Each mask is a binary image in which a pixel value of 1 marks a rock pixel and 0 marks background.

- Images directory: contains the RGB images.
- Masks directory: contains the binary masks.

The dataset is split into training and validation sets to evaluate the model's performance.
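Assuming image/mask pairs share filenames across the two directories (the directory layout and naming scheme here are assumptions, not necessarily the repository's), a minimal PyTorch `Dataset` for such pairs might look like this sketch:

```python
import os

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


class RockDataset(Dataset):
    """Pairs each RGB image with its binary rock mask by filename.

    Assumes images_dir and masks_dir contain files with matching names.
    """

    def __init__(self, images_dir, masks_dir):
        self.images_dir = images_dir
        self.masks_dir = masks_dir
        self.names = sorted(os.listdir(images_dir))

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        image = Image.open(os.path.join(self.images_dir, name)).convert("RGB")
        mask = Image.open(os.path.join(self.masks_dir, name)).convert("L")
        # HWC uint8 -> CHW float tensor scaled to [0, 1]
        image_t = torch.from_numpy(np.array(image)).permute(2, 0, 1).float() / 255.0
        # Any non-zero mask pixel counts as "rock" (value 1), zero as background
        mask_t = torch.from_numpy((np.array(mask) > 0).astype(np.float32)).unsqueeze(0)
        return image_t, mask_t
```

Returning the mask as a single-channel float tensor of 0s and 1s keeps it directly usable with a binary loss such as `BCEWithLogitsLoss`.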
## Model Architecture

The U-Net model used in this project consists of an encoder-decoder structure:
- Encoder: Four convolutional blocks with max-pooling layers to reduce the spatial dimensions.
- Bottleneck: A convolutional block that connects the encoder and decoder.
- Decoder: Four up-convolutional blocks with skip connections from the corresponding encoder layers.
- Final Layer: A convolutional layer that outputs the final segmentation map.
## Dependencies

Ensure you have the following libraries installed (see unet_env_HowToCreate for details):
- torch
- torchvision
- Pillow
- numpy
- matplotlib
- tqdm
| Layer Type | Details |
|---|---|
| Input Layer | RGB image (3 channels) |
| **Encoder** | |
| Block 1 | Conv2D (64 filters, 3x3) → ReLU → Conv2D (64 filters, 3x3) → ReLU → MaxPooling (2x2) |
| Block 2 | Conv2D (128 filters, 3x3) → ReLU → Conv2D (128 filters, 3x3) → ReLU → MaxPooling (2x2) |
| Block 3 | Conv2D (256 filters, 3x3) → ReLU → Conv2D (256 filters, 3x3) → ReLU → MaxPooling (2x2) |
| Block 4 | Conv2D (512 filters, 3x3) → ReLU → Conv2D (512 filters, 3x3) → ReLU → MaxPooling (2x2) |
| **Bottleneck** | Conv2D (1024 filters, 3x3) → ReLU → Conv2D (1024 filters, 3x3) → ReLU |
| **Decoder** | |
| UpConv 1 | ConvTranspose2D (512 filters, 2x2) → concatenate with Encoder Block 4 |
| Block 1 | Conv2D (512 filters, 3x3) → ReLU → Conv2D (512 filters, 3x3) → ReLU |
| UpConv 2 | ConvTranspose2D (256 filters, 2x2) → concatenate with Encoder Block 3 |
| Block 2 | Conv2D (256 filters, 3x3) → ReLU → Conv2D (256 filters, 3x3) → ReLU |
| UpConv 3 | ConvTranspose2D (128 filters, 2x2) → concatenate with Encoder Block 2 |
| Block 3 | Conv2D (128 filters, 3x3) → ReLU → Conv2D (128 filters, 3x3) → ReLU |
| UpConv 4 | ConvTranspose2D (64 filters, 2x2) → concatenate with Encoder Block 1 |
| Block 4 | Conv2D (64 filters, 3x3) → ReLU → Conv2D (64 filters, 3x3) → ReLU |
| **Final Layer** | Conv2D (num_classes, 1x1) |
This table provides a clear overview of the U-Net model's structure, layer by layer, and shows the relationships between the Encoder, Bottleneck, Decoder, and Final Layer.
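The table can be sketched as a compact PyTorch module. This is a minimal sketch, not the repository's implementation: it assumes 3x3 convolutions with `padding=1` so spatial sizes are preserved within each block (the original U-Net paper uses unpadded convolutions and crops the skip connections instead):

```python
import torch
from torch import nn


def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions, each followed by ReLU; padding=1 is an assumption
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )


class UNet(nn.Module):
    def __init__(self, num_classes=1):
        super().__init__()
        chs = [64, 128, 256, 512]          # encoder channel progression
        self.encoders = nn.ModuleList()
        in_ch = 3                          # RGB input
        for ch in chs:
            self.encoders.append(double_conv(in_ch, ch))
            in_ch = ch
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = double_conv(512, 1024)
        self.upconvs = nn.ModuleList()
        self.decoders = nn.ModuleList()
        for ch in reversed(chs):           # 512, 256, 128, 64
            # Transposed conv halves channels; concatenation doubles them again
            self.upconvs.append(nn.ConvTranspose2d(ch * 2, ch, 2, stride=2))
            self.decoders.append(double_conv(ch * 2, ch))
        self.head = nn.Conv2d(64, num_classes, 1)   # final 1x1 conv

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)                # saved for the skip connection
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.upconvs, self.decoders, reversed(skips)):
            x = up(x)
            x = dec(torch.cat([skip, x], dim=1))
        return self.head(x)
```

With padded convolutions the output logits have the same spatial size as the input, so no cropping of skip connections is needed; input height and width must be divisible by 16 (four 2x2 poolings).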
### Encoder

The encoder captures context in the input image through a series of convolutional and max-pooling layers, progressively reducing the spatial dimensions while increasing the depth (number of channels) to learn hierarchical features. Each block applies two 3x3 convolutions, each followed by a ReLU activation, and ends with a 2x2 max-pooling that halves the spatial dimensions:

- Block 1: Conv2D (64 filters, 3x3) → ReLU → Conv2D (64 filters, 3x3) → ReLU → MaxPooling (2x2)
- Block 2: Conv2D (128 filters, 3x3) → ReLU → Conv2D (128 filters, 3x3) → ReLU → MaxPooling (2x2)
- Block 3: Conv2D (256 filters, 3x3) → ReLU → Conv2D (256 filters, 3x3) → ReLU → MaxPooling (2x2)
- Block 4: Conv2D (512 filters, 3x3) → ReLU → Conv2D (512 filters, 3x3) → ReLU → MaxPooling (2x2)
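The shape behavior of one encoder block can be checked directly. A minimal sketch of Block 1, again assuming `padding=1` so the convolutions preserve spatial size:

```python
import torch
from torch import nn

# Encoder Block 1: two 3x3 convs with ReLU, then 2x2 max-pooling.
# padding=1 is an assumption (the original U-Net uses unpadded convs).
block1 = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
)
pool = nn.MaxPool2d(2)

x = torch.randn(1, 3, 128, 128)   # a batch of one RGB image
features = block1(x)              # depth grows: (1, 64, 128, 128)
down = pool(features)             # spatial halves: (1, 64, 64, 64)
```

`features` is what the decoder later receives through the skip connection, while `down` continues into Block 2.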
### Bottleneck

The bottleneck serves as a bridge between the encoder and decoder, capturing the most compressed representation of the input:

- Conv2D (1024 filters, 3x3) → ReLU → Conv2D (1024 filters, 3x3) → ReLU
### Decoder

The decoder reconstructs the spatial dimensions of the input image while reusing the hierarchical features captured by the encoder through skip connections. Each step upsamples the feature map by a factor of 2 with a 2x2 transposed convolution, concatenates the result with the output of the matching encoder block, and applies two 3x3 convolutions with ReLU activations:

- UpConv 1: ConvTranspose2D (512 filters, 2x2) → concatenate with Encoder Block 4 output → Conv2D (512 filters, 3x3) → ReLU → Conv2D (512 filters, 3x3) → ReLU
- UpConv 2: ConvTranspose2D (256 filters, 2x2) → concatenate with Encoder Block 3 output → Conv2D (256 filters, 3x3) → ReLU → Conv2D (256 filters, 3x3) → ReLU
- UpConv 3: ConvTranspose2D (128 filters, 2x2) → concatenate with Encoder Block 2 output → Conv2D (128 filters, 3x3) → ReLU → Conv2D (128 filters, 3x3) → ReLU
- UpConv 4: ConvTranspose2D (64 filters, 2x2) → concatenate with Encoder Block 1 output → Conv2D (64 filters, 3x3) → ReLU → Conv2D (64 filters, 3x3) → ReLU

### Final Layer

- Conv2D (num_classes, 1x1): produces the final output with the same spatial dimensions as the input image, but with a depth equal to the number of classes (the segmentation map).
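The channel arithmetic of one decoder step (UpConv 1) can be verified with a short shape check. The tensor sizes below are hypothetical and `padding=1` is assumed:

```python
import torch
from torch import nn

# UpConv 1: upsample the 1024-channel bottleneck output to 512 channels
up1 = nn.ConvTranspose2d(1024, 512, kernel_size=2, stride=2)
# After concatenation the channel count doubles back to 1024
conv = nn.Sequential(
    nn.Conv2d(1024, 512, 3, padding=1), nn.ReLU(),
    nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
)

bottleneck_out = torch.randn(1, 1024, 16, 16)   # hypothetical bottleneck size
skip = torch.randn(1, 512, 32, 32)              # Encoder Block 4 output
x = up1(bottleneck_out)                         # -> (1, 512, 32, 32)
x = torch.cat([skip, x], dim=1)                 # -> (1, 1024, 32, 32)
x = conv(x)                                     # -> (1, 512, 32, 32)
```

This is why each decoder block's first convolution takes twice as many input channels as the transposed convolution produces: the skip connection contributes the other half.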