Creating tools to handle raster and vector data to split it into small pieces equaled in size for machine learning datasets

How To Use

Install docker https://docs.docker.com/engine/install/ (macos, Windows or Linux)
Clone the Repository :

git clone https://github.com/Youssef-Harby/split-rs-data.git
Go to project directory :

cd split-rs-data
Copy and paste your raster(.tif) and vector(.shp) files into a seperated folders :


./split-rs-data/DataSet/  # Dataset root directory
|--raster  # Original raster data
|  |--xxx1.tif (xx1.png)
|  |--...
|  └--...
|
|--vector # All shapefiles in the same place (.shx, .shp..etc)
|  |--xxx1.shp
|  |--xxx1.shx / .prj  / .cpg / .dbf ....
|  └--xxx2.shp

Build the docke image : docker compose up --build
go to http://127.0.0.1:8888/
you will find your token in the cli of the image.
Open Tutorial.ipynb to learn
Or define your vector and raster folders in multi_raster_vector.py file and run it in docker by open cli and type :

python multi_raster_vector.py

TODO

Creating Docker Image for development env.
Splitting raster data into equal pieces with rasterio (512×512) thanks to @geoyee.
Splitting raster data into equal pieces with GDAL , https://gdal.org/.
Rasterize shapefile to raster in the same satellite pixel size and projection.
Convert 24 or 16 bit raster to 8 bit.
Export as jpg (for raster) and png (for rasterized shapefile) with GDAL.
Validation of training and testing datasets for paddlepaddle.
GUI
QGIS Plugin ➡️ Deep Learning Datasets Maker

Code In Detail ⬇️

First - Prepareing Datasets

1.Convert Vector to Raster (Rasterize) with reference coordinate system from raster tiff

all these tools made for prepare data for paddlepaddlea.

from osgeo import gdal, ogr

fn_ras = Input raster data (GTiff)
fn_vec = input vector data (Shapefile)

fn_ras = 'DataSet/raster/01/01.tif'
fn_vec = 'DataSet/vector/01/01.shp'
output = 'DataSet/results/lab_all_values.tif'

import the GDAL driver "ESRI Shapefile" to open the shapefile

driver = ogr.GetDriverByName("ESRI Shapefile")

open raster and shapefile datasets with (shapefile , 1)

(shapefile , 1) read and write in the shapefile
(shapefile , 0) read onle the shapefile

ras_ds = gdal.Open(fn_ras)
vec_ds = driver.Open(fn_vec, 1)

Get the :

GetLayer (Only shapefiles have one lyrs other fomates maybe have multi-lyrs) #VECTOR
GetGeoTransform #FROM RASTER
GetProjection #FROM RASTER

lyr = vec_ds.GetLayer()
geot = ras_ds.GetGeoTransform()
proj = ras_ds.GetProjection() # Get the projection from original tiff (fn_ras)
geot

(342940.8074133941,
 0.18114600000000536,
 0.0,
 3325329.401211367,
 0.0,
 -0.1811459999999247)

Open the shapefile feature to edit in it

layerdefinition = lyr.GetLayerDefn()
feature = ogr.Feature(layerdefinition)

feature.GetFieldIndex make you to know the id of a specific field name you want to read/edit/delete

Also you can list all fields on the shapefile by :

schema = []
    for n in range(layerdefinition.GetFieldCount()):
        fdefn = layerdefinition.GetFieldDefn(n)
        schema.append(fdefn.name)

Then I will delete the field called "MLDS" has been assumed by me

yy = feature.GetFieldIndex("MLDS")
if yy < 0:
    print("MLDS field not found, we will create one for you and make all values to 1")
else:
    lyr.DeleteField(yy)

add new field to the shapefile with a default value "1" and don't forget to close feature after the edits

new_field = ogr.FieldDefn("MLDS", ogr.OFTInteger)
lyr.CreateField(new_field)
for feature in lyr:
        feature.SetField("MLDS", 1)
        lyr.SetFeature(feature)
        feature = None

Set the projection from original tiff (fn_ras) to the rasterized tiff

drv_tiff = gdal.GetDriverByName("GTiff")
chn_ras_ds = drv_tiff.Create(
        output, ras_ds.RasterXSize, ras_ds.RasterYSize, 1, gdal.GDT_Byte)
chn_ras_ds.SetGeoTransform(geot)
chn_ras_ds.SetProjection(proj)
chn_ras_ds.FlushCache()

gdal.RasterizeLayer(chn_ras_ds, [1], lyr, burn_values=[1], options=["ATTRIBUTE=MLDS"])
chn_ras_ds = None
vec_ds = None

DONE

Second - Splitting raster and rasterized files to small tiles 512×512 depends on your memory

ds = gdal.Open(fn_ras)
gt = ds.GetGeoTransform()

get coordinates of upper left corner

xmin = gt[0]
ymax = gt[3]
resx = gt[1]
res_y = gt[5]
resy = abs(res_y)

import math
import os.path as osp

the tile size i want (may be 256×256 for smaller memory size)

needed_out_x = 512
needed_out_y = 512

round up to the nearest int

xnotround = ds.RasterXSize / needed_out_x
xround = math.ceil(xnotround)
ynotround = ds.RasterYSize / needed_out_y
yround = math.ceil(ynotround)

print(xnotround)
print(xround)
print(ynotround)
print(yround)

9.30078125
10
5.689453125
6

pixel to meter - 512×10×0.18

pixtomX = needed_out_x * xround * resx
pixtomy = needed_out_y * yround * resy

print (pixtomX)
print (pixtomy)

927.4675200000274
556.4805119997686

size of a single tile

xsize = pixtomX / xround
ysize = pixtomy / yround

print (xsize)
print (ysize)

92.74675200000274
92.74675199996143

create lists of x and y coordinates

xsteps = [xmin + xsize * i for i in range(xround + 1)]
ysteps = [ymax - ysize * i for i in range(yround + 1)]
xsteps

[342940.8074133941,
 343033.5541653941,
 343126.3009173941,
 343219.0476693941,
 343311.7944213941,
 343404.54117339413,
 343497.28792539414,
 343590.03467739414,
 343682.78142939415,
 343775.5281813941,
 343868.2749333941]

set the output path

cdpath = "DataSet/image/"

loop over min and max x and y coordinates

for i in range(xround):
    for j in range(yround):
        xmin = xsteps[i]
        xmax = xsteps[i + 1]
        ymax = ysteps[j]
        ymin = ysteps[j + 1]

        # gdal translate to subset the input raster

        gdal.Translate(osp.join(cdpath,  \
                        (str("01") + "-" + str(j) + "-" + str(i) + "." + "jpg")), 
                ds, 
                projWin=(abs(xmin), abs(ymax), abs(xmax), abs(ymin)),
                xRes=resx, 
                yRes=-resy, 
                outputType=gdal.gdalconst.GDT_Byte, 
                format="JPEG")
ds = None

Third - Spilit Custom Dataset and Generate File List

For all data that is not divided into training set, validation set, and test set, PaddleSeg provides a script to generate segmented data and generate a file list.

Use scripts to randomly split the custom dataset proportionally and generate a file list

The data file structure is as follows:

./DataSet/  # Dataset root directory
|--image  # Original image catalog
|  |--xxx1.jpg (xx1.png)
|  |--...
|  └--...
|
|--label  # Annotated image catalog
|  |--xxx1.png
|  |--...
|  └--...

Among them, the corresponding file name can be defined according to needs.

The commands used are as follows, which supports enabling specific functions through different Flags.

python tools/split_dataset_list.py <dataset_root> <images_dir_name> <labels_dir_name> ${FLAGS}

Parameters:

dataset_root: Dataset root directory
images_dir_name: Original image catalog
labels_dir_name: Annotated image catalog

FLAGS:

FLAG	Meaning	Default	Parameter numbers
--split	Dataset segmentation ratio	0.7 0.3 0	3
--separator	File list separator	"	"
--format	Data format of pictures and label sets	"jpg" "png"	2
--label_class	Label category	'__background__' '__foreground__'	several
--postfix	Filter pictures and label sets according to whether the main file name (without extension) contains the specified suffix	"" ""（2 null characters）	2

After running, train.txt, val.txt, test.txt and labels.txt will be generated in the root directory of the dataset.

Note: Requirements for generating the file list: either the original image and the number of annotated images are the same, or there is only the original image without annotated images. If the dataset lacks annotated images, a file list without separators and annotated image paths will be generated.

Example

python tools/split_dataset_list.py <dataset_root> images annotations --split 0.6 0.2 0.2 --format jpg png

Dataset file organization

If you need to use a custom dataset for training, it is recommended to organize it into the following structure: custom_dataset | |--images | |--image1.jpg | |--image2.jpg | |--... | |--labels | |--label1.png | |--label2.png | |--... | |--train.txt | |--val.txt | |--test.txt

The contents of train.txt and val.txt are as follows:

image/image1.jpg label/label1.png
image/image2.jpg label/label2.png
...

Full Docs : https://github.com/PaddlePaddle/PaddleSeg/blob/release/2.3/docs/data/custom/data_prepare.md

import sys
import subprocess

theproc = subprocess.Popen([
"python", 
r"C:\Users\Youss\Documents\pp\New folder\split-rs-data\split_dataset_list.py", #Split text py script
r"C:\Users\Youss\Documents\pp\New folder\split-rs-data\DataSet",  # Root DataSet ath
r"C:\Users\Youss\Documents\pp\New folder\split-rs-data\DataSet\image",  #images path
r"C:\Users\Youss\Documents\pp\New folder\split-rs-data\DataSet\label", 
# "--split", 
# "0.6",  # 60% training
# "0.2",  # 20% validating
# "0.2",  # 20% testing
"--format", 
"jpg", 
"png"])
theproc.communicate()

(None, None)

Youssef-Harby / split-rs-data