spaceml-org / ml4floods

An ecosystem of data, models and code pipelines to tackle flooding with ML

Home Page: https://spaceml-org.github.io/ml4floods/

[DataPrep] WorldFloods 1.1 and WorldFloods 2.0 Full Pipeline

jejjohnson opened this issue · comments

The full preprocessing pipeline for WorldFloods 1.1 and WorldFloods 2.0:

  1. Query Copernicus EMS
  2. Generate Floodmaps
  3. Query GEE with FloodMaps
  4. Generate GT with floodmaps and S2 images.

Visual Pipeline

[Pipeline diagram: SpaceML WorldFloods 1.1 / 2.0]


Current Contributors


Demo

All of the steps below are based on the demo notebook found here:

There are no .cog file format considerations (that we know of)...


Copernicus Query & Save (ingest.py)

  • Download the zip files from Copernicus EMS
  • Unzip the files into the appropriate directory structure (see the sketch below)
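
A minimal sketch of this step, assuming a direct download URL for the EMS vector package; the real ingest.py may handle the EMS download form and directory layout differently:

```python
# Sketch (hypothetical paths/URLs): download one Copernicus EMS vector
# package and unzip it into a per-activation directory.
import os
import zipfile

import requests


def download_and_unzip(zip_url: str, out_dir: str) -> str:
    """Download one EMS zip file and extract it under out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    local_zip = os.path.join(out_dir, os.path.basename(zip_url))

    # Stream the zip to disk to avoid holding large files in memory.
    with requests.get(zip_url, stream=True) as r:
        r.raise_for_status()
        with open(local_zip, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)

    # Extract into a folder named after the zip file.
    extract_dir = os.path.join(out_dir, os.path.splitext(os.path.basename(local_zip))[0])
    with zipfile.ZipFile(local_zip) as zf:
        zf.extractall(extract_dir)
    return extract_dir
```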

Copernicus Post-Processing (hardutils.py)

These steps are necessary to acquire the floodmaps, which can then be used for visualization or for the MLOps pipeline.

  • Search through the unzipped files to get the shapefiles (3 shapefile groups; see the sketch after this list)
    • Area of Interest
    • Observed event
    • Hydrography (rivers); line and area (L/A) sub-categories
  • Build the Copernicus metadata with the filenames of these items
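
A rough sketch of the shapefile search, assuming the usual EMS layer names (areaOfInterest, observedEvent, hydrography); the exact filename patterns vary between activations, so the actual hardutils.py helpers are likely more robust:

```python
# Sketch (assumed EMS naming conventions): walk an unzipped activation folder
# and record the paths of the three shapefile groups in a metadata dict.
import glob
import os


def build_copernicus_meta(unzipped_dir: str) -> dict:
    shapefiles = glob.glob(os.path.join(unzipped_dir, "**", "*.shp"), recursive=True)

    def find(tag: str):
        # Match case-insensitively, ignoring underscores, since EMS naming varies.
        return [f for f in shapefiles
                if tag in os.path.basename(f).lower().replace("_", "")]

    return {
        "area_of_interest": find("areaofinterest"),
        "observed_event": find("observedevent"),
        # Hydrography ships as line (L) and area (A) layers.
        "hydrography": find("hydrography"),
    }
```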

Build FloodMap (softutils.py)

  • Query the Copernicus metadata to get the filenames and shapefiles
  • Open them with geopandas
  • Collapse the polygons into a single layer with class labels (e.g. flood, hydro, ...)
  • Convert the merged shapefile to GeoJSON (see the sketch after this list)
  • Store the GeoJSON in the ml4floods_data_lake_ETL bucket
  • Store the floodmap metadata that was queried...? @gonzmg88
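
A rough sketch of this step with geopandas, assuming the metadata dict from the previous sketch and an illustrative label column name (w_class):

```python
# Sketch: merge observed-event and hydrography polygons into one labelled
# floodmap layer and write it out as GeoJSON. Assumes all layers share a CRS.
import geopandas as gpd
import pandas as pd


def build_floodmap(meta: dict, out_geojson: str) -> gpd.GeoDataFrame:
    layers = []
    for path in meta["observed_event"]:
        gdf = gpd.read_file(path)
        gdf["w_class"] = "flood"      # label column name is an assumption
        layers.append(gdf)
    for path in meta["hydrography"]:
        gdf = gpd.read_file(path)
        gdf["w_class"] = "hydro"      # permanent water (rivers/lakes)
        layers.append(gdf)

    floodmap = pd.concat(layers, ignore_index=True)
    floodmap.to_file(out_geojson, driver="GeoJSON")
    return floodmap
```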

Sentinel-2 (S2) (ingest.py)

We use the GeoJSON files to get a bounding box to query GEE for S2 images (a sketch follows the note below).

  • Query the database (e.g., by date, event, alert) for the stored GeoJSON files
  • Using the polygons, query the GEE platform
  • Download the intersecting S2 tiles to a bucket
  • Save them as .cog

Note: ee_download.py
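
A hedged sketch of the GEE query and export, assuming an Earth Engine account, a floodmap GeoJSON in lon/lat, and illustrative dates, paths, and bucket name; ee_download.py is the real implementation:

```python
# Sketch: query S2 tiles intersecting the floodmap bounds and export them
# to Cloud Storage as COG GeoTIFFs.
import ee
import geopandas as gpd

ee.Initialize()

floodmap = gpd.read_file("floodmap.geojson")                # assumed path, EPSG:4326
bounds = ee.Geometry.Rectangle([float(x) for x in floodmap.total_bounds])

s2 = (ee.ImageCollection("COPERNICUS/S2")                   # collection id illustrative
      .filterBounds(bounds)
      .filterDate("2019-10-01", "2019-10-15"))              # event window (example)

image = s2.mosaic().clip(bounds)
task = ee.batch.Export.image.toCloudStorage(
    image=image,
    description="s2_export_demo",
    bucket="ml4floods_data_lake_ETL",                       # bucket name from this issue; may differ for S2
    fileNamePrefix="S2/demo_event",
    region=bounds,
    scale=10,
    fileFormat="GeoTIFF",
    formatOptions={"cloudOptimized": True},                 # save as COG
)
task.start()
```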

Build Ground Truth (softutils.py)

Cloud or No Cloud

  • Query the S2 database for images in the ROI
  • Take the cloud probability from the last band of the S2 image
  • Save the TIFF files to the data_lake_mlmart bucket (see the sketch below)
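
A hedged sketch of the cloud mask, assuming (as in the cloudprob_in_lastband branch mentioned further down) that the exported S2 GeoTIFF carries the cloud probability in its last band; the threshold is illustrative:

```python
# Sketch: binary cloud mask (1 = cloud) from the last band of an S2 tile.
import numpy as np
import rasterio


def cloud_mask_from_s2(s2_tiff: str, threshold: float = 50.0) -> np.ndarray:
    with rasterio.open(s2_tiff) as src:
        cloud_prob = src.read(src.count)       # last band: cloud probability
    return (cloud_prob >= threshold).astype(np.uint8)
```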

Water or No Water

  • Query the S2 database for images in the ROI
  • Magic... see create_gt.py (a rough sketch follows this list)
  • Save the TIFF files to the data_lake_mlmart bucket
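
The actual logic lives in create_gt.py; as a rough stand-in, here is a rasterio-based sketch that burns the floodmap polygons onto the S2 tile's grid to get a water / no-water mask:

```python
# Sketch only: rasterize floodmap polygons onto the S2 tile's grid.
import geopandas as gpd
import rasterio
from rasterio import features


def water_mask_from_floodmap(floodmap_geojson: str, s2_tiff: str):
    with rasterio.open(s2_tiff) as src:
        transform, shape, crs = src.transform, (src.height, src.width), src.crs

    floodmap = gpd.read_file(floodmap_geojson).to_crs(crs)
    # Burn every flood/hydro polygon as 1 on a background of 0.
    mask = features.rasterize(
        ((geom, 1) for geom in floodmap.geometry),
        out_shape=shape,
        transform=transform,
        fill=0,
        dtype="uint8",
    )
    return mask
```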

Visualization Territory

  • Query GEE given the floodmaps
  • Save them in .cog format to the data_lake_vizmart bucket (a sketch follows this list)
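
Where a tile is produced locally first, a hedged sketch of the COG conversion using rasterio's copy helper (assumes GDAL >= 3.1 for the COG driver; the actual viz-mart code may differ):

```python
# Sketch only: copy an ordinary GeoTIFF into a Cloud Optimized GeoTIFF.
import rasterio.shutil


def save_as_cog(src_tiff: str, dst_cog: str) -> None:
    rasterio.shutil.copy(src_tiff, dst_cog, driver="COG", compress="deflate")
```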

@Lkruitwagen Any opinion about any intermediate steps?

I've divided up the notebook here. It goes over the pipeline, but with everything saved locally; it needs to be converted to save to the bucket instead. @jejjohnson @satyarth934

The image export from GEE already saves the S2 image as a COG GeoTIFF:

formatOptions={"cloudOptimized": True},

This part of the code has the "smart" handling of the cloud or no cloud:

if cloudprob_in_lastband:

Actually, I would save that tutorial as-is (i.e. saving locally at every step in the pipeline after the query to Copernicus EMS, like you already did @nadia-eecs) for the demos that we show people. Outside users running a notebook demo like that probably won't have write access to the buckets, so it's a nice tutorial for showing people how the pipeline works.

For the scripts that generate all of the data, we can change it to save to the bucket (a small upload sketch is below). And then for MLOps, we can show people how to access the already-saved artifacts in the bucket at every point in the pipeline, e.g. floodmaps, S2 images, and ground truth(s), as those are probably the only parts they'll care about.
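
For the bucket-saving piece, a minimal sketch with google-cloud-storage, assuming the bucket name from this issue and a hypothetical object path:

```python
# Sketch (assumed bucket/object names): upload a locally generated artifact
# to the corresponding GCS bucket instead of keeping it on disk.
from google.cloud import storage


def upload_to_bucket(local_path: str, bucket_name: str, blob_name: str) -> None:
    client = storage.Client()                     # uses default credentials
    bucket = client.bucket(bucket_name)
    bucket.blob(blob_name).upload_from_filename(local_path)


# e.g. push a floodmap GeoJSON to the ETL data lake (illustrative path)
upload_to_bucket("floodmap.geojson", "ml4floods_data_lake_ETL",
                 "floodmaps/demo_event/floodmap.geojson")
```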

I agree; I'd say notebooks 1-4 that @nadia-eecs made are nice for internal use (to figure out how the ingestion pipeline works), and the previous notebook is nice as a tutorial for external users. In that tutorial we can even change the export part to export the S2 image to the user's Google Drive.

However, I'd say prioritize the ingestion pipeline and put the tutorial notebook in the backlog!