[DataPrep] WorldFloods 1.1 and WorldFloods 2.0 Full Pipeline
jejjohnson opened this issue
The full preprocessing pipeline for WorldFloods 1.1 and WorldFloods 2.0:
- Query Copernicus EMS
- Generate Floodmaps
- Query GEE with FloodMaps
- Generate GT with floodmaps and S2 images.
Visual Pipeline
Current Contributors
Demo
All of the steps below are based on the demo notebook found here:

No `.cog` file format considerations (that we know of...)...
Copernicus Query & Save (`ingest.py`)
- Download the Zip Files from Copernicus EMS
- Unzip the files into appropriate file directory structure
Copernicus Post-Processing (`hardutils.py`)
These steps are necessary to acquire the floodmaps; the floodmaps can then be used for visualization OR for the MLOps.
- Search through the unzipped files to get the shape files (3 shape files):
  - Area of Interest
  - Observed event
  - Hydrography (river); sub-categories l/a
- Build the Copernicus Meta with Filenames of the items
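The search through the unzipped files could be sketched as below. This is a minimal, dependency-free sketch: the `SHAPEFILE_KINDS` substrings are hypothetical placeholders, not the actual Copernicus EMS naming convention, which `hardutils.py` would need to match.

```python
from pathlib import Path

# Hypothetical filename substrings; the real Copernicus EMS products
# follow their own naming convention, so adjust these tokens accordingly.
SHAPEFILE_KINDS = {
    "area_of_interest": "areaOfInterest",
    "observed_event": "observedEvent",
    "hydrography": "hydrography",
}

def find_shapefiles(unzip_dir):
    """Search an unzipped EMS product for the three shapefiles of interest.

    Returns a dict mapping each kind to the first matching .shp path,
    or None when no file matches.
    """
    found = {}
    for kind, token in SHAPEFILE_KINDS.items():
        matches = [p for p in Path(unzip_dir).rglob("*.shp")
                   if token.lower() in p.name.lower()]
        found[kind] = matches[0] if matches else None
    return found
```

The returned filenames can then feed the Copernicus metadata build step.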
Build FloodMap (`softutils.py`)
- query the Copernicus metadata to get the names and shape files
- open with geopandas
- collapse polygons with labels (e.g. flood, hydro, ...)
- convert the new shape file to `geojson`
- store the `geojson` to the `ml4floods_data_lake_ETL` bucket
- store the floodmap metadata that was queried...? @gonzmg88
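The collapse step above could look roughly like the sketch below. In the actual pipeline this would be geopandas (`gpd.read_file(...).dissolve(by=...)` then `to_file(..., driver="GeoJSON")`); this version works on plain GeoJSON dicts so it is dependency-free, and the `label_key` property name is an assumption.

```python
from collections import defaultdict

def collapse_by_label(features, label_key="w_class"):
    """Collapse GeoJSON features sharing a label into one MultiPolygon each.

    `label_key` is a hypothetical attribute name; the real shapefiles may
    store the flood / hydrography class under a different property.
    """
    grouped = defaultdict(list)
    for feat in features:
        label = feat["properties"].get(label_key, "unknown")
        geom = feat["geometry"]
        if geom["type"] == "Polygon":
            grouped[label].append(geom["coordinates"])
        elif geom["type"] == "MultiPolygon":
            grouped[label].extend(geom["coordinates"])
    return {
        "type": "FeatureCollection",
        "features": [
            {"type": "Feature",
             "properties": {label_key: label},
             "geometry": {"type": "MultiPolygon", "coordinates": polys}}
            for label, polys in grouped.items()
        ],
    }
```

The resulting FeatureCollection can be serialized with `json.dump` and stored as the `geojson` artifact.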
Sentinel-2 (S2) (`ingest.py`)
We use the `geojson` files to get a bounding box to query GEE for S2 images:
- query the database (e.g., data, event, alert) for stored `geojson` files
- using the polygons, query the GEE platform
- download the S2 tiles that intersect, to a bucket
- save as `.cog`
- pipe to Viz Mart @Lkruitwagen

Note: `ee_download.py`
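Getting the bounding box from a stored `geojson` file is the step that bridges the floodmaps and the GEE query; a minimal sketch (plain dicts, no geopandas) could be:

```python
def bounds_from_geojson(fc):
    """Compute (minx, miny, maxx, maxy) over a GeoJSON FeatureCollection.

    The box could then be turned into e.g. an ee.Geometry.Rectangle to
    restrict the GEE Sentinel-2 query to the flood area (the GEE call
    itself lives in ee_download.py and is not reproduced here).
    """
    xs, ys = [], []
    for feat in fc["features"]:
        geom = feat["geometry"]
        polys = ([geom["coordinates"]] if geom["type"] == "Polygon"
                 else geom["coordinates"])
        for poly in polys:
            for ring in poly:
                for x, y in ring:
                    xs.append(x)
                    ys.append(y)
    return min(xs), min(ys), max(xs), max(ys)
```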
Build Ground Truth (`softutils.py`)
Cloud or No Cloud
- Query the S2 database for images in the ROI
- Take the last band from the S2 image
- Save the `tiff` files to the `data_lake_mlmart` bucket
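Binarizing the last band into a cloud mask could be sketched as below. The band ordering (cloud probability last, in [0, 1]) and the 0.5 threshold are assumptions for illustration; `create_gt.py` holds the actual handling.

```python
import numpy as np

def cloud_mask(s2_image, threshold=0.5):
    """Turn the last band of an S2 array into a 0/1 cloud mask.

    Assumes s2_image has shape (H, W, bands) with the last band holding
    a cloud probability in [0, 1]; both are assumptions, not the
    confirmed layout used by the pipeline.
    """
    cloud_prob = s2_image[..., -1]
    return (cloud_prob >= threshold).astype(np.uint8)
```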
Water or No Water
- Query the S2 database for images in the ROI
- Magic...... See: `create_gt.py`
- Save the `tiff` files to the `data_lake_mlmart` bucket
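Once the water and cloud masks exist, combining them into a single ground-truth band might look like the sketch below. The class encoding (1 = land, 2 = water, 3 = cloud, clouds overriding water) is an assumption to be checked against `create_gt.py`.

```python
import numpy as np

def build_gt(water_mask, cloud_mask):
    """Combine binary water and cloud masks into one ground-truth band.

    Hypothetical encoding: 1 = land, 2 = water, 3 = cloud; cloudy pixels
    override water since the surface is not observable under cloud.
    """
    gt = np.ones_like(water_mask, dtype=np.uint8)  # start with land everywhere
    gt[water_mask == 1] = 2                        # mark water pixels
    gt[cloud_mask == 1] = 3                        # clouds win over water
    return gt
```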
Visualization Territory
- Query GEE given the floodmaps
- Save them in `.cog` format to the `data_lake_vizmart` bucket
@Lkruitwagen Any opinion about any intermediate steps?
Divided up the notebook here. It goes over the pipeline, but everything is saved locally and needs to be converted to save to the bucket instead. @jejjohnson @satyarth934
The export image from GEE already saves the S2 image as a COG GeoTIFF:
`ml4floods/src/data/ee_download.py`, line 232 in `17b4552`
This part of the code has the "smart" handling of the cloud or no cloud:
`ml4floods/src/data/create_gt.py`, line 173 in `17b4552`
Actually, I would save that tutorial as-is (i.e. saving locally for every step in the pipeline following the query to Copernicus EMS, like you already did @nadia-eecs) for the demos that we show people. An outside user running a notebook demo like that probably won't have write access to the buckets, so it's a nice tutorial for showing people how the pipeline works.
For the scripts that generate all of the data, we can change it to saving to the bucket. And then for MLOps, we can show them how to access the already-saved artifacts in the bucket at every point in the pipeline, e.g. floodmaps, S2 images, and ground truth(s), as those are the only parts they'll probably care about.
I agree, I'd say notebooks 1-4 that @nadia-eecs made are nice for internal use (to figure out how the ingestion pipeline works) and the previous notebook is nice as a tutorial for external users. In that tutorial we can even change the export part to export the S2 image to the user's Google Drive.
However, I'd say prioritize the ingestion pipeline and put the tutorial notebook in the backlog!