More liberal pattern when running `terracotta ingest`
tomalrussell opened this issue
It would be convenient to allow some punctuation marks as well as alphanumeric characters in the regex pattern matching key values.
I have a raster pattern like:
{type}__rp_{rp}__rcp_{rcp}__epoch_{epoch}__gcm_{gcm}.tif
and files like:
cyclone__rp_10__rcp_8.5__epoch_2050__gcm_CMCC-CM2-VHR4.tif
river__rp_2__rcp_8.5__epoch_2030__gcm_MIROC-ESM-CHEM.tif
The regex match is too strict to allow the `.` or `-` in values - could it be relaxed? I can patch the filenames as a workaround, but a quick edit to the key bit of the regex so it's just `[^_]+` seems to work okay locally:
https://github.com/DHI-GRAS/terracotta/blob/b7c67c3c2736401295644c1e8882b3f0f013bb5c/terracotta/scripts/click_types.py#L74
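The proposed relaxation amounts to swapping the character class used for each `{key}` placeholder. The following is a minimal sketch of that idea using only the standard library, not terracotta's actual implementation: `pattern_to_regex` is a hypothetical helper, and the strict class `[a-zA-Z0-9]+` is an assumption based on the "alphanumeric characters" behaviour described in this issue.

```python
import re
import string

PATTERN = "{type}__rp_{rp}__rcp_{rcp}__epoch_{epoch}__gcm_{gcm}.tif"


def pattern_to_regex(pattern: str, value_regex: str) -> re.Pattern:
    """Turn a '{key}'-style pattern into a regex with one named group per key.

    Hypothetical helper for illustration only.
    """
    parts = []
    for literal, field, _spec, _conversion in string.Formatter().parse(pattern):
        # literal text between placeholders must match exactly
        parts.append(re.escape(literal))
        if field is not None:
            parts.append(f"(?P<{field}>{value_regex})")
    return re.compile("".join(parts))


strict = pattern_to_regex(PATTERN, r"[a-zA-Z0-9]+")  # assumed current behaviour
relaxed = pattern_to_regex(PATTERN, r"[^_]+")  # relaxation suggested in this issue

name = "cyclone__rp_10__rcp_8.5__epoch_2050__gcm_CMCC-CM2-VHR4.tif"
print(strict.fullmatch(name))  # no match: "." and "-" are not alphanumeric
print(relaxed.fullmatch(name).groupdict())
```

The relaxed class only works as long as the key values themselves never contain an underscore, since `_` is what separates the keys in these filenames.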
Unfortunately this is an unsolvable problem. Raster patterns are a confusing mess, and I am hesitant to make it even more confusing by capturing more stuff. They are really meant only for the simplest of use cases.
Is there any particular reason why you don't want to use the Python API for ingestion like we recommend in the docs?
I could be nudged to support regex patterns as a power-user feature:
$ terracotta ingest --raster-regex "(?P<type>\w+)__rp_(?P<rp>\d+)__rcp_(?P<rcp>\d+\.\d+)__epoch_(?P<epoch>\d+)__gcm_(?P<gcm>[\w-]+)\.tif"
But I think you have to agree that the patterns are quite messy, so it might be easier to use the Python API :)
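The `--raster-regex` option above is only a proposal in this thread, but the regular expression itself can be checked with Python's `re` module. A small sketch, where `extract_keys` is a hypothetical helper name and the filenames are the ones from the original report:

```python
import re

# The named-group regex from the comment above, split over two lines for readability
RASTER_REGEX = re.compile(
    r"(?P<type>\w+)__rp_(?P<rp>\d+)__rcp_(?P<rcp>\d+\.\d+)"
    r"__epoch_(?P<epoch>\d+)__gcm_(?P<gcm>[\w-]+)\.tif"
)

FILENAMES = [
    "cyclone__rp_10__rcp_8.5__epoch_2050__gcm_CMCC-CM2-VHR4.tif",
    "river__rp_2__rcp_8.5__epoch_2030__gcm_MIROC-ESM-CHEM.tif",
]


def extract_keys(filename: str):
    """Return the key values as a dict of strings, or None if there is no match."""
    match = RASTER_REGEX.fullmatch(filename)
    return match.groupdict() if match else None


for filename in FILENAMES:
    print(filename, "->", extract_keys(filename))
```

Each group is as strict as its sub-pattern: `rcp` here requires a decimal number, so a filename with `rcp_4x5` would not match.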
I can see the problem. I guess I asked because the `ingest` subcommand almost does what I want and tweaking it seemed easier than learning how to use the Python API.
I can't see myself typing out that `--raster-regex` example (correctly, first time!), and at that point I'd be writing some kind of script anyway.
I wonder if a simpler example script in the docs might help:
import os
from typing import Dict, List

import tqdm
import terracotta

# Define the location of the SQLite database
# (this will be created if it doesn't already exist)
DB_NAME = "./terracotta.sqlite"

# Define the list of keys that will be used to identify datasets.
# (these need to match the key_values dicts defined in RASTER_FILES below)
KEYS = ["type", "rp", "rcp", "epoch", "gcm"]

# Define a list of raster files to import
# (this is a list of dictionaries, each with a file path and the values for
# each key - make sure the order matches the order of KEYS defined above)
#
# This part of the script could be replaced with something that makes sense for
# your data - it could use a glob expression to find all TIFFs and a regular
# expression pattern to extract the key values, or it could read from a CSV,
# or use some other reference or metadata generating process.
RASTER_FILES = [
    {
        "key_values": {
            "type": "river",
            "rp": 250,
            "rcp": 4.5,
            "epoch": 2030,
            "gcm": "NorESM1-M",
        },
        "path": "./data/river__rp_250__rcp_4x5__epoch_2030__gcm_NorESM1-M.tif",
    },
    {
        "key_values": {
            "type": "river",
            "rp": 500,
            "rcp": 8.5,
            "epoch": 2080,
            "gcm": "NorESM1-M",
        },
        "path": "./data/river__rp_500__rcp_8x5__epoch_2080__gcm_NorESM1-M.tif",
    },
]


def load(db_name: str, keys: List[str], raster_files: List[Dict]):
    driver = terracotta.get_driver(db_name)

    # create an empty database if it doesn't exist
    if not os.path.isfile(db_name):
        driver.create(keys)

    # sanity check that the database has the same keys that we want to load
    assert list(driver.key_names) == keys, (driver.key_names, keys)

    progress_bar = tqdm.tqdm(raster_files)
    for raster in progress_bar:
        progress_bar.set_postfix(file=raster["path"])
        with driver.connect():
            driver.insert(raster["key_values"], raster["path"])


if __name__ == "__main__":
    load(DB_NAME, KEYS, RASTER_FILES)
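The script's comments mention that the hand-written list could be replaced by a glob expression plus a regular expression. One way that might look is sketched below, using only the standard library. `FILENAME_REGEX` and `find_raster_files` are hypothetical names, and the `[^_]+` value pattern assumes that key values never contain an underscore.

```python
import re
from pathlib import Path
from typing import Dict, List

# Hypothetical filename regex: one named group per key, in the same order as
# KEYS; values may contain anything except "_" (so "8.5" and "NorESM1-M" match).
FILENAME_REGEX = re.compile(
    r"(?P<type>[^_]+)__rp_(?P<rp>[^_]+)__rcp_(?P<rcp>[^_]+)"
    r"__epoch_(?P<epoch>[^_]+)__gcm_(?P<gcm>[^_]+)\.tif"
)


def find_raster_files(data_dir: str) -> List[Dict]:
    """Build a RASTER_FILES-style list from the TIFFs found in data_dir."""
    raster_files = []
    for path in sorted(Path(data_dir).glob("*.tif")):
        match = FILENAME_REGEX.fullmatch(path.name)
        if match is None:
            # skip files that do not follow the naming scheme
            continue
        raster_files.append({"key_values": match.groupdict(), "path": str(path)})
    return raster_files
```

The result has the same shape as `RASTER_FILES`, so it could be passed as the third argument to `load`. All key values come out as strings, exactly as they appear in the filename.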
I can draft a PR with an attempt at adding to the docs if you like - otherwise do close this, some version of using the API is the way forward 😊
This looks awesome, thanks! I would gladly accept a PR on this.