Collection properties that become relevant when datasets are merged

Question

Collection properties that become relevant when datasets are merged

m-mohr opened this issue 9 months ago · comments

If you merge multiple datasets, you may want to have properties for some properties that are defined in the collections. For example:

Collection ID #10 (core?)
License #2 (extension)
Provider (extension)
... (extension)

Andy Jenkinson · Answer 1 · Sat Mar 16 2024 20:08:04 GMT+0800 (China Standard Time)

I don't really follow what you mean? There are many many things that have to change when you merge datasets; that's essentially what GFID does, it takes data from many sources (some large datasets, but also individual crowdsourcing) and offers them back out again as datasets which have different facets depending on how you want to consume them (e.g. 'all current field boundaries' or 'all distinct boundaries through time' or 'each individual representation of any boundary' or 'all the current field boundaries in Brazil').

Almost all the metadata properties attached to a boundary become arrays when you deduplicate the boundary, and collection-level items like dates either change or lose meaning. Typically you can't just naively merge collections because you cannot guarantee uniqueness in the union set, which is why the merged collection is really a different collection from the original one.

Perhaps you can take a look at how we solve this in GFID to understand the data model relationships. This page has a high level overview of the data types:
https://developer.varda.ag/gfid-identifiers
And this also has a diagram showing the boundary/boundary reference relationship (you can ignore the permissioning stuff.
You can see in the API reference that we have several types of collection:

Boundary references: these are the individual objects according to the source, presented as one big collection. Each object may have come from a different source, and because of that there can be items that have the same geometries, but different metadata. Each object has a guaranteed-unique ID that we assign, and any other metadata the source provided (e.g. their own ID, which may or may not be unique, dates, crop type etc). Some properties are 'standardised' like the determination method.
Boundaries: these are spatially deduplicated across multiple sources, again presented as one big FeatureCollection. They have no properties except those mandatory for a geometry, because all other properties are opinions of the underlying 'source' (analogous to a dataset) and are therefore stored in a child object, the 'boundary reference'

There is also 'fields' but this is not actually a spatial collection (this is why the API is not just an OGC Features API, because it writeable and deals with concepts like relationships between boundaries and between fields & boundaries, temporal change etc)

Both collections can be arbitrarily filtered, for example "give me only the features in this bbox, from this source, of this determination method". So you can make any arbitrary subcollection you like and the URI of the request would technically be usable as an ID for that collection. All of those collections would be guaranteed to have unique IDs, but it doesn't make sense to embed references between them at the object level because there are an infinite number of them - a feature is part of many collections. When we make static exports available soon, there will be distinct collections that can be given a name to though, like I mentioned in the first paragraph, and this is where we'd put dataset level metadata if it had to be in STAC format. Otherwise, it'd only exist at the level of the FeatureCollection in the API.

Matthias Mohr · Answer 2 · Wed Mar 20 2024 00:12:57 GMT+0800 (China Standard Time)

@andyjenkinson To clarify what I meant with merging:

Let's say we have a dataset that follows fiboa from the US and one from Germany in GeoParquet. I want to have them in the same GeoParquet. In this case I may want to have some columns that identify identify details such as license that otherwise for Germany would the all the same values and as such are in the collection.

Let's say US has columns id, geometry, area and us_state and a collection with provider ABC and license CC-0
Let's say Germany has columns id, geometry, area, perimeter and inspire:id and a collection with provider XYZ and license Apache-2.0

The merged dataset could have the following columns: id, geometry, area, perimeter, us_state, inspire:id, and collection. But some may also want to have provider and license.
So in this case it would be great to have schema definitions for provider and license.
The issue that I was looking at is that in the STAC collection (ignoring #4 for this explanation) where this data currently comes from are defined in schemas, but the Provider Object might be too complex for our use case here.

So this issue is just about this and how we want to tackle this. I think you were thinking about something else, where you also need to merge geometries or other things. I was more on the simpler side of things where data actually has no overlap and it's just about the properties.... Makes sense?

Matthias Mohr · Answer 3 · Fri Apr 12 2024 22:13:26 GMT+0800 (China Standard Time)

Another issue that we need to discuss: What happens when files are merged that have different extensions implemented with required fields? The current "required" implementation assumes non-nullable fields, which in case of a merge fails.

Andy Jenkinson · Answer 4 · Fri Apr 12 2024 23:22:25 GMT+0800 (China Standard Time)

To be honest I'd been treating multiple collections as out of scope for now because it's much more complex than the examples considered to date. After all that's what our system actually does - we merge many separate collections (we call them "sources") and serve them out through the API in multiple ways:

As sets of "boundary references", each of which is a single geometry + metadata "opinion" about a boundary.
As a deduplicated and merged set l, each feature being a single normalised geometry containing multiple metadata objects (one per unique reference) for all of the stuff like determinationMethod, dates, IDs etc to live.

There are many issues with the current Fiboa data model for representing the latter "multiple collections in a single file" case, since the values of most of the fields can different between collections.

Not sure what this has to do with this topic though so probably best to talk it through separately

Matthias Mohr · Answer 5 · Tue Apr 23 2024 15:40:15 GMT+0800 (China Standard Time)

Indeed, moved the discussion about required (extension) fields to #26

Matthias Mohr · Answer 6 · Fri Aug 30 2024 22:27:09 GMT+0800 (China Standard Time)

Closing here, we won't distinguish between collection properties and other properties anymore.