Swirrl / table2qb

A generic pipeline for converting tabular data into RDF data cubes

Make dataset, code and codelist URIs configurable

Robsteranium opened this issue

The codelist pipeline establishes a URI pattern for codes (and for the concept-scheme URI itself) by slugising the label assigned to the codelist. This convention suits us but is a little magic. We could instead have the pattern passed to the pipeline as an optional parameter, defaulting back to a slug derived from the label.
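
For concreteness, slugising here means something like the following Clojure sketch (illustrative only, not table2qb's actual implementation):

(require '[clojure.string :as str])

;; Lower-case the label, replace runs of non-alphanumeric characters with
;; hyphens, and trim any stray leading/trailing hyphens.
(defn slugise [label]
  (-> label
      str/lower-case
      (str/replace #"[^a-z0-9]+" "-")
      (str/replace #"(^-|-$)" "")))

(slugise "Age Group") ;; => "age-group"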

The cube pipeline should also allow for configuration of dataset-uris (this convention would then need to be applied to related resources such as the DSD and CompSpecs).

As per #71 users would like to be able to configure the whole of the code URI.

As per #81 I'm extending this to cover dataset-uris too.

For the codelist-pipeline this would mean changing the signature of codelist-metadata - removing domain-def and codelist_slug - leading to something like the following (to recreate the existing conventions):

csv_url "age-groups.csv"
codelist_name "Age Group Codelist"
codelist_uri_template "http://example.net/def/concept-scheme/age-groups"
code_uri_template "http://example.net/def/concept/age-groups/{notation}"
parent_uri_template "http://example.net/def/concept/age-groups/{parent_notation}"
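
For illustration, the {notation} and {parent_notation} placeholders would be filled from each row of the codes CSV, roughly along these lines (a sketch only; the function name is invented and URI-encoding of values is omitted):

(require '[clojure.string :as str])

;; Fill {placeholder} slots in a URI template from a map of row values.
(defn expand-template [template row]
  (str/replace template #"\{([^}]+)\}"
               (fn [[_ k]] (str (get row k "")))))

(expand-template "http://example.net/def/concept/age-groups/{notation}"
                 {"notation" "0-15"})
;; => "http://example.net/def/concept/age-groups/0-15"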

For consistency we should also change the components pipeline. Instead of having components-metadata take a domain-def param, we would configure it as follows:

ontology-uri "http://example.net/ontology/components"
component-uri-template "http://example.net/def/{component_type_slug}/{notation}"

We could then require some further fields to be provided explicitly as columns in the source component-csv (see the sketch after this list):

  • a "Range" column for class URIs for the rdfs:range property (or removing this entirely as we haven't yet wired this up with the codelist-pipeline)
  • a "Notation" column instead of having table2qb derive this by slugging the label

Removing conventions from the cube-pipeline would be trickier.

A component-specification-template would be simple (e.g. set to "http://example.net/data/population/{component_slug}" by convention), as would dataset-uri "http://example.net/data/population" and dataset-structure-definition-uri "http://example.net/data/population/structure".

Two other URIs are not as straightforward...

The code-used-uri requires two templates: one in the context of the pipeline that creates the skos:Collections themselves (e.g. "http://example.net/data/population/codes-used/{component_slug}") and another in the context of the pipeline that populates the codes from the cube margins ("http://example.net/data/population/codes-used/{_name}"). These should resolve to the same URI; the templates differ because in one case the input file has one row per component and in the other one row per observation, so the variables available to populate the template differ. This is particularly unsatisfactory because the need for two templates arises from an implementation detail that users won't know about and really shouldn't have to worry about. Indeed many users will not even know or care about the http://publishmydata.com/def/qb/codesUsed property, as this is a non-standard extension to the rdf-cube vocabulary created by Swirrl. We could remove the code-used pipeline from the table2qb library and extract it to e.g. a pmd-table2qb application.
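
To make the mismatch concrete (the "sex" values here are hypothetical), both templates are meant to expand to the same skos:Collection URI, just from different inputs:

(require '[clojure.string :as str])

;; In the components input the row gives us component_slug;
;; in the observations input only the csvw column _name is available.
(= (str/replace "http://example.net/data/population/codes-used/{component_slug}"
                "{component_slug}" "sex")
   (str/replace "http://example.net/data/population/codes-used/{_name}"
                "{_name}" "sex"))
;; => true, i.e. "http://example.net/data/population/codes-used/sex" in both cases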

The observation-uri is built dynamically by table2qb by combining a prefix with the names of each of the dimensions (separated with /). A configurable template would allow another column to serve as a primary key (although we'd need a way to specify in the column configuration that a column be suppressed/ignored and not treated as a component). Where the user had no preference, this configuration would be long-winded and error-prone. If we allow the dimension parts to follow the existing convention we could configure the prefix only, requiring e.g. observation-uri-prefix "http://example.net/data/population/" instead of the base-uri "http://example.net" and dataset-slug "population" parameters and the domain-data convention (i.e. the "{base-uri}/data/" template).
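
As a sketch of the shape of that convention (function and variable names invented; not table2qb's actual code):

(require '[clojure.string :as str])

;; Build a csvw-style observation URI template from a prefix and the
;; dimension column names; this then expands once per observation row.
(defn observation-uri-template [prefix dimension-names]
  (str prefix (str/join "/" (map #(str "{" % "}") dimension-names))))

(observation-uri-template "http://example.net/data/population/"
                          ["year" "sex" "age_group"])
;; => "http://example.net/data/population/{year}/{sex}/{age_group}"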

Moreover, replacing conventions with configuration like this makes the pipeline's signature very verbose (potentially 11 arguments). Requiring that much configuration up-front would present an obstacle to new users.

If we relax the requirement that the patterns be completely configurable, we might be able to require just a single dataset-uri "http://example.net/data/population" parameter, to which suffixes for codes-used lists, observations, component specifications, and data-structure definitions are appended by hard-coded convention.
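
In other words, something like the following sketch, with the suffixes hard-coded to match the conventions above (key and function names invented):

;; Derive the related URIs from a single dataset-uri by convention.
(defn conventional-uris [dataset-uri]
  {:dataset-uri        dataset-uri
   :dsd-uri            (str dataset-uri "/structure")
   :component-spec-uri (str dataset-uri "/{component_slug}")
   :codes-used-uri     (str dataset-uri "/codes-used/{component_slug}")
   :observation-prefix (str dataset-uri "/")})

(conventional-uris "http://example.net/data/population")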

PR #103 covers most of the above.

The introduction of edn files for configuration obviates the need to massively increase the number of pipeline arguments.

At this point the following URIs are customisable (an illustrative EDN sketch follows this list):

  • codelists: codelist-uri, code-uri, and parent-uri
  • components: component-uri, ontology-uri and component-class-uri (which we're not really using properly yet, referred to as "Range" above)
  • cube: dataset-uri, dsd-uri, component-specification-uri, used-codes-codelist-uri and used-codes-code-uri
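
To illustrate why an EDN file helps here (the keys, grouping and the used-codes-code template below are invented for the example, not the configuration format introduced by #103), the URI settings can live in one map rather than a dozen pipeline arguments:

;; uri-config.edn (hypothetical)
{:codelist   {:codelist-uri "http://example.net/def/concept-scheme/age-groups"
              :code-uri     "http://example.net/def/concept/age-groups/{notation}"
              :parent-uri   "http://example.net/def/concept/age-groups/{parent_notation}"}
 :components {:ontology-uri  "http://example.net/ontology/components"
              :component-uri "http://example.net/def/{component_type_slug}/{notation}"}
 :cube       {:dataset-uri                 "http://example.net/data/population"
              :dsd-uri                     "http://example.net/data/population/structure"
              :component-specification-uri "http://example.net/data/population/{component_slug}"
              :used-codes-codelist-uri     "http://example.net/data/population/codes-used/{component_slug}"
              ;; value pattern guessed for illustration
              :used-codes-code-uri         "http://example.net/data/population/codes-used/{component_slug}/{notation}"}}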

The obvious exception is observation-uri. This isn't as urgent a requirement; indeed the only use case we have for it so far is to allow observation URIs to be specified with hashes. We could either do this within table2qb (e.g. by deriving an observation-guid field from the dimension properties and dimension values) or allow users to provide it in a non-component/non-value column (with #124). As noted above, a hand-customised observation-uri will be tricky to configure.
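
A sketch of the first option (the field name, hashing choice, and input shape are all illustrative assumptions):

(require '[clojure.string :as str])
(import 'java.util.UUID)

;; Derive a stable observation-guid from the dimension properties and values,
;; e.g. for hash-style observation URIs like <dataset-uri>#<guid>.
(defn observation-guid [dimension-values]
  ;; dimension-values: map of dimension property URI -> cell value
  (-> (str/join "/" (map (fn [[p v]] (str p "=" v)) (sort dimension-values)))
      (.getBytes "UTF-8")
      UUID/nameUUIDFromBytes
      str))

(observation-guid {"http://example.net/def/dimension/year" "2018"
                   "http://example.net/def/dimension/sex"  "male"})
;; => a deterministic UUID string for this combination of dimension values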

I've created a new issue for the observation-uri customisation (#125) so this issue may be closed when #103 is merged.