Swirrl / table2qb

A generic pipeline for converting tabular data into RDF data cubes

Make dataset, code and codelist URIs configurable

Robsteranium opened this issue

The codelist pipeline establishes a URI pattern for codes (and for the concept-scheme URI itself) by slugising the label assigned to the codelist. This convention suits us but is a little magic. We could instead have the pattern passed to the pipeline as an optional parameter, defaulting back to a slug derived from the label.
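
For concreteness, slugising here means something like the following Clojure sketch (illustrative only, not table2qb's actual implementation):

(require '[clojure.string :as str])

;; Lower-case the label, replace runs of non-alphanumeric characters with
;; hyphens, and trim any stray leading/trailing hyphens.
(defn slugise [label]
  (-> label
      str/lower-case
      (str/replace #"[^a-z0-9]+" "-")
      (str/replace #"(^-|-$)" "")))

(slugise "Age Group") ;; => "age-group"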

The cube pipeline should also allow for configuration of dataset-uris (this convention would then need to be applied to related resources such as the DSD and CompSpecs).

As per #71 users would like to be able to configure the whole of the code URI.

As per #81 I'm extending this to cover dataset-uris too.

For the codelist-pipeline this would mean changing the signature of codelist-metadata - removing domain-def and codelist_slug - leading to something like the following (to recreate the existing conventions):

csv_url "age-groups.csv"
codelist_name "Age Group Codelist"
codelist_uri_template "http://example.net/def/concept-scheme/age-groups"
code_uri_template "http://example.net/def/concept/age-groups/{notation}"
parent_uri_template "http://example.net/def/concept/age-groups/{parent_notation}"
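
For illustration, the {notation} and {parent_notation} placeholders would be filled from each row of the codes CSV, roughly along these lines (a sketch only; the function name is invented and URI-encoding of values is omitted):

(require '[clojure.string :as str])

;; Fill {placeholder} slots in a URI template from a map of row values.
(defn expand-template [template row]
  (str/replace template #"\{([^}]+)\}"
               (fn [[_ k]] (str (get row k "")))))

(expand-template "http://example.net/def/concept/age-groups/{notation}"
                 {"notation" "0-15"})
;; => "http://example.net/def/concept/age-groups/0-15"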

For consistency we should also change the components pipeline. Instead of having components-metadata take a domain-def param, we would configure it as follows:

ontology-uri "http://example.net/ontology/components"
component-uri-template "http://example.net/def/{component_type_slug}/{notation}"

We could then require some further fields to be provided explicitly as columns in the source component-csv (see the sketch after this list):

  • a "Range" column for class URIs for the rdfs:range property (or removing this entirely as we haven't yet wired this up with the codelist-pipeline)
  • a "Notation" column instead of having table2qb derive this by slugging the label

Removing conventions from the cube-pipeline would be trickier.

A component-specification-template would be simple (e.g. set to "http://example.net/data/population/{component_slug}" by convention), as would dataset-uri "http://example.net/data/population" and dataset-structure-definition-uri "http://example.net/data/population/structure".

Two other URIs are not as straightforward...

The code-used-uri requires two templates: one in the context of the pipeline that creates the skos:Collections themselves (e.g. "http://example.net/data/population/codes-used/{component_slug}") and another in the context of the pipeline that populates the codes from the cube margins ("http://example.net/data/population/codes-used/{_name}"). These should resolve to the same URI; the templates differ because in one case the input file has one row per component and in the other one row per observation, so the variables available to populate the template differ. This is particularly unsatisfactory because the need for two templates arises from an implementation detail that users won't know about and really shouldn't have to worry about. Indeed many users will not even know or care about the http://publishmydata.com/def/qb/codesUsed property, as this is a non-standard extension to the rdf-cube vocabulary created by Swirrl. We could remove the code-used pipeline from the table2qb library and extract it to e.g. a pmd-table2qb application.
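
To make the mismatch concrete (the "sex" values here are hypothetical), both templates are meant to expand to the same skos:Collection URI, just from different inputs:

(require '[clojure.string :as str])

;; In the components input the row gives us component_slug;
;; in the observations input only the csvw column _name is available.
(= (str/replace "http://example.net/data/population/codes-used/{component_slug}"
                "{component_slug}" "sex")
   (str/replace "http://example.net/data/population/codes-used/{_name}"
                "{_name}" "sex"))
;; => true, i.e. "http://example.net/data/population/codes-used/sex" in both cases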

The observation-uri is built dynamically by table2qb by combining a prefix with the names of each of the dimensions (separated with /). A configurable template would allow another column to serve as a primary key (although we'd need a way to specify in the column configuration that a column be suppressed/ignored and not treated as a component). Where the user had no preference, this configuration would be long-winded and error-prone. If we allow the dimension parts to follow the existing convention we could configure the prefix only, requiring e.g. observation-uri-prefix "http://example.net/data/population/" instead of the base-uri "http://example.net" and dataset-slug "population" parameters and the domain-data convention (i.e. the "{base-uri}/data/" template).
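
As a sketch of the shape of that convention (function and variable names invented; not table2qb's actual code):

(require '[clojure.string :as str])

;; Build a csvw-style observation URI template from a prefix and the
;; dimension column names; this then expands once per observation row.
(defn observation-uri-template [prefix dimension-names]
  (str prefix (str/join "/" (map #(str "{" % "}") dimension-names))))

(observation-uri-template "http://example.net/data/population/"
                          ["year" "sex" "age_group"])
;; => "http://example.net/data/population/{year}/{sex}/{age_group}"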

Moreover, replacing conventions with configuration like this makes the pipeline's signature very verbose (potentially 11 arguments). Requiring that much configuration up-front would present an obstacle to new users.

If we relax the requirement that the patterns be completely configurable, we might be able to require just a single dataset-uri "http://example.net/data/population" parameter, to which suffixes for codes-used lists, observations, component specifications, and data-structure definitions are appended by hard-coded convention.
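
In other words, something like the following sketch, with the suffixes hard-coded to match the conventions above (key and function names invented):

;; Derive the related URIs from a single dataset-uri by convention.
(defn conventional-uris [dataset-uri]
  {:dataset-uri        dataset-uri
   :dsd-uri            (str dataset-uri "/structure")
   :component-spec-uri (str dataset-uri "/{component_slug}")
   :codes-used-uri     (str dataset-uri "/codes-used/{component_slug}")
   :observation-prefix (str dataset-uri "/")})

(conventional-uris "http://example.net/data/population")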

PR #103 covers most of the above.

The introduction of edn files for configuration obviates the need to massively increase the number of pipeline arguments.

At this point the following URIs are customisable (an illustrative EDN sketch follows this list):

  • codelists: codelist-uri, code-uri, and parent-uri
  • components: component-uri, ontology-uri and component-class-uri (which we're not really using properly yet, referred to as "Range" above)
  • cube: dataset-uri, dsd-uri, component-specification-uri, used-codes-codelist-uri and used-codes-code-uri
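
To illustrate why an EDN file helps here (the keys, grouping and the used-codes-code template below are invented for the example, not the configuration format introduced by #103), the URI settings can live in one map rather than a dozen pipeline arguments:

;; uri-config.edn (hypothetical)
{:codelist   {:codelist-uri "http://example.net/def/concept-scheme/age-groups"
              :code-uri     "http://example.net/def/concept/age-groups/{notation}"
              :parent-uri   "http://example.net/def/concept/age-groups/{parent_notation}"}
 :components {:ontology-uri  "http://example.net/ontology/components"
              :component-uri "http://example.net/def/{component_type_slug}/{notation}"}
 :cube       {:dataset-uri                 "http://example.net/data/population"
              :dsd-uri                     "http://example.net/data/population/structure"
              :component-specification-uri "http://example.net/data/population/{component_slug}"
              :used-codes-codelist-uri     "http://example.net/data/population/codes-used/{component_slug}"
              ;; value pattern guessed for illustration
              :used-codes-code-uri         "http://example.net/data/population/codes-used/{component_slug}/{notation}"}}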

The obvious exception is observation-uri. This isn't as urgent a requirement; indeed the only use case we have for it so far is to allow observation URIs to be specified with hashes. We could either do this within table2qb (e.g. by deriving an observation-guid field from the dimension properties and dimension values) or allow users to provide it in a non-component/non-value column (with #124). As noted above, a hand-customised observation-uri will be tricky to configure.
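
A sketch of the first option (the field name, hashing choice, and input shape are all illustrative assumptions):

(require '[clojure.string :as str])
(import 'java.util.UUID)

;; Derive a stable observation-guid from the dimension properties and values,
;; e.g. for hash-style observation URIs like <dataset-uri>#<guid>.
(defn observation-guid [dimension-values]
  ;; dimension-values: map of dimension property URI -> cell value
  (-> (str/join "/" (map (fn [[p v]] (str p "=" v)) (sort dimension-values)))
      (.getBytes "UTF-8")
      UUID/nameUUIDFromBytes
      str))

(observation-guid {"http://example.net/def/dimension/year" "2018"
                   "http://example.net/def/dimension/sex"  "male"})
;; => a deterministic UUID string for this combination of dimension values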

I've created a new issue for the observation-uri customisation (#125) so this issue may be closed when #103 is merged.