Swirrl / table2qb

A generic pipeline for converting tabular data into rdf data cubes

Creating multiple URIs from multi-valued cells

Robsteranium opened this issue

It would be useful to be able to say, for example, that a component is associated with multiple codelists.

:age a qb:DimensionProperty;
  qb:codeList :age-in-years, :age-in-bands .

To specify this it would be convenient to have a multi-valued cell in the codelists column of the components pipeline:

Label, Component Type, Codelist
Age, Dimension, http://example.com/age-in-years;http://example.com/age-in-bands

The W3C tabular data model allows multi-valued cells, using a separator column annotation (i.e. ";" in the above example).

This works fine for outputting e.g. multiple string values, but not for URIs. In this case, we want to output a URI, so we're using a valueUrl annotation of "{+codelist}". This causes csv2rdf to treat the cell's string value as a URI (without escaping its content). The problem arises because the multiple values of the cell are passed to a single expansion of the URI template as (presumably) a list, rather than each being passed independently for expansion.

Thus, if we modify the schema to add a separator as follows:

      {
        "name": "codelist",
        "titles": "codelist",
        "propertyUrl": "qb:codeList",
        "separator": ";",
        "valueUrl": "{+codelist}"
      }

and pass the example CSV above as input, we get a single URI with a comma in it:

:age a qb:DimensionProperty;
  qb:codeList <http://example.com/age-in-years,http://example.com/age-in-bands> .

As per the URI Template specification, RFC 6570:

Multiple variables and list values have their values joined with "," if there is no predefined joining mechanism for the operator.

The tabular data model example 13 demonstrates why you might want this behaviour.
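
To make the joining behaviour concrete, here is a minimal Python sketch of reserved expansion for a single `{+var}` expression. It is not how csv2rdf implements template expansion; `expand_reserved` is a hypothetical helper written only for illustration, and it simplifies RFC 6570 (for example, it does not pass existing percent-encoded triplets through untouched).

```python
from urllib.parse import quote

# Characters the "+" operator leaves unescaped beyond letters, digits and "-._~"
# (the RFC 3986 reserved set). quote() never escapes the unreserved characters.
RESERVED_SAFE = ":/?#[]@!$&'()*+,;="


def expand_reserved(value):
    """Expand one `{+var}` expression for a string or a list of strings."""
    values = value if isinstance(value, list) else [value]
    # List members are joined with "," because "+" defines no other joiner.
    return ",".join(quote(v, safe=RESERVED_SAFE) for v in values)


cell = "http://example.com/age-in-years;http://example.com/age-in-bands"
codelists = cell.split(";")  # what the separator annotation yields

joined = expand_reserved(codelists)
print(joined)
```

Passing the whole list to one expansion reproduces the comma-joined URI shown in the output above.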

It's not immediately obvious to me how to resolve this. There may be some way we can specify that each cell value needs to get its own URI, either through the CSVW metadata, or with the right URI template.

If it's not possible, we may need to change the implementation of csv2rdf (e.g. by introducing a new annotation that declares an alternate processing method) or perhaps create a new implementation of the template expander to support a different syntax.
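
As a rough sketch of what such an alternate processing method might do, the Python below splits the cell on the separator and treats each value as its own object URI. The function names are hypothetical and this is not csv2rdf code; it also assumes each value is already a valid URI, so expanding "{+codelist}" per value would return the value unchanged.

```python
def cell_to_objects(cell, separator=";"):
    """Split a multi-valued cell into individual URI strings, dropping blanks."""
    return [v.strip() for v in cell.split(separator) if v.strip()]


def to_turtle(subject, predicate, objects):
    """Render one subject and predicate with a comma-separated object list."""
    object_list = ", ".join(f"<{o}>" for o in objects)
    return f"{subject} {predicate} {object_list} ."


cell = "http://example.com/age-in-years;http://example.com/age-in-bands"
statement = to_turtle(":age", "qb:codeList", cell_to_objects(cell))
print(statement)
```

Each value becomes a separate object of the same property, which is the one-to-many link the issue asks for.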

As a workaround, an alternative would be to duplicate the row specifying the component, changing only the codelist, but this is of course undesirable:

Label, Component Type, Codelist
Age, Dimension, http://example.com/age-in-years
Age, Dimension, http://example.com/age-in-bands

Just checking I understand the issue...

So, broadly, the issue is that we want a one-to-many link between dimension and codeList. The W3C spec allows for csv2rdf to parse multiple values out of a cell, but it always assumes all those values will be put into a single output value (a URI in this case).

Is this right?

> As a workaround, an alternative would be to duplicate the row specifying the component, changing only the codelist, but this is of course undesirable:

Can you explain why the workaround is insufficient? It makes sense to me. Is it because we assume there's a single row per component at the minute? Is it just some theoretical purity we're breaking around tidiness (one row per component), or is it fundamentally a problem?

Your understanding is correct @RickMoynihan, yes. At least that's what I've seen so far. There may be another way to configure it such that the multiple values lead to multiple object-URIs.

Indeed, the workaround would mean the input was no longer a tidy one-component-per-row table. Instead it would be one component-codelist pair per row. Practically, this means that updates to rows would need to be coordinated. Typically what will happen is that someone will update e.g. the description field in one row only, and we'll have a component with both the old and new description.
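
One way to limit that coordination risk, offered only as a sketch, is to keep the tidy multi-valued file as the source of truth and generate the duplicated rows mechanically before running the pipeline. The `explode_rows` helper below is hypothetical and not part of table2qb; it assumes the column is titled "Codelist" and uses ";" as the separator.

```python
import csv
import io


def explode_rows(tidy_csv, column="Codelist", separator=";"):
    """Yield one row per value of `column`, copying the other cells unchanged."""
    reader = csv.DictReader(io.StringIO(tidy_csv), skipinitialspace=True)
    for row in reader:
        # An empty cell still yields one row, so components without a
        # codelist are preserved.
        for value in row[column].split(separator):
            yield {**row, column: value.strip()}


tidy = (
    "Label,Component Type,Codelist\n"
    "Age,Dimension,http://example.com/age-in-years;http://example.com/age-in-bands\n"
)
rows = list(explode_rows(tidy))
for row in rows:
    print(row)
```

Because the duplicated rows are derived rather than hand-edited, a change to a shared field such as the label only needs to be made once.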

The CLARIAH Cow project demonstrates another way to resolve this: under their re-interpretation of the specification, a datatype of xsd:anyURI is converted into a URI reference instead of a typed literal, so you could use a schema like:

      {
        "name": "codelist",
        "titles": "codelist",
        "propertyUrl": "qb:codeList",
        "datatype": "xsd:anyURI",
        "separator": ";"
      }

I feel that their interpretation is technically incorrect, though, as an xsd:anyURI isn't an RDF resource AFAICR.
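
The difference between the two readings can be seen in how the object term would be written in N-Triples. The Python below is only an illustration with hypothetical helper names, and it assumes the value contains no characters that would need escaping.

```python
XSD_ANYURI = "http://www.w3.org/2001/XMLSchema#anyURI"


def as_typed_literal(value):
    """Object term as a literal whose datatype is xsd:anyURI."""
    return f'"{value}"^^<{XSD_ANYURI}>'


def as_iri(value):
    """Object term as an IRI reference, i.e. a resource that can be linked to."""
    return f"<{value}>"


codelist = "http://example.com/age-in-years"
print(as_typed_literal(codelist))
print(as_iri(codelist))
```

Only the second form identifies a resource that other triples can describe; the first is a string-like value that happens to look like a URI.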