Swirrl / table2qb

A generic pipeline for converting tabular data into rdf data cubes

--column-config is misleadingly named

RickMoynihan opened this issue

--column-config is misleadingly named, as not everything in the file is a column; it also specifies measures (which occur in row data) for the measure type dimension.

I think we should consider deprecating this flag (still support it for legacy users) and either:

  1. Add a new flag named --component-config.
  2. Keep the name --column-config and require a new extra option --measures-config for a file containing the measures.

I think 1 is my preferred option, though I also wonder why measures aren't specified with components.

Which rows in the column-config file are not columns?

The name comes from the idea of a csvw:column; these are not necessarily components (e.g. the value column). The measures appear in rows because they are dimension values for the measure-type dimension. They are also columns, but we don't require this explicitly because we can infer it from the measure type, allowing users to specify a single value column. We could do away with the value column and instead have a column for each measure (similar to the plans for supporting multi-measure observations #23).

Introducing an unrelated --component-config option might become confusing for those using the components-pipeline.

Which rows in the column-config file are not columns?

It may be that the example I've seen over someone's shoulder (i.e. not in a repo yet) wasn't quite right; but e.g. here I see something similar:

Count is not a column in the input.csv, it's a dimension value for the Measure Type column; so it's treated differently to Gender even though it's just another Dimension.
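To make the distinction concrete, here is a minimal sketch in plain Python (it does not use table2qb). The CSV content, the "Area" column and the list of configured titles are made up for illustration; only the "Gender", "Measure Type", "Value" and "Count" names come from the example discussed here. It sorts configured titles into those that match an input header and those that only appear as values in the Measure Type column.

```python
import csv
import io

# Hypothetical input, loosely modelled on the input.csv discussed above.
INPUT_CSV = """Area,Gender,Measure Type,Value
Area A,Female,Count,120
Area A,Male,Count,115
"""

# Hypothetical titles that a column configuration file might declare.
CONFIG_TITLES = ["Area", "Gender", "Measure Type", "Value", "Count"]


def classify_titles(input_csv, config_titles, measure_type_header="Measure Type"):
    """Label each configured title as an input header or a measure-type value."""
    rows = list(csv.DictReader(io.StringIO(input_csv)))
    headers = set(rows[0].keys()) if rows else set()
    measure_values = {row[measure_type_header] for row in rows}
    result = {}
    for title in config_titles:
        if title in headers:
            result[title] = "column header"
        elif title in measure_values:
            result[title] = "measure-type value"
        else:
            result[title] = "unused"
    return result


classification = classify_titles(INPUT_CSV, CONFIG_TITLES)
for title, kind in classification.items():
    print(f"{title}: {kind}")
```

In this sketch Gender is reported as a column header, while Count is found only among the values of the Measure Type column, which is the asymmetry being described.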

The name comes from the idea of a csvw:column

Ahh ok, that makes sense. I assumed it was because most (but not all) of the rows in that file map one-to-one with columns in the input, while the measures behave differently. It seems strange that the measureType dimension is special-cased relative to other dimensions.

It's not a big deal because what we have works great, and I understand we inherit this complexity from the qb spec; but I think it is a point of confusion.

I could also be misunderstanding things here as my exposure to table2qb at the moment has just been via other people, rather than using it directly myself. So please take this issue with a big pinch of salt. It's mostly just to note a point of confusion, and raise the question of whether we can do anything to resolve this or make it clearer somehow. Will ponder some more... 🤔

The components.csv you linked is an input for the components-pipeline (which creates the qb:ComponentPropertys, in this case a :count a qb:MeasureProperty). This is different from the columns.csv input which configures the cube-pipeline (more on difference) such that it knows to interpret the value "Count" in the "Measure Type" column in the input.csv you linked.

This isn't particularly clear, especially since we've now moved columns.csv from resources to test/resources as it's now a run-time argument not init-time config. We ought to add it back into the examples directory (perhaps using that instead of test/resources in the tests). Indeed we might want to offer a default version for some sdmx components etc in resources.

The need to treat measures differently arises from them being denormalised in the qb model (they are the only components that are). A more normalised representation might keep the measure type dimension property but have only a single, generic measure property (e.g. sdmx:obsValue) instead of making the measure-property component vary by observation. This part of the spec is at least consistent with its multi-measure approach and easier for single-measure cubes.
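As a rough illustration of the two shapes, the sketch below builds the statements for a single observation as plain Python tuples. The observation identifier and the ":count" name are placeholders, and the function names are hypothetical; this is not how table2qb represents its output.

```python
def denormalised(obs_id, measure, value):
    """qb-style shape: the property carrying the value varies with the measure."""
    return [
        (obs_id, "qb:measureType", measure),
        (obs_id, measure, value),
    ]


def normalised(obs_id, measure, value, generic_measure="sdmx:obsValue"):
    """Alternative shape: one generic measure property carries every value."""
    return [
        (obs_id, "qb:measureType", measure),
        (obs_id, generic_measure, value),
    ]


qb_style = denormalised("obs1", ":count", 120)
generic_style = normalised("obs1", ":count", 120)

for statement in qb_style + generic_style:
    print(statement)
```

In the first shape the measure name appears twice, once as the value of qb:measureType and once as the predicate holding the number, which is why the cube-pipeline has to know about each measure.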

You're right, the documentation ought to be improved and examples updated to match the way the code works at the moment.

We've since taken some steps to clarify how table2qb inputs are declared.

First, there's some improved documentation which should help to make everything a bit clearer.

More substantively, we've added support for using measures in columns. This only works for multi-measure observations at the moment. The specification in issue #101 would extend this to measures-dimension observations, allowing users to use measure components as columns instead of the magic Value column. As per that thread, this would remove a source of confusion at the expense of verbosity, so we might continue to support both the clearer measures-columns declaration and the more succinct values-column declaration going forward.
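The relationship between the two declarations can be sketched in plain Python as a reshaping step. The function name, the "Ratio" measure and the sample row are assumptions made for illustration; this does not reflect table2qb's implementation.

```python
def to_measure_dimension(rows, measure_columns):
    """Reshape one-column-per-measure rows into Measure Type / Value rows."""
    long_rows = []
    for row in rows:
        # Everything that is not a measure column is treated as a dimension.
        dimensions = {k: v for k, v in row.items() if k not in measure_columns}
        for measure in measure_columns:
            long_rows.append(
                {**dimensions, "Measure Type": measure, "Value": row[measure]}
            )
    return long_rows


# Hypothetical row with a column per measure.
wide_rows = [{"Gender": "Female", "Count": 120, "Ratio": 0.51}]
long_rows = to_measure_dimension(wide_rows, ["Count", "Ratio"])

for row in long_rows:
    print(row)
```

The wide form names each measure in the header, which is clearer but needs a column per measure; the long form needs only the Measure Type and Value columns, which is more compact but hides the measures in the row data.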

I'm therefore going to close this issue now. Thanks for the report. If it's still confusing, then perhaps we should add some more details to the docs or examples.