cube-pipeline ignores + character when generating code list URIs
jennet opened this issue
Not sure if the problem is in the value template tags or something else, but if the code list has a concept with a URI such as http://example.org/def/concept/age/100+ and the input CSV has the value "100+" (which is correctly set up in columns.csv to convert to http://example.org/def/concept/age/{age}), the pipeline generates http://example.org/def/concept/age/100, losing the `+`, which then results in a broken link.
I think this is potentially related to (though not the same as) your whitespace issues too: `+` is unfortunately another encoding for ` ` (space), so it might be that it is either being stripped for this reason, or it is being URL-decoded into http://example.org/def/concept/age/100 followed by a trailing space and then trimmed somewhere else. See https://stackoverflow.com/questions/2678551/when-to-encode-space-to-plus-or-20
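The two meanings of `+` can be seen directly with the JVM's form-encoding helpers (a quick REPL illustration, not pipeline code):

```clojure
(import 'java.net.URLEncoder 'java.net.URLDecoder)

;; In form encoding, a literal "+" must be written as %2B ...
(URLEncoder/encode "100+" "UTF-8")
;; => "100%2B"

;; ... because a bare "+" decodes back to a space, which a later
;; trim would then silently remove:
(URLDecoder/decode "100+" "UTF-8")
;; => "100 "   (note the trailing space)
```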
Not sure what to do about it; it will depend on the case and is likely to be subtle. I think the best bet is possibly to change our slugify implementation to convert `+` into something like `plus`, but that has problems too.
In csv2rdf.uri-template the following happens:

```clojure
(expand-template (parse-template "http://example.org/def/concept/age/{age}")
                 {:age "100+"})
;; => #object[java.net.URI 0x54f684ed "http://example.org/def/concept/age/100%2B"]
```
@jennet ok, as I think we suspected, this appears to be caused by table2qb's slugize. I don't know exactly what code path is being used in your case (as I don't have your pipeline/config etc. available), but I suspect you're somehow calling slugize, which I think will end up using this implementation:
Essentially, `+`s get replaced with `-`s, then any trailing `-`s are stripped off. The regex responsible is this line:

```clojure
(-> string ,,, (clojure.string/replace #"[^\w/]" "-"))
```

where `[^\w/]` matches any character that is neither a word character nor a `/`, which includes `+`.
so:

```clojure
(slugize "100+") ;;=> "100"
```
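For reference, here is a minimal reconstruction of that behaviour (a hypothetical sketch based on the description in this thread, not the actual table2qb source):

```clojure
(require '[clojure.string :as str])

(defn slugize
  "Hypothetical reconstruction of the slugize behaviour described
  above, NOT the real table2qb implementation."
  [s]
  (-> s
      (str/replace #"[^\w/]" "-") ;; "+" is not a word character, so it becomes "-"
      (str/replace #"-+$" "")))   ;; trailing "-"s are then stripped

(slugize "100+") ;; => "100"
```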
Strictly speaking I don't know for certain that this is the cause of your problem, but if this code were being used it would cause your issue, so I think it's highly likely to be the root cause.
Potential fixes:

- Do as Bill suggests and change the input to be `100 and over`, which will slugize better.
- Introduce a different slugizer that can be configured on that column and doesn't replace `+` with `-`.
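Option 2 could be as simple as a variant whose character whitelist also allows `+` (a hypothetical sketch; `slugize-keep-plus` is not an existing table2qb function):

```clojure
(require '[clojure.string :as str])

(defn slugize-keep-plus
  "Hypothetical slugize variant for option 2: the same replacement
  logic, but with + added to the allowed character class."
  [s]
  (-> s
      (str/replace #"[^\w/+]" "-") ;; "+" is now kept as-is
      (str/replace #"-+$" "")))

(slugize-keep-plus "100+") ;; => "100+"
```

The remaining work would be wiring a per-column choice of slugizer into the pipeline configuration.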
@RickMoynihan thanks for looking into this. I think option 1 is the safest for now, until we have a more thorough development roadmap in place.
Agreed. Option 2 is probably pretty easy to add, but we might as well work around it with option 1 at this stage.
I feel like we did this deliberately for sns-graft, but I forget why.
Given that a `+` is a valid part of a URI path, we don't need to strip it out. Tbh, the slugize action is probably a bit aggressive in rejecting non-word characters anyway. Perhaps there's a version available somewhere, written by someone who's been through all the RFCs!
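That `+` is legal in a URI path can be checked against java.net.URI, which parses it without complaint (RFC 3986 lists `+` among the sub-delims permitted in path segments):

```clojure
;; A "+" in the path needs no escaping as far as URI syntax goes:
(.getPath (java.net.URI. "http://example.org/def/concept/age/100+"))
;; => "/def/concept/age/100+"
```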