Swirrl / table2qb

A generic pipeline for converting tabular data into rdf data cubes

cube-pipeline ignores + character when generating code list URIs

jennet opened this issue · comments

Not sure if the problem is in the value-template tags or something else, but if the code list has a concept with a URI like http://example.org/def/concept/age/100+ and the input CSV has the value "100+" (which is correctly set up in columns.csv to convert via http://example.org/def/concept/age/{age}), it generates http://example.org/def/concept/age/100, losing the + and resulting in a broken link.

I think this is potentially related to (though not the same as) your whitespace issues too:

#111
#113

+ is unfortunately also an encoding for a space (in form-encoded URLs), so it may be that it is being stripped for that reason, or that it is being URL-decoded into http://example.org/def/concept/age/100 followed by a space, and then trimmed somewhere else.

https://stackoverflow.com/questions/2678551/when-to-encode-space-to-plus-or-20
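To illustrate the ambiguity (a quick sketch in Python rather than Clojure, using only the standard library): a literal + survives a path-style decode but becomes a space under a form-style decode, which is exactly the kind of mismatch that could make it disappear after a trim.

```python
from urllib.parse import quote, unquote, unquote_plus

# Encoding "100+" for use in a URI path: + must become %2B
quote("100+", safe="")   # -> "100%2B"

# Path-style decoding leaves a literal + alone...
unquote("100+")          # -> "100+"

# ...but form-style (application/x-www-form-urlencoded) decoding
# turns + into a space, which a later trim would silently remove.
unquote_plus("100+")     # -> "100 "
```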

Not sure what to do about it; the right fix will depend on the context and is likely to be subtle. I think our best bet is possibly to change our slugify implementation to convert + into something like plus -- but that has problems too.

In csv2rdf.uri-template the following happens:

(expand-template (parse-template "http://example.org/def/concept/age/{age}") {:age "100+"}) ;; => #object[java.net.URI 0x54f684ed "http://example.org/def/concept/age/100%2B"]
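For comparison, the same kind of RFC 6570-style simple expansion can be approximated in Python with just the standard library (a hand-rolled sketch, not csv2rdf's actual code): percent-encoding the value before substitution preserves the + as %2B, so the template-expansion layer isn't the culprit.

```python
from urllib.parse import quote

def expand_simple(template: str, **values) -> str:
    # Crude stand-in for RFC 6570 simple string expansion:
    # percent-encode each value (reserved characters included),
    # then substitute it into the {name} placeholder.
    out = template
    for name, value in values.items():
        out = out.replace("{" + name + "}", quote(value, safe=""))
    return out

expand_simple("http://example.org/def/concept/age/{age}", age="100+")
# -> "http://example.org/def/concept/age/100%2B"
```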

@jennet OK, as I think we suspected, this appears to be caused by table2qb's slugize.

I don't know exactly what code path is being used in your case (as I don't have your pipeline/config etc. available), but I suspect you're somehow calling slugize, which I think will end up using this implementation:

https://github.com/Swirrl/grafter-extra/blob/f4422fde5a8413be5313fb2bfdc3c9751a422be0/src/grafter/extra/cell/uri.clj#L5-L13

Essentially, +s get replaced with -s, and then any trailing -s are stripped off the slug. The regex responsible is on the line (-> string ,,, (clojure.string/replace #"[^\w/]" "-")), where [^\w/] matches any character that is not a word character or /, which includes +.

so:

(slugize "100+") ;;=> "100"

Strictly speaking I don't know for certain that this is the cause of your problem, but if this code is being used it would produce exactly your issue, so I think it's highly likely this is the root cause.
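To make the behaviour concrete, here's a rough Python approximation of the replace-then-strip logic described above (an illustrative sketch, not a port of the actual grafter-extra Clojure code):

```python
import re

def slugize_sketch(s: str) -> str:
    # Replace every character that is not a word character or "/"
    # with a hyphen -- this is where "+" is lost...
    s = re.sub(r"[^\w/]", "-", s)
    # ...then strip any trailing hyphens, erasing the evidence.
    return s.rstrip("-")

slugize_sketch("100+")  # -> "100"
```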

Potential fixes:

  1. Do as Bill suggests and change the input to 100 and over, which will slugize better.
  2. Introduce a different slugizer, configurable for that column, that doesn't replace + with -.

@RickMoynihan thanks for looking into this. I think option 1 is the safest for now, until we have a more thorough development roadmap in place.

Agreed. 2 is probably pretty easy to add, but might as well work around with 1 at this stage.

I feel like we did this deliberately for sns-graft, but I forget why.

Given that + is a valid character in a URI path, we don't need to strip it out. TBH, the slugize action is probably a bit aggressive in rejecting non-word characters anyway. Perhaps there's a version available somewhere written by someone who's been through all the RFCs!