Swirrl / table2qb

A generic pipeline for converting tabular data into rdf data cubes

Encoding

keeganmcbride opened this issue · comments

Almost every data set used in the Estonian example includes the characters ü, õ, ö and ä, but table2qb does not seem to handle them. Is it possible to support these characters? For many words, simply substituting u, o or a would change the meaning.

just a note that Keegan has sent me example files that illustrate the above problem - will attempt to diagnose

@BillSwirrl any updates on this?

Hi @keeganmcbride. I can't recreate the problem, sorry. If I use input csv files with UTF-8 encoding then the pipelines run fine and those characters are present and correct in the output (in both literals and URIs).
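One quick way to check whether an input file really is UTF-8 before running a pipeline is to try decoding its bytes strictly. The following is a minimal Python sketch and not part of table2qb; the function name `is_valid_utf8` is hypothetical.

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")  # strict mode: raises on any invalid byte sequence
    except UnicodeDecodeError:
        return False
    return True
```

A pure-ASCII file also passes, since ASCII is a subset of UTF-8. A pass only shows that the bytes decode without error, not that the file was intended as UTF-8.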

@BillSwirrl what's the encoding of the example files? Any luck in recreating this?

Hi @Robsteranium, here is the content of an example CSV file. As for the encoding, it should be standard UTF-8:

Day        Notation   Parent notation
Esmaspäev  esmaspäev
Kolmapäev  kolmapäev
Laupäev    laupäev
Neljapäev  neljapäev
Pühapäev   pühapäev
Reede      reede
Teisipäev  teisipäev

Output from table2qb created day.ttl which looked like:

http://example.gr/def/concept/day/esmasp�ev skos:notation "esmasp�ev" .

_:row165 http://www.w3.org/ns/csvw#describes http://example.gr/def/concept/day/esmasp�ev , http://example.gr/def/concept/day/esmasp�ev , http://example.gr/def/concept/day/esmasp�ev , http://example.gr/def/concept/day/esmasp�ev .

http://example.gr/def/concept/day/esmasp�ev skos:topConceptOf http://example.gr/def/concept-scheme/day .

_:row165 http://www.w3.org/ns/csvw#describes http://example.gr/def/concept-scheme/day .

http://example.gr/def/concept-scheme/day skos:hasTopConcept http://example.gr/def/concept/day/esmasp�ev .

Thanks @keeganmcbride. Could you please upload the exact CSV file you used for the example in the previous comment? (GitHub generally doesn't seem to accept CSV files as attachments, so you might have to zip it first.) Thanks!

thanks!

ah, sorry @keeganmcbride looks like that zip file above doesn't include the Days concept scheme with the non-ascii characters. Do you have the file with the days in?

day (3).zip
sorry about that, should be here.

got it this time - thanks. We'll report back

Thanks. That file is encoded as ISO-8859-1 rather than UTF-8. Please save it as UTF-8 and the pipeline will work fine. I've used:

iconv -f ISO8859-1 -t UTF-8 day.csv > day.utf8.csv
clojure -A:table2qb exec codelist-pipeline --codelist-csv day.utf8.csv --base-uri http://foobar.com --codelist-slug foo --codelist-name bar --output-file day.ttl

The result includes e.g. <http://foobar.comdef/concept/foo/esmaspäev> skos:notation "esmaspäev" . which looks correct to me.
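The � in the earlier output is consistent with ISO-8859-1 bytes being read as UTF-8. A small Python sketch of the effect follows; it is illustrative only and does not reproduce table2qb's own file handling.

```python
# "ä" is the single byte 0xE4 in ISO-8859-1, which is not valid UTF-8 on its own.
word = "esmaspäev"
latin1_bytes = word.encode("iso-8859-1")

# Decoding those bytes as UTF-8 replaces the invalid byte with U+FFFD (�).
misread = latin1_bytes.decode("utf-8", errors="replace")

# Decoding with the correct encoding recovers the original word.
recovered = latin1_bytes.decode("iso-8859-1")

print(ascii(misread))    # 'esmasp\ufffdev'
print(ascii(recovered))  # 'esmasp\xe4ev'
```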

Btw, I also note that the "columns.csv" file in the above "OGI Pilot Data.zip" has some problems. The URI templates must refer to variables exactly as they're declared in the name column, and the match is case-sensitive, e.g. the template should be something like http://example.gr/hello/def/concept/LINNAOSA3/{linnaosa} instead.
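A check along these lines can catch such mismatches before a pipeline runs. This is a rough Python sketch and not table2qb's own validation: the function name `undeclared_template_variables` is hypothetical, the example names are made up, and it only handles simple `{name}` expressions rather than the full URI Template syntax.

```python
import re


def undeclared_template_variables(template, declared_names):
    """Return template variables with no exact, case-sensitive match in declared_names."""
    variables = re.findall(r"\{([^{}]+)\}", template)
    declared = set(declared_names)
    return [v for v in variables if v not in declared]


# Hypothetical names, standing in for the "name" column of columns.csv.
names = ["linnaosa", "day"]

print(undeclared_template_variables(
    "http://example.gr/hello/def/concept/LINNAOSA3/{linnaosa}", names))  # []
print(undeclared_template_variables(
    "http://example.gr/hello/def/concept/LINNAOSA3/{Linnaosa}", names))  # ['Linnaosa']
```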

day-utf.csv.zip

Hi Keegan - I found it worked for me as well doing what Robin described. Here's a copy of the file after converting the character encoding. That iconv command works on Linux and macOS - not sure what the equivalent might be on Windows, but there is probably a tool somewhere to do the same thing.
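Where iconv is not available, a short Python script could do the same re-encoding. This is a sketch that assumes the source file is ISO-8859-1, as in the iconv command above; the function name `reencode_to_utf8` is hypothetical.

```python
def reencode_to_utf8(src_path, dst_path, src_encoding="iso-8859-1"):
    """Read src_path using src_encoding and write the same text to dst_path as UTF-8."""
    # newline="" leaves line endings untouched, so only the character encoding changes.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

Calling `reencode_to_utf8("day.csv", "day.utf8.csv")` would then correspond to the iconv command. Note that ISO-8859-1 assigns a character to every byte value, so decoding never fails: if the source encoding is guessed wrongly, the output is silently wrong rather than an error.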

There are also options when saving an Excel file as CSV to specify UTF-8 which (I think) generally work. That might be worth a try as well.

Character encodings are frequently a bit of a nightmare in general.

I guess this issue is now resolved so I'm going to mark it as closed.

If that's not the case, please feel free to re-open it.