LinkedDataFragments / Client.js

[DEPRECATED] A JavaScript client for Triple Pattern Fragments interfaces.

Home Page: http://linkeddatafragments.org/

UTF-8 is not supported?

migalkin opened this issue

I have a Fedbench query CD4:

SELECT ?actor ?news WHERE {
  ?film purl:title 'Tarzan' .
  ?film linkedMDB:actor ?actor .
  ?actor owl:sameAs ?x.
  ?y owl:sameAs ?x .
  ?y nytimes:topicPage ?news }

which has been rewritten to execute the following triple pattern against the LinkedMDB endpoint on an LDF server:

SELECT ?actor ?x WHERE { ?actor <http://www.w3.org/2002/07/owl#sameAs> ?x} LIMIT 100000 OFFSET 0

The Client throws the error:

WARNING TriplePatternIterator Unexpected "<http://dbpedia.org/resource/Espen_Skj\\u00C3\\u00B8nberg>," on line 47.
events.js:160
      throw er; // Unhandled 'error' event
      ^

Error: Unexpected "<http://dbpedia.org/resource/Espen_Skj\\u00C3\\u00B8nberg>," on line 47.
    at N3Lexer._syntaxError (/ldf_rest/node_modules/n3/lib/N3Lexer.js:358:12)
    at reportSyntaxError (/ldf_rest/node_modules/n3/lib/N3Lexer.js:325:54)
    at N3Lexer._tokenizeToEnd (/ldf_rest/node_modules/n3/lib/N3Lexer.js:311:18)
    at TrigFragmentIterator._parseData (/ldf_rest/node_modules/n3/lib/N3Lexer.js:393:16)
    at TrigFragmentIterator.TurtleFragmentIterator._transform (/ldf_rest/node_modules/ldf-client/lib/triple-pattern-fragments/TurtleFragmentIterator.js:47:8)
    at Immediate.readAndTransform (/ldf_rest/node_modules/asynciterator/asynciterator.js:959:12)
    at runCallback (timers.js:643:20)
    at tryOnImmediate (timers.js:610:5)
    at processImmediate [as _immediateCallback] (timers.js:582:5)

Does this mean that the LDF Client does not support UTF-8?

Hi @migalkin, no, it means that the dataset was wrongly encoded. Note that

<http://dbpedia.org/resource/Espen_Skj\\u00C3\\u00B8nberg>

is an invalid URI in Turtle syntax; it should be

<http://dbpedia.org/resource/Espen_Skj\u00C3\u00B8nberg>

My guess is that on the server side, you have used an HDT file to serve LinkedMDB? And that this HDT file was generated with rdf2hdt in.nt out.hdt rather than rdf2hdt -f turtle in.nt out.hdt? The -f turtle option is necessary, because the N-Triples parser is broken.
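As a side observation on the "wrongly encoded" point: the two escaped code points in that IRI, U+00C3 and U+00B8, are exactly what the UTF-8 bytes of "ø" (C3 B8) look like when they are read as Latin-1. The Python sketch below shows how such text can be reinterpreted. The function name is hypothetical, and whether the LinkedMDB dump was actually produced this way is an assumption, not something confirmed in this thread.

```python
def undo_latin1_mojibake(text):
    """Reinterpret text as the UTF-8 bytes it may have been decoded from.

    Assumes the text was produced by decoding UTF-8 bytes as Latin-1.
    If that assumption does not hold, the text is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not mojibake of this kind; leave as-is


# "\u00C3\u00B8" is the pair of code points seen in the failing IRI.
repaired = undo_latin1_mojibake("Espen_Skj\u00C3\u00B8nberg")
```

This only illustrates the character-level problem; it does not address the doubled backslashes, which are what make the IRI invalid for the Turtle parser.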

Thank you @RubenVerborgh
I used the -f turtle option and now the query works fine [and the HDT file is 20 times smaller =) ]

Excellent 😄

Do double-check whether all the triples you want are in there, though (i.e., hdtInfo out.hdt should show the correct total number of triples). When SERD encounters an error, the conversion process stops (most of the time with an error, but unfortunately sometimes without one).

@RubenVerborgh actually you are right: the broken NT parser produced an HDT file with all the triples from the LinkedMDB dump, but
rdf2hdt -f turtle linkedmdb.nt linkedmdb.hdt
results in only 160142 triples.

Here is what I do:

rdf2hdt -f turtle linkedmdb-latest-dump.nt linkedmdb-latest-dump.hdt            
RDF format: turtle
invalid IRI character `?' (escape %8B7E)essed.: 0 % / 0 %                      
invalid IRI character `?'00 K triples processed.: 0 % / 0 %                      
invalid IRI character `@' (escape %8B7E)ed.: 0 % / 50 %                      
invalid IRI character `@'K triples processed.: 0 % / 50 %                      
HDT Successfully generated.                                           
Total processing time: Clock(1 sec 836 ms 366 us)  User(1 sec 762 ms 380 us)  System(71 ms 897 us)

Then running hdtInfo:

<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#triples> "160142" .
<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#properties> "8" .
<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#distinctSubjects> "149209" .
<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#distinctObjects> "52182" 

The original linkedmdb dump has:

 wc -l linkedmdb-latest-dump.nt 
6148121 linkedmdb-latest-dump.nt
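The comparison between the hdtInfo output and the line count of the dump can be automated. The following Python sketch parses the void:triples value from hdtInfo-style text (in the format shown above) and compares it with the number of statements in an N-Triples file. All function names are hypothetical. It assumes one statement per line, and since duplicate statements in the dump could legitimately lower the HDT count, it reports a ratio instead of demanding equality.

```python
import re

# Matches the void:triples line in the hdtInfo output shown in this thread.
VOID_TRIPLES = re.compile(r'<http://rdfs\.org/ns/void#triples>\s+"(\d+)"')


def triples_reported(hdt_info_text):
    """Return the void:triples count from hdtInfo-style text, or None."""
    match = VOID_TRIPLES.search(hdt_info_text)
    return int(match.group(1)) if match else None


def statements_in(nt_lines):
    """Count non-blank, non-comment lines (assumes one statement per line)."""
    return sum(1 for line in nt_lines
               if line.strip() and not line.lstrip().startswith("#"))


def coverage(hdt_info_text, nt_lines):
    """Fraction of input statements that ended up in the HDT file, or None."""
    reported = triples_reported(hdt_info_text)
    expected = statements_in(nt_lines)
    if reported is None or expected == 0:
        return None
    return reported / expected
```

With the figures above, 160142 out of 6148121 lines is roughly 2.6%, which points to a truncated conversion and not to a handful of duplicates.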

The problem is that the HDT parser doesn't produce any error and reports that the file was created successfully.

Yes, I just fixed that in rdfhdt/hdt-cpp@d3b02a9

The solution is to ensure that the input file is valid, by passing it through a tool such as SERD first.
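As a rough pre-check before running a real validator, a dump can be scanned for IRIs that violate the IRIREF rule of the N-Triples grammar (no control characters or spaces, none of `<>"{}|^` or backtick, and backslashes only as part of \uXXXX or \UXXXXXXXX escapes). The Python sketch below is a heuristic with hypothetical names, not a replacement for SERD: it tokenizes each line left to right so that angle brackets inside string literals are skipped, and checks nothing beyond IRI bodies.

```python
import re

# Allowed IRI body per the N-Triples IRIREF production.
VALID_IRI_BODY = re.compile(
    r'(?:[^\x00-\x20<>"{}|^`\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*'
)

# Left-to-right tokens: IRI candidates, string literals, everything else.
TOKEN = re.compile(r'''
    <(?P<iri>[^>]*)>       # IRI candidate: everything up to the next '>'
  | "(?:[^"\\]|\\.)*"      # string literal, consumed as a whole
  | [^\s<"]+               # blank nodes, '.', '^^', language tags
''', re.VERBOSE)


def invalid_iris(line):
    """Return the IRI bodies on this line that break the IRIREF rule."""
    bad = []
    for match in TOKEN.finditer(line):
        iri = match.group("iri")
        if iri is not None and not VALID_IRI_BODY.fullmatch(iri):
            bad.append(iri)
    return bad


def report(lines):
    """Yield (line_number, bad_iri) pairs for an iterable of lines."""
    for number, line in enumerate(lines, start=1):
        for iri in invalid_iris(line):
            yield number, iri
```

An IRI containing a doubled backslash before `u00C3`, like the one in the error message at the top of this thread, is flagged by this check, while the single-backslash form passes.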

@RubenVerborgh I used the regexps we found earlier to clean the entire LinkedMDB dump while retaining all the triples, so that neither SERD nor the HDT parser throws an error, and the parsing went fine.
However, when I attach the new HDT file to the server, I get an error during setup:
This software cannot open this version of HDT File
I used the new version of the HDT C++ library that you updated today.
Is this a server issue?

Not a server issue, but possibly an outdated HDT-Node version. Can you post your HDT file somewhere so I can check?

Never mind, I found a testcase myself. On it.

@migalkin I found the bug and proposed a fix: rdfhdt/hdt-cpp#43

Summary: you built your HDT file using the latest master, which writes an (in my opinion) incorrect version number into the HDT file. The stable branch does not have this problem.

@migalkin This bug is now fixed; the latest version of hdt-cpp now generates compatible HDT files again.

@RubenVerborgh great, thanks for the update