LinkedDataFragments / HDT-Node

Native bindings for Node.js to access HDT compressed triple files.

Home Page: http://ruben.verborgh.org/blog/2014/09/30/bringing-fast-triples-to-nodejs-with-hdt/

Generate HDT file

LaurensRietveld opened this issue

What do you think about binding the HDT generation procedures of the C++ code as well? Part of the roadmap, or do you consider it out of the scope of this lib?

Since the procedure generally takes a very long time, it can just be invoked from the command line (with or without Node.js) using rdf2hdt. This is different for lookup procedures, where every millisecond matters.

So it's not on the roadmap at the moment, but I'm open to pull requests on this topic.

My C++ skills are lacking, but I can put something together. Some things to discuss first:

  • Naming:

The name 'fromFile' becomes a bit ambiguous if we add rdf->hdt conversion. I suggest we keep the interface as it is for backwards compatibility, and add an rdf2hdt function to do the conversion.

  • Functionality

I've already got an implementation working for file conversion. I propose to do something like this:

rdf2hdt(
    fromFile: string,
    toFile: string,
    config: {createIndex: boolean, baseUri: string, format: string},
    callback: (error: Error, doc: HDTDocument) => void
)

The createIndex flag will do the same trick you mentioned here (rdfhdt/hdt-cpp#6).

Perhaps an optional callback there to give feedback on the HDT generation process. I'm not that familiar with the HDT progress estimation. Is it useful?
Additionally, I guess we should add stream functionality as well. Not that HDT does actual streaming, but it would make it easier to plug into other server-side transformation pipelines. Something like rdf2hdtStream()?
The stream should have the regular events, plus a progress event

  • Error handling

Some errors are not thrown but printed. Right now I do it the hacky way: storing the error stream in a buffer and checking for the Error string. Ideally, these should be properly thrown of course.
As far as the stdout goes, I simply silence these messages by redirecting this stream.
How did you approach this issue in the current implementation?
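The buffer-and-scan workaround described above can be illustrated at the process level. This is only a sketch and not the binding's C++ code: runAndCheck is a hypothetical helper that runs any command, silences stdout, captures stderr, and turns a printed "Error" into a thrown exception.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper (not part of hdt-node): run a command, silence
// stdout, and treat any "Error" text printed on stderr as a failure.
function runAndCheck(command, args) {
  const result = spawnSync(command, args, {
    encoding: 'utf8',
    // stdin ignored, stdout ignored (silenced), stderr captured
    stdio: ['ignore', 'ignore', 'pipe'],
  });
  if (result.error) throw result.error; // the command could not be started
  const stderr = result.stderr || '';
  if (result.status !== 0 || stderr.includes('Error')) {
    throw new Error(stderr.trim() || `${command} exited with status ${result.status}`);
  }
  return stderr; // remaining non-error diagnostics, if any
}
```

With the command-line tool mentioned earlier, a call might look like runAndCheck('rdf2hdt', ['data.nt', 'data.hdt']), assuming the tool takes the input and output paths as positional arguments. String matching on stderr remains fragile, which is why proper exceptions in the library would be preferable.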

  • I would propose generateFile(source, destination, [options], [callback, [progressCallback]]).
    • We can rename hdt.fromFile into hdt.loadFile and release a new major version.
  • A separate progress callback would be useful (but can be added later on).
  • How would the stream function in this context? I don't see many use cases besides writing it to a file.
  • Regarding error handling, we should probably discuss with @bendiken what way we want https://github.com/rdfhdt/hdt-cpp to evolve.

Agreed on naming convention. I'll see whether the progress callback is doable, but I guess it won't be trivial considering the changes that have to be made to the original lib.
And about the stream: you're right in that there won't be many other use cases next to writing it to a file. What I meant was that the HDT generation can be done at the tail of a stream sequence, e.g. after converting nquads to ntriples, or after a file upload stream.
Here, you'd like to simply plug the HDT generation at the end of the stream, instead of writing the preceding RDF to file first.
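One way to picture HDT generation at the tail of a stream is a writable stream that spools its input to a temporary file and then runs a file-based generator. The sketch below is purely illustrative: createHdtSink and the generate(source, destination, callback, progressCallback) function it receives are hypothetical names, not part of this library.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const { Writable } = require('stream');

// Hypothetical sketch: spool incoming RDF to a temporary file, then hand
// that file to a caller-supplied, file-based generator once input ends.
function createHdtSink(destination, generate) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'hdt-'));
  const source = path.join(dir, 'input.nt');
  const spool = fs.createWriteStream(source);

  return new Writable({
    write(chunk, encoding, callback) {
      spool.write(chunk, callback);
    },
    final(callback) {
      // Forward generator progress as a 'progress' event on the stream.
      const onProgress = (value) => this.emit('progress', value);
      // Flush the spool file first, then start the generation.
      spool.end(() => generate(source, destination, (error) => {
        fs.rmSync(dir, { recursive: true, force: true });
        callback(error); // 'finish' fires only after generation is done
      }, onProgress));
    },
  });
}
```

An upstream pipeline could then end with something like upload.pipe(createHdtSink('data.hdt', generate)). The RDF is still written to disk before conversion, so this is a convenience for pipelines rather than true streaming.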

And yes, we should discuss error handling. The simplest approach would be to simply replace the cerr statements with throw statements, and add catches where needed. There are probably some exceptions that should not halt the HDT process (parsing errors?). Not sure how to deal with that scenario yet.

Got a first implementation ready: master...OpenTriply:master .
Not enough for a pull request yet, but feedback appreciated.
Also, some issues that need to be resolved:

  • Error handling: as discussed, errors are not thrown but printed. Requires modifications to https://github.com/rdfhdt/hdt-cpp . I'm up for forking it and adding throws where necessary (possibly including a 'verbose' option that falls back to the current behaviour). Won't be ideal to have this library based on a fork though. Ideas?
  • Dependencies: right now only the ntriples input works, as the other parsers aren't included. We could include raptor as dependency. I don't have a complete overview of all the dependencies of hdt-cpp, but I guess there must be others as well. Suggestions on which to include and how to deal with this?

Thanks for this, looks good!

Regarding dependencies, the proposal is to make SERD the default (rdfhdt/hdt-cpp#31), since the built-in parser has several issues. So I would directly use that one.

How do we proceed time-wise? Do we wait for hdt-cpp to proceed, or do we continue already?

@LaurensRietveld, see rdfhdt/hdt-cpp#18 for previous discussion on improving hdt-cpp error handling. If you wanted to take a first stab at this, just please make sure to throw/catch standard exceptions instead of the rather useless const char* values the library currently throws.

Ah, good to know about the expected changes to the HDT parser setup, good stuff. In that case I suggest we wait for progress on the parsing front, and update this fork at a later stage. I'll probably use this fork myself already (after I fix the error handling), and just use the (slightly buggy) built-in ntriples parser.

I'm up for replacing the error messages with proper throw/catch exceptions. I'll fork hdt-cpp and see what I can do.

I've now added the progress callback as well, and updated the hdt-cpp dependency to point to the latest master branch that has improved error handling. There are two issues I can think of that we need to resolve:

  • After writing some tests, the current interface seems a bit cumbersome: generateFile(source, destination, [options], callback, [progressCallback]). Because callback is a required parameter, I propose to move that one a bit: generateFile(source, destination, callback, [options], [progressCallback]). Or alternatively, we could add the progressCallback to the options object (don't really care though)
  • We should probably wait for the serd parser to be integrated in hdt-cpp

What about a PR that we can merge with a feature branch for some more testing?

Thanks, looks good!

  • Note that what I suggested was generateFile(source, destination, [options], [callback, [progressCallback]]), i.e., the callback is optional, and a progress callback can only be attached if the callback argument is also provided. It is not customary in Node.js to have the options after the callback. Implementing the suggested approach will just require some logic like if (typeof options === 'function') progressCallback = callback, callback = options, options = {};.
  • I agree, PR with feature branch is a great idea.
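The argument shuffling described above could look roughly like the sketch below. The convert function is a hypothetical stand-in for the native conversion and only reports completion, so that the handling of optional arguments can be exercised on its own.

```javascript
// Hypothetical stand-in for the native conversion; it reports 100%
// progress and echoes its arguments back through the callback.
function convert(source, destination, options, callback, progressCallback) {
  progressCallback(100);
  callback(null, { source, destination, options });
}

// generateFile(source, destination, [options], [callback, [progressCallback]])
function generateFile(source, destination, options, callback, progressCallback) {
  // Called without options: shift the function arguments one position.
  if (typeof options === 'function') {
    progressCallback = callback;
    callback = options;
    options = {};
  }
  // Fill in defaults for everything the caller left out.
  options = options || {};
  callback = callback || function () {};
  progressCallback = progressCallback || function () {};
  convert(source, destination, options, callback, progressCallback);
}
```

Both generateFile('a.nt', 'a.hdt', done) and generateFile('a.nt', 'a.hdt', { format: 'ntriples' }, done, onProgress) then reach convert with the same argument layout; the option name format is taken from the earlier proposal in this thread.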

great, PR is in

Thanks, following up in #8.