LinkedDataFragments / Client.js

[DEPRECATED] A JavaScript client for Triple Pattern Fragments interfaces.

Home Page: http://linkeddatafragments.org/


Timeout error

ariutta opened this issue · comments

Hello,

Thanks for releasing this project. It seems really promising. I'm looking into using it to give biologists more options for querying the open-source data from our non-profit research group WikiPathways.org.

Right now, the software works great when I query a small subset of our data, but when I try querying a larger dataset, I get a timeout error:

Error: Error: ETIMEDOUT
    at Request.onResponse [as _callback] (/usr/local/share/npm/lib/node_modules/ldf-client/lib/HttpFetcher.js:77:32)
    at self.callback (/usr/local/share/npm/lib/node_modules/ldf-client/node_modules/request/request.js:129:22)
    at Request.EventEmitter.emit (events.js:95:17)
    at null._onTimeout (/usr/local/share/npm/lib/node_modules/ldf-client/node_modules/request/request.js:591:12)
    at Timer.listOnTimeout [as ontimeout] (timers.js:110:15)

Since the examples demonstrate querying DBPedia, I know the software should be able to handle my data, which is 24.3 MB in size. It's currently stored as JSON-LD in an online Mongo instance here. (Caution: 24.3 MB JSON file.)

I'm thinking the problem is either

  1. using JSON-LD on Mongo instead of an actual triplestore, or
  2. putting most of the data for each pathway into an array (e.g. the entities array here) is a bad data shape for efficient queries.

I can run this query when using our pre-production SPARQL endpoint as a datasource, so I'm assuming the main problem is that the software is only intended for small datasets when using JSON-LD as the datasource.

Should I be able to use 24MB of JSON-LD as a datasource, or is that outside the intended usage of the software?

Thanks.

Hi @ariutta,

I'm very curious to hear about your project. Could you drop me a line?
Perhaps we can help you with the infrastructure or make your project a featured use case.

This problem seems to be a server issue (this issue tracker is for the client repository). 24MB is indeed large for a JSON-LD file; I can imagine it takes the server a lot of time to filter all of those triples. The drawback of JSON-LD is that there are no streaming parsers yet (they do exist for Turtle), so the entire file has to be loaded into memory before the first triples can be emitted.

So while the volume of triples is indeed not outside the software's intended usage, it will not be efficient with JSON-LD. Indexed data structures are better at searching for specific triples, which is what basic Linked Data Fragments need. In that sense, JSON-LD is an example to get you up to speed quickly, but not intended for production use (we should document that).
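(A hypothetical mini-store sketches what "indexed data structures" buys you here: keying triples by subject turns "find triples with subject S" into a direct lookup instead of a scan over every triple, which is exactly the access pattern triple pattern lookups need. The class and term names are made up for illustration.)

```javascript
// Toy triple store with a subject index.
class TripleStore {
  constructor() {
    this.triples = [];
    this.bySubject = new Map(); // subject -> triples with that subject
  }
  add(s, p, o) {
    const triple = { s, p, o };
    this.triples.push(triple);
    if (!this.bySubject.has(s)) this.bySubject.set(s, []);
    this.bySubject.get(s).push(triple);
  }
  // Match a triple pattern; null means wildcard.
  // Uses the index when the subject is bound, a full scan otherwise.
  match(s, p, o) {
    const candidates = s !== null ? (this.bySubject.get(s) || []) : this.triples;
    return candidates.filter(t =>
      (p === null || t.p === p) && (o === null || t.o === o));
  }
}

const store = new TripleStore();
store.add('ex:a', 'ex:name', '"A"');
store.add('ex:a', 'ex:type', 'ex:Thing');
store.add('ex:b', 'ex:name', '"B"');
console.log(store.match('ex:a', null, null).length); // 2
```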

In particular, we currently have some internal software (that will eventually be released as open source) to host datasets with high performance, based on HDT. We might be able to give you a preview of that.

Best,

Ruben

Hi @RubenVerborgh, email sent. Thanks!

I'll move this ticket to the server repo.