jacoscaz / quadstore

A LevelDB-backed graph database for JS runtimes (Node.js, Deno, browsers, ...) supporting SPARQL queries and the RDF/JS interface.

Home Page: https://github.com/jacoscaz/quadstore

Performance issues with sparql queries

allforabit opened this issue

Thanks for the great library. It's working very well for me for the most part. However, I'm running into issues with the speed of SPARQL queries, and I'm surprised by how quickly they slow down as the datastore gets bigger. The get methods don't seem to slow down at all and stay within a few hundred milliseconds. I realize that SPARQL queries do a lot more, so some slowdown is to be expected. However, getting a list of, say, the latest 50 items from a store with a few tens of thousands of triples takes up to 30 seconds. Is this the expected amount of time, or am I doing something incorrectly? Are there strategies you could suggest to get around this? Is it a matter of using the other methods (get, etc.) to speed things up?

I've set up a test repo (https://github.com/allforabit/quadstore-sandbox) that generates 10000 realistic-ish entities, resulting in a few tens of thousands of triples, and has a standard SPARQL query to pull in a primary entity and a linked item.

        SELECT ?id ?text ?dateModified ?name ?author_id ?author_name
        WHERE {
          ?id <ex://type> <ex://type/Item>;
              <ex://date-modified> ?dateModified;
              <ex://text> ?text;
              <ex://name> ?name.
          OPTIONAL {
            ?id <ex://author> ?author_id.
            ?author_id <ex://name> ?author_name.
          }
        }
        ORDER BY DESC(?dateModified)
        LIMIT 100
        OFFSET 0

(https://github.com/allforabit/quadstore-sandbox/blob/main/src/query-list.ts)

Any pointers or advice greatly appreciated!

Hello @allforabit !

I believe Comunica is sorting by date in memory, which means it has to go through the entire dataset before it can apply the OFFSET and LIMIT clauses.

This is likely because Comunica, the framework upon which Quadstore's SPARQL engine is built, does not have a way to pass filtering expressions down to quadstore. Operations that the store could optimize are therefore carried out in memory.
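To illustrate the difference described above, here is a minimal TypeScript sketch comparing the two strategies. It is not Comunica or quadstore code: the `Item` type, the function names and the generated data are all hypothetical, and the "index" is simulated with a pre-sorted array.

```typescript
// Hypothetical stand-in for one query solution; not a Comunica or quadstore type.
interface Item {
  id: string;
  dateModified: number; // epoch milliseconds
}

interface PageResult {
  page: Item[];
  itemsRead: number; // how many items had to be pulled from the source
}

// Strategy A: the source returns items in arbitrary order, so everything
// must be read and sorted before OFFSET and LIMIT can be applied.
function sortThenSlice(items: Item[], limit: number, offset: number): PageResult {
  const sorted = [...items].sort((a, b) => b.dateModified - a.dateModified);
  return { page: sorted.slice(offset, offset + limit), itemsRead: items.length };
}

// Strategy B: the source can iterate an index that is already ordered by the
// sort key, so reading stops once offset + limit items have been seen.
function takeFromOrderedIndex(index: Item[], limit: number, offset: number): PageResult {
  const page: Item[] = [];
  let itemsRead = 0;
  for (const item of index) {
    if (page.length >= limit) break;
    itemsRead++;
    if (itemsRead > offset) page.push(item);
  }
  return { page, itemsRead };
}

// Generated sample data (assumption: 10000 items with distinct timestamps).
const sampleItems: Item[] = Array.from({ length: 10000 }, (_, i) => ({
  id: `ex://item/${i}`,
  dateModified: 1600000000000 + i * 1000,
}));

// Simulates an index kept in descending date order (cost paid at write time).
const descendingIndex = [...sampleItems].sort((a, b) => b.dateModified - a.dateModified);

const inMemory = sortThenSlice(sampleItems, 100, 0);
const pushedDown = takeFromOrderedIndex(descendingIndex, 100, 0);
console.log(inMemory.itemsRead, pushedDown.itemsRead); // 10000 100
```

Both strategies return the same page, but only the second one reads a number of items that depends on LIMIT and OFFSET rather than on the size of the dataset.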

However, @rubensworks (maintainer of Comunica) and I are actively working on this and, to this end, we're trying to standardize how filtering expressions may be passed down to supporting RDF/JS sources: https://github.com/rdfjs/query-spec .

Unfortunately, I don't have a timeline for this.

"we're trying to standardize how filtering expressions may be passed down to supporting RDF/JS sources"

This should indeed make this a lot faster.

However, 10K triples is not that much, so 30 seconds sounds like a lot, even without this optimization.
I do see an OPTIONAL in your query. Could you check how fast it is without this OPTIONAL clause?
If it now becomes a lot faster, we might be experiencing the same problem as was reported in comunica/comunica#772.

Thanks both for the speedy replies! Yes @rubensworks, I will try that. I should note that I'm generating 10k entities, each with 6 triples. So it's probably more like 60k triples, which may explain the figures better.

When I change the query to:

        SELECT ?id ?text ?dateModified ?name ?author_id ?author_name
        WHERE {
          ?id <ex://type> <ex://type/Item>;
              <ex://date-modified> ?dateModified;
              <ex://text> ?text;
              <ex://name> ?name.
          ?id <ex://author> ?author_id.
          ?author_id <ex://name> ?author_name.
        }
        ORDER BY DESC(?dateModified)
        LIMIT 100
        OFFSET 0

it actually slows down a bit.

Here's a breakdown of different queries and the time (in milliseconds) that they are taking:

Triple count: 608
╔═════════════════════════════════════════════════╤═════╗
║ With optional                                   │ 449 ║
╟─────────────────────────────────────────────────┼─────╢
║ Without optional                                │ 346 ║
╟─────────────────────────────────────────────────┼─────╢
║ Order By Date Timestamp                         │ 218 ║
╟─────────────────────────────────────────────────┼─────╢
║ Order by name                                   │ 212 ║
╟─────────────────────────────────────────────────┼─────╢
║ Simple (no linked author)                       │ 144 ║
╟─────────────────────────────────────────────────┼─────╢
║ Simple, ordered by timestamp (no linked author) │ 125 ║
╚═════════════════════════════════════════════════╧═════╝
Triple count: 6008
╔═════════════════════════════════════════════════╤══════╗
║ With optional                                   │ 2410 ║
╟─────────────────────────────────────────────────┼──────╢
║ Without optional                                │ 2371 ║
╟─────────────────────────────────────────────────┼──────╢
║ Order By Date Timestamp                         │ 2059 ║
╟─────────────────────────────────────────────────┼──────╢
║ Order by name                                   │ 2016 ║
╟─────────────────────────────────────────────────┼──────╢
║ Simple (no linked author)                       │ 983  ║
╟─────────────────────────────────────────────────┼──────╢
║ Simple, ordered by timestamp (no linked author) │ 1054 ║
╚═════════════════════════════════════════════════╧══════╝
Triple count: 59965
╔═════════════════════════════════════════════════╤═══════╗
║ With optional                                   │ 19868 ║
╟─────────────────────────────────────────────────┼───────╢
║ Without optional                                │ 23222 ║
╟─────────────────────────────────────────────────┼───────╢
║ Order By Date Timestamp                         │ 18635 ║
╟─────────────────────────────────────────────────┼───────╢
║ Order by name                                   │ 18319 ║
╟─────────────────────────────────────────────────┼───────╢
║ Simple (no linked author)                       │ 9370  ║
╟─────────────────────────────────────────────────┼───────╢
║ Simple, ordered by timestamp (no linked author) │ 9151  ║
╚═════════════════════════════════════════════════╧═══════╝

And the actual queries:

# With Optional
    SELECT ?id ?text ?dateModified ?name ?author_id ?author_name
    WHERE {
      ?id <ex://type> <ex://type/Item>;
          <ex://date-modified> ?dateModified;
          <ex://text> ?text;
          <ex://name> ?name.
      OPTIONAL {
        ?id <ex://author> ?author_id.
        ?author_id <ex://name> ?author_name.
      }
    }
    ORDER BY DESC(?dateModified)
    LIMIT 100
    OFFSET 0

# Without optional
    SELECT ?id ?text ?dateModified ?name ?author_id ?author_name
    WHERE {
      ?id <ex://type> <ex://type/Item>;
          <ex://date-modified> ?dateModified;
          <ex://text> ?text;
          <ex://name> ?name.
      ?id <ex://author> ?author_id.
      ?author_id <ex://name> ?author_name.
    }
    ORDER BY DESC(?dateModified)
    LIMIT 100
    OFFSET 0

# No Author
    SELECT ?id ?text ?name
    WHERE {
      ?id <ex://type> <ex://type/Item>;
          <ex://date-modified> ?dateModified;
          <ex://text> ?text;
          <ex://name> ?name.
    }
    ORDER BY DESC(?dateModified)
    LIMIT 100
    OFFSET 0

# No author ordered by timestamp
    SELECT ?id ?text ?name
    WHERE {
      ?id <ex://type> <ex://type/Item>;
          <ex://date-modified-timestamp> ?dateModified;
          <ex://text> ?text;
          <ex://name> ?name.
    }
    ORDER BY DESC(?dateModified)
    LIMIT 100
    OFFSET 0

# Order by date timestamp
    SELECT ?id ?text ?dateModified ?name ?author_id ?author_name
    WHERE {
      ?id <ex://type> <ex://type/Item>;
          <ex://date-modified-timestamp> ?dateModified;
          <ex://text> ?text;
          <ex://name> ?name.
      OPTIONAL {
        ?id <ex://author> ?author_id.
        ?author_id <ex://name> ?author_name.
      }
    }
    ORDER BY ?dateModified
    LIMIT 100
    OFFSET 0

# Order By Name
    SELECT ?id ?text ?dateModified ?name ?author_id ?author_name
    WHERE {
      ?id <ex://type> <ex://type/Item>;
          <ex://date-modified-timestamp> ?dateModified;
          <ex://text> ?text;
          <ex://name> ?name.
      OPTIONAL {
        ?id <ex://author> ?author_id.
        ?author_id <ex://name> ?author_name.
      }
    }
    ORDER BY ?name
    LIMIT 100
    OFFSET 0

I've updated the repo here with the new queries: https://github.com/allforabit/quadstore-sandbox

Just to clarify, @jacoscaz: does it matter that it's a date type? Once the results are ordered and the query uses LIMIT and OFFSET, will it have the same performance characteristics?

There was a search method on quadstore a while back. Do you think this would have the same issues? If that's not the case I might try to integrate it into my project as a temporary workaround.

@allforabit

I'm not sure I understand your first question, but if you're asking whether using date literals may incur performance penalties once we're done with the new RDF/JS spec: no, it should not. Quadstore indexes common numeric-ish literals (numbers, dates) using lexicographical representations that allow range-based queries to be passed down to the persistence layer.
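As a rough illustration of the idea of lexicographical representations, the TypeScript sketch below encodes timestamps as fixed-width, zero-padded strings so that string order matches numeric order. This is an assumed, simplified encoding and not quadstore's actual one; the key layout, `encodeTimestamp`, `makeKey` and `rangeScan` are hypothetical, and the range scan over a sorted array stands in for what a sorted key-value store would do with a seek.

```typescript
// Assumed encoding: fixed-width, zero-padded decimal for non-negative integers.
// Quadstore's real encoding may differ; this only shows the principle.
const WIDTH = 16;

function encodeTimestamp(ms: number): string {
  if (!Number.isInteger(ms) || ms < 0) {
    throw new RangeError("this sketch only handles non-negative integers");
  }
  return String(ms).padStart(WIDTH, "0");
}

// Hypothetical index key layout: "<predicate>|<encoded value>|<subject>".
function makeKey(encodedValue: string, subject: string): string {
  return `date-modified|${encodedValue}|${subject}`;
}

// Stand-in for a range scan in a sorted key-value store: gte <= key < lt.
// A real store would seek to the first key instead of filtering all of them.
function rangeScan(sortedKeys: string[], gte: string, lt: string): string[] {
  return sortedKeys.filter((key) => key >= gte && key < lt);
}

// Plain decimal strings do not sort numerically, but padded ones do.
console.log("9" < "10"); // false
console.log(encodeTimestamp(9) < encodeTimestamp(10)); // true

const dates = ["2021-01-05", "2021-02-10", "2021-03-15", "2021-03-19"];
const sortedKeys = dates
  .map((date, i) => makeKey(encodeTimestamp(Date.parse(date)), `ex://item/${i}`))
  .sort();

// All items modified from 2021-02-01 (inclusive) to 2021-03-16 (exclusive).
const lowerBound = `date-modified|${encodeTimestamp(Date.parse("2021-02-01"))}`;
const upperBound = `date-modified|${encodeTimestamp(Date.parse("2021-03-16"))}`;
const hits = rangeScan(sortedKeys, lowerBound, upperBound);
console.log(hits.length); // 2
```

Because the keys are ordered by value, a date range becomes a contiguous slice of the index, which is what makes it possible to hand the range to the persistence layer instead of filtering in memory.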

WRT the search method, that comes from back when quadstore still offered a non-RDF API and a very basic internal query engine. We've moved away from both of those in favor of a smaller surface area and a greater level of integration with the RDF/JS community. Porting it to our current codebase would be a non-trivial effort, I think.

Those are interesting results! So it looks like the number of triples really is the main bottleneck.
In this case, the filter/order pushdown seems like the only way to properly optimize this.

@jacoscaz thanks! My question was more about the current version and whether the type being sorted on makes a difference. I think the performance test answers that: it doesn't. It's great that dates will work similarly to other primitives once the new approach is added. I've been playing around with node quadstore for a while and really like the direction of aligning with RDF standards and SPARQL. It took me a while to get my head around it when there was both an RDF version of the store and a non-RDF one, so really great work on that front :-)

@rubensworks yes, it seems to grow more or less linearly with the number of triples. I also noticed that it doesn't really matter what you set the limit to; the query always takes about the same time.
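As a quick check of the "more or less linear" observation, the TypeScript sketch below divides the "With optional" timings reported earlier in this thread by the corresponding triple counts. The variable names are illustrative; the numbers are copied from the tables above.

```typescript
// "With optional" measurements from the tables above (time in milliseconds).
const withOptional = [
  { triples: 608, ms: 449 },
  { triples: 6008, ms: 2410 },
  { triples: 59965, ms: 19868 },
];

// Cost per triple; a roughly constant value indicates roughly linear growth.
const perTriple = withOptional.map(({ triples, ms }) => ({
  triples,
  msPerTriple: ms / triples,
}));

for (const { triples, msPerTriple } of perTriple) {
  console.log(`${triples} triples: ${msPerTriple.toFixed(3)} ms per triple`);
}

// Growth factors between the two largest datasets.
const tripleGrowth = withOptional[2].triples / withOptional[1].triples;
const timeGrowth = withOptional[2].ms / withOptional[1].ms;
console.log(`triples x${tripleGrowth.toFixed(2)}, time x${timeGrowth.toFixed(2)}`);
```

For the two larger datasets the cost per triple stays within a few tenths of a millisecond, while the smallest dataset is likely dominated by fixed overhead.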

If there's anything I can do to help get this feature landed, please let me know. My background is as a web developer, so I might struggle with some of the lower-level stuff, but I can definitely help with testing and other tasks.

How feasible do you think it would be for me to take an initial stab at implementing the feature? I would probably need some guidance (the general approach, relevant files, etc.). I know it may be a bit too involved, given that it spans the two projects.

@allforabit happy to hear you appreciate Quadstore's overall direction as a project. WRT working on this, I should mention that I've never built something quite like Quadstore before and I am relatively new to RDF, too. I guess we're in the same boat!

That said, Comunica is a pretty complex project and I've spent some time doing much smaller tasks to get familiar with it, mostly related to bundle size and typings. I suggest doing the same. @rubensworks helped a lot, too, and documents a lot of things at https://comunica.dev .

As for quadstore, I've started experimenting with the new spec in https://github.com/beautifulinteractions/node-quadstore/tree/rdfjs/expression . Quadstore is a much simpler project than Comunica; it doesn't take long to get a feel for where things are, but I'd be happy to take you through the code in a call if needed. I don't have a super-clear picture of "the" way to add support for the new RDF/JS spec; I'm learning as I go myself. Unfortunately, sometimes higher-priority issues come in (like #134 ) and they eat away at the time allotted for implementing the new spec.

@jacoscaz that's very impressive that quadstore is your first foray into rdf! I'm mostly just trying to wrap my head around how to use the different libs and getting an understanding of the underlying technologies. Thankfully I'm getting much more proficient with it all lately.

Yes, that's the impression I get: Comunica is a fairly big and complex project. That's good advice. I'll have a play around with some custom configurations and try to get a feel for how it all ties together. It sounds like it might be a little unrealistic for me to work on the Comunica side, but I would be very happy to help on the node quadstore end of things, particularly for this feature.

A call would be great if it's not too much trouble. Let me know when it would suit for this, fairly flexible on my end :-)

@allforabit apologies for the delay, busy weeks. What about tomorrow morning, March 19th @ 11:00 AM UTC+1 ?

This sounds good, I'm UTC+0 so close to your timezone (Ireland). Let me know how to best get in contact with you.

@allforabit I'm closing this one to keep the conversation in #115 .