semagrow / semagrow

A SPARQL query federator of heterogeneous data sources

Home Page: https://semagrow.github.io

Problem with semagrow optimizer

nefelipk opened this issue

The Semagrow optimizer does not always produce the optimal plan.
Both sub-plans are estimated to have the same cost, although their actual costs differ.
Take, for example, the following query:

<http://deg.iit.demokritos.gr/invekos/resource/462503>
<http://www.opengis.net/ont/geosparql#hasGeometry> ?geom .
?geom <http://www.opengis.net/ont/geosparql#asWKT> ?wkt .

Sometimes it evaluates the #hasGeometry pattern first and then the #asWKT pattern, which is the optimal order:

BindJoin
      Plan@local-semagrow[costs [5020.05257,0] 2005257 tuples]
         SourceQuery (source = http://invekos-virtuoso.default.svc.cluster.local:8890/sparql)
            Plan@http://invekos-virtuoso.default.svc.cluster.local:8890/sparql[costs [2005257,0] 2005257 tuples]
               StatementPattern
                  Var (name=_const_6be958d2_uri, value=http://deg.iit.demokritos.gr/invekos/resource/462503, anonymous)
                  Var (name=_const_30c2a947_uri, value=http://www.opengis.net/ont/geosparql#hasGeometry, anonymous)
                  Var (name=i_geom)
      Plan@local-semagrow[costs [5020.05257,0] 2005257 tuples]
         SourceQuery (source = http://ontop.default.svc.cluster.local:8080/sparql)
            Plan@http://ontop.default.svc.cluster.local:8080/sparql[costs [2005257,0] 2005257 tuples]
               StatementPattern
                  Var (name=i_geom)
                  Var (name=_const_7ca52e9_uri, value=http://www.opengis.net/ont/geosparql#asWKT, anonymous)
                  Var (name=i_wkt)

Other times it chooses the reverse order:

BindJoin
      Plan@local-semagrow[costs [5020.05257,0] 2005257 tuples]
         SourceQuery (source = http://ontop.default.svc.cluster.local:8080/sparql)
            Plan@http://ontop.default.svc.cluster.local:8080/sparql[costs [2005257,0] 2005257 tuples]
               StatementPattern
                  Var (name=i_geom)
                  Var (name=_const_7ca52e9_uri, value=http://www.opengis.net/ont/geosparql#asWKT, anonymous)
                  Var (name=i_wkt)
      Plan@local-semagrow[costs [5020.05257,0] 2005257 tuples]
         SourceQuery (source = http://invekos-virtuoso.default.svc.cluster.local:8890/sparql)
            Plan@http://invekos-virtuoso.default.svc.cluster.local:8890/sparql[costs [2005257,0] 2005257 tuples]
               StatementPattern
                  Var (name=_const_6be958d2_uri, value=http://deg.iit.demokritos.gr/invekos/resource/462503, anonymous)
                  Var (name=_const_30c2a947_uri, value=http://www.opengis.net/ont/geosparql#hasGeometry, anonymous)
                  Var (name=i_geom)
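
To confirm the tie directly from the plan output, the source-level Plan@ lines can be compared mechanically. The following is a minimal Python sketch, not part of Semagrow: the regular expression and the helper name parse_plan are assumptions based only on the dump format shown in this issue, and the meaning of the second number inside "costs [...]" is not stated here, so it is matched but ignored.

```python
import re

# Source-level plan headers copied verbatim from the two dumps in this issue.
PLAN_LINES = [
    "Plan@http://invekos-virtuoso.default.svc.cluster.local:8890/sparql"
    "[costs [2005257,0] 2005257 tuples]",
    "Plan@http://ontop.default.svc.cluster.local:8080/sparql"
    "[costs [2005257,0] 2005257 tuples]",
]

# Assumed format: Plan@<site>[costs [<cost>,<other>] <n> tuples]
PLAN_RE = re.compile(
    r"Plan@(?P<site>\S+?)\[costs \[(?P<cost>[\d.]+),[\d.]+\] (?P<tuples>\d+) tuples\]"
)


def parse_plan(line):
    """Return (site, cost, tuples) for one Plan@ header line (hypothetical helper)."""
    match = PLAN_RE.search(line)
    if match is None:
        raise ValueError(f"not a plan header: {line!r}")
    return match["site"], float(match["cost"]), int(match["tuples"])


parsed = [parse_plan(line) for line in PLAN_LINES]
# Collect the distinct (cost, tuples) pairs; a single pair means the estimates tie.
distinct_estimates = {(cost, tuples) for _, cost, tuples in parsed}
estimates_tie = len(distinct_estimates) == 1
```

With the lines above, both endpoints report the same cost and the same cardinality, so an optimizer that only compares these numbers has no basis for preferring one join order.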

This behavior occurs because the cost estimator assigns exactly the same cost to both subqueries, so the optimizer chooses randomly between the two plans. However, the first triple pattern in your query seems to have a smaller result, so I think this unexpected behavior is more of a cost estimation issue.
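
As an illustration of how such a tie could be made deterministic, here is a small Python sketch of a tie-breaking heuristic: when two patterns have the same estimated cardinality, prefer the one with more bound (constant) terms as the outer side of the bind join. This is not Semagrow's implementation; the TriplePattern class and the function names are hypothetical, and the heuristic only works around the tie rather than fixing the underlying cardinality estimate.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TriplePattern:
    """Hypothetical stand-in for a statement pattern; None marks a variable."""
    subject: Optional[str]
    predicate: Optional[str]
    object: Optional[str]
    estimated_tuples: int


def bound_terms(pattern: TriplePattern) -> int:
    """Count the positions that hold a constant rather than a variable."""
    terms = (pattern.subject, pattern.predicate, pattern.object)
    return sum(term is not None for term in terms)


def order_for_bind_join(a: TriplePattern, b: TriplePattern) -> Tuple[TriplePattern, ...]:
    """Return (outer, inner): lower estimate first, ties broken by more bound terms."""
    return tuple(sorted((a, b), key=lambda p: (p.estimated_tuples, -bound_terms(p))))


# The two patterns from the query, with the identical estimates reported in the plans.
has_geometry = TriplePattern(
    subject="http://deg.iit.demokritos.gr/invekos/resource/462503",
    predicate="http://www.opengis.net/ont/geosparql#hasGeometry",
    object=None,
    estimated_tuples=2005257,
)
as_wkt = TriplePattern(
    subject=None,
    predicate="http://www.opengis.net/ont/geosparql#asWKT",
    object=None,
    estimated_tuples=2005257,
)

outer, inner = order_for_bind_join(as_wkt, has_geometry)
```

With equal estimates, the #hasGeometry pattern (two bound terms) is always chosen as the outer side, regardless of argument order. A more accurate estimate for patterns with a bound subject would make the tie-break unnecessary.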