Problem with semagrow optimizer
nefelipk opened this issue
The Semagrow optimizer does not always produce the optimal plan.
The cost of the two sub-plans is estimated to be the same, although in practice it is not.
Take, for example, the following query:
<http://deg.iit.demokritos.gr/invekos/resource/462503>
<http://www.opengis.net/ont/geosparql#hasGeometry> ?geom .
?geom <http://www.opengis.net/ont/geosparql#asWKT> ?wkt .
Sometimes it chooses to evaluate #hasGeometry first and then #asWKT, which is the actual optimal plan:
BindJoin
   Plan@local-semagrow[costs [5020.05257,0] 2005257 tuples]
      SourceQuery (source = http://invekos-virtuoso.default.svc.cluster.local:8890/sparql)
         Plan@http://invekos-virtuoso.default.svc.cluster.local:8890/sparql[costs [2005257,0] 2005257 tuples]
            StatementPattern
               Var (name=_const_6be958d2_uri, value=http://deg.iit.demokritos.gr/invekos/resource/462503, anonymous)
               Var (name=_const_30c2a947_uri, value=http://www.opengis.net/ont/geosparql#hasGeometry, anonymous)
               Var (name=i_geom)
   Plan@local-semagrow[costs [5020.05257,0] 2005257 tuples]
      SourceQuery (source = http://ontop.default.svc.cluster.local:8080/sparql)
         Plan@http://ontop.default.svc.cluster.local:8080/sparql[costs [2005257,0] 2005257 tuples]
            StatementPattern
               Var (name=i_geom)
               Var (name=_const_7ca52e9_uri, value=http://www.opengis.net/ont/geosparql#asWKT, anonymous)
               Var (name=i_wkt)
And other times vice versa:
BindJoin
   Plan@local-semagrow[costs [5020.05257,0] 2005257 tuples]
      SourceQuery (source = http://ontop.default.svc.cluster.local:8080/sparql)
         Plan@http://ontop.default.svc.cluster.local:8080/sparql[costs [2005257,0] 2005257 tuples]
            StatementPattern
               Var (name=i_geom)
               Var (name=_const_7ca52e9_uri, value=http://www.opengis.net/ont/geosparql#asWKT, anonymous)
               Var (name=i_wkt)
   Plan@local-semagrow[costs [5020.05257,0] 2005257 tuples]
      SourceQuery (source = http://invekos-virtuoso.default.svc.cluster.local:8890/sparql)
         Plan@http://invekos-virtuoso.default.svc.cluster.local:8890/sparql[costs [2005257,0] 2005257 tuples]
            StatementPattern
               Var (name=_const_6be958d2_uri, value=http://deg.iit.demokritos.gr/invekos/resource/462503, anonymous)
               Var (name=_const_30c2a947_uri, value=http://www.opengis.net/ont/geosparql#hasGeometry, anonymous)
               Var (name=i_geom)
This behavior occurs because both subqueries have exactly the same cost according to the cost estimator, so the optimizer chooses randomly between the two plans. However, the first triple pattern in your query seems to have a smaller result, so I guess this unexpected behavior is more of a cost estimation issue.
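
To make the tie concrete, here is a minimal Python sketch. It is not Semagrow's actual code: the cost and tuple figures are copied from the plans above, while `pick_first`, `most_bound_terms`, and the idea of ordering by the number of bound terms are hypothetical and only illustrate why equal estimates lead to an arbitrary join order, and how a selectivity hint could break the tie.

```python
import random

# Estimates copied from the two Plan@local-semagrow nodes in the plans above.
# Both subqueries get the same cost and cardinality, although the
# hasGeometry pattern has a constant subject and asWKT does not.
subqueries = [
    {"pattern": "hasGeometry", "bound_terms": 2, "cost": 5020.05257, "tuples": 2005257},
    {"pattern": "asWKT", "bound_terms": 1, "cost": 5020.05257, "tuples": 2005257},
]

def pick_first(subqueries, tie_break=None):
    """Pick the subquery to evaluate first (assumed: the cheapest one).

    When several subqueries share the lowest cost, the choice is random
    unless a tie_break key function is given.
    """
    lowest = min(s["cost"] for s in subqueries)
    cheapest = [s for s in subqueries if s["cost"] == lowest]
    if tie_break is None:
        return random.choice(cheapest)  # arbitrary, as observed in the issue
    return max(cheapest, key=tie_break)

def most_bound_terms(subquery):
    """Hypothetical heuristic: more constants suggests a smaller result."""
    return subquery["bound_terms"]

# Without a tie-break either pattern may come first; with it, the order is fixed.
print(pick_first(subqueries)["pattern"])
print(pick_first(subqueries, most_bound_terms)["pattern"])
```

With the tie-break the second call always returns the #hasGeometry subquery, which matches the plan reported as optimal. A real fix would more likely belong in the cardinality estimate itself, so that the pattern with the constant subject is not assigned 2005257 tuples in the first place.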