semagrow / semagrow

A SPARQL query federator of heterogeneous data sources

Home Page:https://semagrow.github.io

SemaGrow only seems to use 4 out of 5 configured SPARQL endpoints

natancox opened this issue

I have two simple queries and both only use 4 out of 5 endpoints. Not one (which I expected) or all 5 (explainable) but 4?

One ignores the http://rdfstoreomv-on-1.vm.cumuli.be:3030/blazegraph/namespace/cbb/sparql endpoint and the other ignores the http://rdfstoreomv-on-2.vm.cumuli.be:3030/rdfstoreomv/archive/query endpoint.

To give a bit more detail: a simple query like

PREFIX qb: <http://purl.org/linked-data/cube#> 
SELECT * 
WHERE {
  ?x a qb:Observation.
}

produces this execution plan.

Note: http://rdfstoreomv-on-2.vm.cumuli.be:3030/rdfstoreomv/archive/query is present!

Plan@local-semagrow[costs [20002.34168,0] 99 tuples]
   Slice ( limit=100 )
      Plan@local-semagrow[costs [20002.34168,0] 234168 tuples]
         Union
            Plan@local-semagrow[costs [15001.75626,0] 175626 tuples]
               Union
                  Plan@local-semagrow[costs [10001.17084,0] 117084 tuples]
                     Union
                        Plan@local-semagrow[costs [5000.58542,0] 58542 tuples]
                           SourceQuery (source = http://data.vlaanderen.be/sparql)
                              Plan@http://data.vlaanderen.be/sparql[costs [58542,0] 58542 tuples]
                                 StatementPattern
                                    Var (name=x)
                                    Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
                                    Var (name=_const_4cfead57_uri, value=http://purl.org/linked-data/cube#Observation, anonymous)
                        Plan@local-semagrow[costs [5000.58542,0] 58542 tuples]
                           SourceQuery (source = http://rdfstoreomv-on-2.vm.cumuli.be:3030/rdfstoreomv/archive/query)
                              Plan@http://rdfstoreomv-on-2.vm.cumuli.be:3030/rdfstoreomv/archive/query[costs [58542,0] 58542 tuples]
                                 StatementPattern
                                    Var (name=x)
                                    Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
                                    Var (name=_const_4cfead57_uri, value=http://purl.org/linked-data/cube#Observation, anonymous)
                  Plan@local-semagrow[costs [5000.58542,0] 58542 tuples]
                     SourceQuery (source = http://data.kbodata.be/sparql)
                        Plan@http://data.kbodata.be/sparql[costs [58542,0] 58542 tuples]
                           StatementPattern
                              Var (name=x)
                              Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
                              Var (name=_const_4cfead57_uri, value=http://purl.org/linked-data/cube#Observation, anonymous)
            Plan@local-semagrow[costs [5000.58542,0] 58542 tuples]
               SourceQuery (source = http://id.fedstats.be/sparql)
                  Plan@http://id.fedstats.be/sparql[costs [58542,0] 58542 tuples]
                     StatementPattern
                        Var (name=x)
                        Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
                        Var (name=_const_4cfead57_uri, value=http://purl.org/linked-data/cube#Observation, anonymous)

which is strange because I have configured 5 endpoints.

However, if I run this query

PREFIX milieu: <http://id.milieuinfo.be/def#>
 
SELECT * 
WHERE {
   ?x a milieu:Exploitant.
}

I get a different set of endpoints that are queried.

Note: http://rdfstoreomv-on-2.vm.cumuli.be:3030/rdfstoreomv/archive/query is NOT present!

Plan@local-semagrow[costs [20003.87500,0] 99 tuples]
   Slice ( limit=100 )
      Plan@local-semagrow[costs [20003.87500,0] 387500 tuples]
         Union
            Plan@local-semagrow[costs [15002.90625,0] 290625 tuples]
               Union
                  Plan@local-semagrow[costs [10001.93750,0] 193750 tuples]
                     Union
                        Plan@local-semagrow[costs [5000.96875,0] 96875 tuples]
                           SourceQuery (source = http://data.vlaanderen.be/sparql)
                              Plan@http://data.vlaanderen.be/sparql[costs [96875,0] 96875 tuples]
                                 StatementPattern
                                    Var (name=x)
                                    Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
                                    Var (name=_const_135bf350_uri, value=http://id.milieuinfo.be/def#Exploitant, anonymous)
                        Plan@local-semagrow[costs [5000.96875,0] 96875 tuples]
                           SourceQuery (source = http://rdfstoreomv-on-1.vm.cumuli.be:3030/blazegraph/namespace/cbb/sparql)
                              Plan@http://rdfstoreomv-on-1.vm.cumuli.be:3030/blazegraph/namespace/cbb/sparql[costs [96875,0] 96875 tuples]
                                 StatementPattern
                                    Var (name=x)
                                    Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
                                    Var (name=_const_135bf350_uri, value=http://id.milieuinfo.be/def#Exploitant, anonymous)
                  Plan@local-semagrow[costs [5000.96875,0] 96875 tuples]
                     SourceQuery (source = http://data.kbodata.be/sparql)
                        Plan@http://data.kbodata.be/sparql[costs [96875,0] 96875 tuples]
                           StatementPattern
                              Var (name=x)
                              Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
                              Var (name=_const_135bf350_uri, value=http://id.milieuinfo.be/def#Exploitant, anonymous)
            Plan@local-semagrow[costs [5000.96875,0] 96875 tuples]
               SourceQuery (source = http://id.fedstats.be/sparql)
                  Plan@http://id.fedstats.be/sparql[costs [96875,0] 96875 tuples]
                     StatementPattern
                        Var (name=x)
                        Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
                        Var (name=_const_135bf350_uri, value=http://id.milieuinfo.be/def#Exploitant, anonymous)

Hi @natancox,
Semagrow performs an ASK query prior to query planning and prunes sources that do not seem to satisfy the query, which hopefully results in a more efficient plan. Can you check whether this is the case by issuing the following queries against every configured SPARQL endpoint separately?

PREFIX qb: <http://purl.org/linked-data/cube#> 
ASK { ?x a qb:Observation. }

and

PREFIX milieu: <http://id.milieuinfo.be/def#>
ASK { ?x a milieu:Exploitant. }
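The pruning step described above can be sketched roughly as follows. This is only an illustration, not Semagrow's actual implementation: the function names `parse_ask_result` and `select_sources` are hypothetical, and the rule "drop a source only when its ASK answer is an explicit false" is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional

# Namespace used by the SPARQL Query Results XML Format.
SPARQL_NS = "{http://www.w3.org/2005/sparql-results#}"


def parse_ask_result(body: str) -> Optional[bool]:
    """Return True/False for a SPARQL XML ASK result, or None if the
    body is not a usable ASK result (HTML page, error text, ...)."""
    try:
        root = ET.fromstring(body.encode("utf-8"))
    except ET.ParseError:
        return None
    node = root.find(f"{SPARQL_NS}boolean")
    if node is None or node.text is None:
        return None
    text = node.text.strip().lower()
    if text == "true":
        return True
    if text == "false":
        return False
    return None


def select_sources(
    endpoints: List[str],
    ask: Callable[[str], Optional[bool]],
) -> List[str]:
    """Keep every endpoint whose ASK answer is not an explicit False.

    Assumption: endpoints that fail to answer (None) stay in the plan.
    """
    return [ep for ep in endpoints if ask(ep) is not False]
```

Here `ask` is any callable that takes an endpoint URL and returns the parsed ASK answer, so the selection logic can be tried out without network access.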

Thanks for the quick reply. Checking the endpoints individually is a smart move. Three of the endpoints are public, so it should be easy to check.

And I noticed 2 out of 3 seem to be offline. The other is probably always returning HTML. I will try to get them behaving nicely before I bother you again.

Hi @natancox,

I'll close the bug for now, but please feel free to get back to us if you still have problems.

s

Hello @stasinos, it seems I am still having an issue. I will create a separate bug-report for it. But for completeness I will list my investigations.

I did some more checks. As you suggested, I tested all endpoints and appended

?query=PREFIX%20qb%3A%20%3Chttp%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23%3E%20%0AASK%20%7B%20%3Fx%20a%20qb%3AObservation.%20%7D

to each of the endpoints.
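For reference, a query string like the one above can be generated rather than typed by hand. The following is a minimal sketch using only the Python standard library; the helper name `ask_url` and the `http://example.org/sparql` endpoint in the usage note are placeholders, not part of the setup described here.

```python
from urllib.parse import quote

# The ASK query suggested earlier in the thread, including the line break
# between the PREFIX declaration and the ASK clause.
ASK_OBSERVATION = (
    "PREFIX qb: <http://purl.org/linked-data/cube#> \n"
    "ASK { ?x a qb:Observation. }"
)


def ask_url(endpoint: str, query: str) -> str:
    """Append a percent-encoded SPARQL query to an endpoint URL."""
    # safe='' forces '/', ':' and '#' inside the query to be encoded too.
    return f"{endpoint}?query={quote(query, safe='')}"
```

Calling `ask_url("http://example.org/sparql", ASK_OBSERVATION)` yields the endpoint followed by the same `?query=...` string as above.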

1) http://rdfstoreomv-on-1.vm.cumuli.be:3030/blazegraph/namespace/cbb/sparql

<?xml version='1.0' encoding='UTF-8'?>
<sparql xmlns='http://www.w3.org/2005/sparql-results#'>
	<head>
	</head>
	<boolean>false</boolean>
</sparql>

This returns false, and the endpoint is indeed being ignored.

2) http://id.fedstats.be/sparql

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>406 Not Acceptable</title>
</head><body>
<h1>406 Not Acceptable</h1>
<p>An appropriate representation of the requested resource sparql could not be found on this server.</p>
Available variant(s):
<ul>
<li><a href="sparql">sparql</a> , type text/html, charset UTF-8</li>
</ul>
</body></html>

Seems to be an old Virtuoso instance.

3) http://rdfstoreomv-on-1.vm.cumuli.be:3030/blazegraph/namespace/lne/sparql

Seems to be ok.

<?xml version='1.0' encoding='UTF-8'?>
<sparql xmlns='http://www.w3.org/2005/sparql-results#'>
	<head>
	</head>
	<boolean>true</boolean>
</sparql>

4) http://data.kbodata.be/sparql

404 Resource not found

So, again, not a bug.

5) http://data.vlaanderen.be/sparql

Returns a full webpage and is by default not machine-readable!
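Since several endpoints answer with HTML or a 406, it may be worth repeating the checks with an explicit `Accept` header instead of a plain browser request. The sketch below shows one way to build such a request and to classify the response by its `Content-Type`. Whether it changes the behaviour of these particular endpoints is untested, and the helper names `build_ask_request` and `is_machine_readable` are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Registered media types for SPARQL query results.
SPARQL_RESULT_TYPES = (
    "application/sparql-results+xml",
    "application/sparql-results+json",
)


def build_ask_request(endpoint: str, query: str) -> Request:
    """Build a GET request that explicitly asks for SPARQL result formats."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": ", ".join(SPARQL_RESULT_TYPES)})


def is_machine_readable(content_type: str) -> bool:
    """True if a Content-Type header names a SPARQL results format."""
    # Drop parameters such as '; charset=UTF-8' before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in SPARQL_RESULT_TYPES
```

The request can be sent with `urllib.request.urlopen(build_ask_request(endpoint, query))`, which raises `urllib.error.HTTPError` for responses such as 404 or 406.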