danyaljj / SPARQL-Experiments

Here I collect some experiments and tests with DBPedia and its SPARQL interface.

SPARQL, made easy.

The following two points made me create this document:

  • RDFs are everywhere: large resources like WikiData and DBPedia, built by refining Wikipedia, are very useful for NLP research.
  • Documentation for RDF query tools is a mess: there is a lot of information about using them spread around the web, and some of it is erroneous. I wanted something simple and handy that I can easily refer to whenever I need it.

For most of this I am using the SPARQL query language.

Query endpoints

There are many online tools to run your queries:

So far YASGUI has been my favorite: a generic SPARQL editor that can be used to query any desired endpoint.

Basic Notation of SPARQL

Prefixes

Prefixes help shorten queries. In other words, instead of writing full URLs, we define prefixes for them to make the query shorter. All prefix URLs/URIs that do not contain a hostname are prefixed with the hostname of the generating wiki.

Here is an example URI, as it would appear if used directly in the script:

<http://this.is.a/full/URI/written#out>

Instead we define the following prefix

PREFIX foo: <http://this.is.a/URI/prefix#>

and later in the code we do:

... foo:bar ... 

where bar is a concept/page/entity/etc defined on the target domain defined by foo.
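To make the expansion concrete, here is a minimal Python sketch of what a SPARQL processor does with a prefixed name. The helper name expand_prefixed_name and the PREFIXES table are made up for illustration; they are not part of any SPARQL library.

```python
def expand_prefixed_name(name, prefixes):
    """Expand a prefixed name such as 'foo:bar' into a full URI string."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {name!r}")
    # The full URI is simply the prefix URI followed by the local part.
    return prefixes[prefix] + local


# Same prefix as in the PREFIX declaration above.
PREFIXES = {"foo": "http://this.is.a/URI/prefix#"}

print(expand_prefixed_name("foo:bar", PREFIXES))
```

So foo:bar is just shorthand for the prefix URI with bar appended.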

Here is the list of prefixes for DBPedia. Also here is a similar list for WikiData. There is this website to look up important global prefix names.

Variables

Variables are indicated by a "?" or "$" prefix. For example:

?var1, ?anotherVar, ?and_one_more

Comments

You can add comments in your code, by using the # prefix:

# This is a comment, ye ye, yo yo, ye ye ... 

Literals

  • Plain literals: "a plain literal"
  • Plain literal with language tag: "bonjour"@fr
  • Typed literal: "13"^^xsd:integer
  • Some of these typed literals have shortcuts; here are some examples:
    • true is the same as "true"^^xsd:boolean
    • 3 is the same as "3"^^xsd:integer
    • 4.2 is the same as "4.2"^^xsd:decimal
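When queries are assembled in code, values have to be written out in one of the literal forms above. The following is a small Python sketch of such a conversion; the function name to_sparql_literal is hypothetical, and it covers only the literal kinds listed here.

```python
from decimal import Decimal


def to_sparql_literal(value, lang=None):
    """Render a Python value as the long form of a SPARQL literal."""
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return f'"{str(value).lower()}"^^xsd:boolean'
    if isinstance(value, int):
        return f'"{value}"^^xsd:integer'
    if isinstance(value, Decimal):
        return f'"{value}"^^xsd:decimal'
    if isinstance(value, str):
        # Escape backslashes and double quotes inside the string.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"@{lang}' if lang else f'"{escaped}"'
    raise TypeError(f"unsupported literal type: {type(value).__name__}")


print(to_sparql_literal(True))
print(to_sparql_literal(3))
print(to_sparql_literal(Decimal("4.2")))
print(to_sparql_literal("bonjour", lang="fr"))
```

Decimal is used instead of float on purpose: a Python float would correspond more closely to xsd:double than to xsd:decimal.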

Important note: SPARQL is case sensitive (because RDF is case sensitive). For example, DBpedia uses the convention that property names start with a lower case letter (e.g. dbpedia-owl:country for "the country belonging to X is ...") and class names start with an upper case letter (e.g. dbpedia-owl:Country).

Matching patterns

These patterns are used to select sets of triples from the RDF database.

  • Match an exact RDF triple: ex:myWidget ex:partNumber "XY24Z1" .
  • Match one variable: ?person foaf:name "Lee Feigenbaum" .
  • Match multiple variables: conf:SemTech2009 ?property ?value .
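The matching rule behind these three patterns can be sketched in a few lines of Python. This is a toy in-memory matcher, not how a real triple store is implemented; the TRIPLES data, the ex:lee subject, and the ex:city and ex:year properties are invented for the example.

```python
def match_pattern(pattern, triples):
    """Return one binding dict per triple that matches the pattern."""
    solutions = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable matches anything, but a repeated variable
                # must be bound to the same value every time.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                # A constant must match exactly.
                break
        else:
            solutions.append(binding)
    return solutions


# Toy data, invented for illustration.
TRIPLES = [
    ("ex:myWidget", "ex:partNumber", '"XY24Z1"'),
    ("ex:lee", "foaf:name", '"Lee Feigenbaum"'),
    ("conf:SemTech2009", "ex:city", '"San Jose"'),
    ("conf:SemTech2009", "ex:year", '"2009"'),
]

print(match_pattern(("ex:myWidget", "ex:partNumber", '"XY24Z1"'), TRIPLES))
print(match_pattern(("?person", "foaf:name", '"Lee Feigenbaum"'), TRIPLES))
print(match_pattern(("conf:SemTech2009", "?property", "?value"), TRIPLES))
```

An exact triple yields a single empty binding (it matched, but there is nothing to report), while each variable adds one column to the result.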

High-level view of the queries

(picture from here)

SELECT defines what you want, and WHERE defines your conditions, restrictions, and filters. For example:

SELECT ?subject ?predicate ?object
WHERE {?subject ?predicate ?object} 
LIMIT 100

The ORDER BY clause can be used to sort the results, and the GROUP BY clause can be used to group them, for example to count how often each predicate occurs:

SELECT ?predicate (COUNT(*) AS ?frequency)
WHERE {?subject ?predicate ?object}
GROUP BY ?predicate
ORDER BY DESC(?frequency)
LIMIT 10
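For readers who think more easily in code, the GROUP BY / ORDER BY DESC / LIMIT combination above corresponds roughly to the following Python sketch over an in-memory list of triples. The data and the function name predicate_frequencies are invented for illustration.

```python
from collections import Counter


def predicate_frequencies(triples, limit):
    """Count triples per predicate and return the most frequent ones."""
    # GROUP BY ?predicate with COUNT(*)
    counts = Counter(predicate for _, predicate, _ in triples)
    # ORDER BY DESC(?frequency) followed by LIMIT
    return counts.most_common(limit)


# Toy data, invented for illustration.
TRIPLES = [
    ("ex:a", "rdf:type", "ex:City"),
    ("ex:b", "rdf:type", "ex:City"),
    ("ex:c", "rdf:type", "ex:Country"),
    ("ex:a", "rdfs:label", '"A"'),
    ("ex:b", "rdfs:label", '"B"'),
    ("ex:a", "ex:country", "ex:c"),
]

print(predicate_frequencies(TRIPLES, 2))
```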

Combining Results

  • Conjunction operator A . B: Join together the results of solving A and B by matching the values of any variables in common.
  • Left join A OPTIONAL { B }: Join together the results of solving A and B by matching the values of any variables in common, if possible. Keep all solutions from A whether or not there's a matching solution in B.
  • Disjunction { A } UNION { B }: Include both the results of solving A and the results of solving B.
  • Subtraction pattern A MINUS { B }: Solve A. Solve B. Include only those results from solving A that are not compatible with any of the results from B.
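The four operators can be sketched in Python by treating each result set as a list of dictionaries that map variable names to values. This is a simplified model for intuition only (no filters, and no duplicate handling beyond what the lists do naturally); the sample data is invented.

```python
def compatible(a, b):
    """Two solutions are compatible if they agree on every shared variable."""
    return all(a[v] == b[v] for v in a.keys() & b.keys())


def join(A, B):
    """A . B: merge every compatible pair of solutions."""
    return [{**a, **b} for a in A for b in B if compatible(a, b)]


def optional(A, B):
    """A OPTIONAL { B }: like join, but keep solutions of A without a match."""
    out = []
    for a in A:
        matches = [{**a, **b} for b in B if compatible(a, b)]
        out.extend(matches or [a])
    return out


def union(A, B):
    """{ A } UNION { B }: all solutions of A followed by all solutions of B."""
    return A + B


def minus(A, B):
    """A MINUS { B }: drop a solution of A if some solution of B shares at
    least one variable with it and agrees on all shared variables."""
    return [
        a for a in A
        if not any((a.keys() & b.keys()) and compatible(a, b) for b in B)
    ]


# Toy data, invented for illustration.
A = [{"?p": "alice", "?school": "Harvard"}, {"?p": "bob", "?school": "UIUC"}]
B = [{"?p": "alice", "?employer": "UIUC"}]

print(join(A, B))
print(optional(A, B))
print(union(A, B))
print(minus(A, B))
```

With this data, join keeps only alice, optional keeps both people (bob without an employer), and minus keeps only bob.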

Examples:

Getting all the people's names

To get all the people with DBPedia:

select * { ?person a dbo:Person }
limit 100

(try here)

And getting people via WikiData:

SELECT ?person WHERE { ?person wdt:P31 wd:Q5 }
limit 100

(try here)

You may wonder how to combine these results into one single call, i.e. call both DBPedia and WikiData at the same time. This is often referred to as "federated querying". In order to do so, we have to use the SERVICE keyword to define two endpoints:

PREFIX wd: <http://www.wikidata.org/entity/> 
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX dbo: <http://dbpedia.org/ontology/> 

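SELECT ?person WHERE { 
  # UNION keeps the results of both endpoints; a plain join on ?person would
  # return nothing, because the two datasets use different URIs for people.
  { SERVICE <http://dbpedia.org/sparql> { ?person a dbo:Person } }
  UNION
  { SERVICE <https://query.wikidata.org/sparql> { ?person wdt:P31 wd:Q5 } }
} LIMIT 100 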

(try here)

Getting all the Named-Entities that are "cats"

SELECT ?item ?itemLabel
WHERE
{
    ?item wdt:P31 wd:Q146 .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}

Knowing that "Saint Louis University" is a "University":

WikiData has an entry for "Saint Louis University" and an entry for "University". Given these entries (i.e. the WikiData ids), one can ask if one is an instanceOf the other. (See the side notes at the end of this document on how to obtain the WikiData ids.)

ASK {
    wd:Q734774 wdt:P31* wd:Q3918
}

(try here)

We can modify this to query everything that is an instanceOf "University" (i.e. a list of universities).

SELECT ?thing 
WHERE {
    ?thing wdt:P31* wd:Q3918
}

(try here)

Or vice versa, get the super-types of "Saint Louis University":

SELECT ?thing 
WHERE {
    wd:Q734774 wdt:P31* ?thing
}

(try here)

which are "University", "Building", "private not-for-profit educational institution".
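The * after wdt:P31 in the queries above is a property path meaning "zero or more hops along this property". As a rough mental model, it behaves like the following breadth-first traversal in Python. The graph below is a toy with invented identifiers, not real WikiData content.

```python
from collections import deque


def reachable(start, edges):
    """Nodes reachable from start via zero or more hops (the start included)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Toy "instance of" edges, invented for illustration.
EDGES = {
    "ex:SomeUniversity": ["ex:University", "ex:Building"],
    "ex:University": ["ex:InstitutionType"],
}

# ASK-style question: is there a path from one node to the other?
print("ex:University" in reachable("ex:SomeUniversity", EDGES))

# SELECT-style question: everything reachable from the start node.
print(sorted(reachable("ex:SomeUniversity", EDGES)))
```

Because the path allows zero hops, the start node itself is always part of the answer.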

City names and the countries they are in

PREFIX dbo: <http://dbpedia.org/ontology/> 
SELECT DISTINCT ?city ?country 
WHERE { ?city rdf:type dbo:City ; 
              rdfs:label ?label ; 
              dbo:country ?country 
}

(try here)

Who is Harry Potter?

Let's say you want to describe Harry Potter.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?type ?superType WHERE 
{ 
  # give me ?type of the resource
  <http://dbpedia.org/resource/Harry_Potter_(character)> rdf:type ?type .

  # give me ?superTypes of ?type
  OPTIONAL {
   ?type rdfs:subClassOf ?superType .
  }
}

(try here)

which would yield results like "human", "person", "fictional character", etc.

Find graduates of Harvard that are working at UIUC

Similar to the previous examples, we find the ids for the properties "employer" and "educated at" and the ids for the entities "UIUC" and "Harvard", and use the conjunction operator ".":

SELECT ?person

WHERE {
    ?person wdt:P69 wd:Q13371 .
    ?person wdt:P108 wd:Q457281 .
}

Now let's say you want to get the labels for each of the results:

SELECT ?person ?personLabel

WHERE {
    ?person wdt:P69 wd:Q13371 .
    ?person wdt:P108 wd:Q457281 .
    SERVICE wikibase:label {
        bd:serviceParam wikibase:language "en" .
    }
}

(try here)

Visualizing results on a map

Let's continue the example of Harvard graduates by extracting their birthplace and its coordinates. Next we can use the editor to visualize the coordinates on a map.

SELECT ?person ?personLabel ?birthPlaceLabel ?coordinates

WHERE {
    ?person wdt:P69 wd:Q13371 .
    ?person wdt:P19 ?birthPlace .
    ?birthPlace wdt:P625 ?coordinates .
    SERVICE wikibase:label {
        bd:serviceParam wikibase:language "en" .
    }
}

(run here)

As you can see, most of the Harvard graduates are from the East Coast of the USA, while western China and central Africa have almost no representatives. Repeating the same thing for UIUC graduates would show that most UIUC graduates come from the Midwest, USA, and mostly from the Chicago suburbs.

Birthplaces of African-American House representatives throughout history

SELECT ?personLabel ?coordinates

WHERE { 
    ?person wdt:P39 wd:Q13218630 . 
    ?person wdt:P172 wd:Q49085 . 
    ?person wdt:P19 ?birthPlace. 
    ?birthPlace wdt:P625 ?coordinates .
  	
    SERVICE wikibase:label {
        bd:serviceParam wikibase:language "en" .
    }
} 

(try here)

Visualizing timeline

The SPARQL editor of WikiData also has the ability to visualize data as a timeline. Here I am visualizing the US presidents according to their date of birth. (try here)

Distribution of members of the House of Representatives by ethnicity

SELECT ?ethLabel (COUNT(*) as  ?count)

WHERE { 
    ?person wdt:P39 wd:Q13218630 . 
    ?person wdt:P172 ?eth . 
  	
    SERVICE wikibase:label {
        bd:serviceParam wikibase:language "en" .
    }
} GROUP BY ?ethLabel

This is slightly misleading: the counts only cover representatives that have an "ethnicity" label, so the largest group in the result (African-Americans) is not actually the majority of the House. To add an extra category for those that do not have an explicit ethnicity, we can wrap that pattern in the OPTIONAL keyword.

SELECT ?ethLabel (COUNT(*) as  ?count)

WHERE { 
    ?person wdt:P39 wd:Q13218630 . 
    OPTIONAL { ?person wdt:P172 ?eth }. 
  	
    SERVICE wikibase:label {
        bd:serviceParam wikibase:language "en" .
    }
} GROUP BY ?ethLabel

which would result in 10806 representatives without an ethnicity label.

SPARQL query as similarity/entailment measure

Finding distance between two nodes

Essentially, this means finding the closest common ancestor of A and B, measured by the total path length from both (idea from here).

DBPedia

SELECT ?a ?b ?super (?aLength + ?bLength as ?length)
{
  values (?a ?b) { (dbo:Person dbo:SportsTeam) }
  { 
    SELECT ?a ?super (COUNT(?mid) as ?aLength) { 
      ?a rdfs:subClassOf* ?mid .
      ?mid rdfs:subClassOf+ ?super .
    }
    GROUP BY ?a ?super
  }
  { 
    SELECT ?b ?super (COUNT(?mid) as ?bLength) { 
      ?b rdfs:subClassOf* ?mid .
      ?mid rdfs:subClassOf+ ?super .
    }
    GROUP BY ?b ?super
  }
}
ORDER BY ?length
LIMIT 1

(try here)
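The idea behind this query can also be sketched outside SPARQL. The Python below runs a breadth-first search upward from each class and picks the common ancestor with the smallest summed hop count. It is an illustration of the idea on an invented class hierarchy, not a reimplementation of the exact counting done by the query above.

```python
from collections import deque


def ancestor_distances(start, parents):
    """Hop count from start to each of its ancestors (start itself is 0)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist


def nearest_common_ancestor(a, b, parents):
    """Return (ancestor, total_hops) minimizing the summed distance, or None."""
    da = ancestor_distances(a, parents)
    db = ancestor_distances(b, parents)
    common = da.keys() & db.keys()
    if not common:
        return None
    # Ties are broken alphabetically so the result is deterministic.
    best = min(common, key=lambda n: (da[n] + db[n], n))
    return best, da[best] + db[best]


# Toy subclass hierarchy (child -> parents), invented for illustration.
PARENTS = {
    "Person": ["Agent"],
    "SportsTeam": ["Organisation"],
    "Organisation": ["Agent"],
    "Agent": ["Thing"],
}

print(nearest_common_ancestor("Person", "SportsTeam", PARENTS))
```

In this toy hierarchy the answer is Agent, one hop above Person and two hops above SportsTeam.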

For WikiData, one can try the RDF GAS API by Blazegraph:

PREFIX gas: <http://www.bigdata.com/rdf/gas#>

SELECT ?super (?aLength + ?bLength as ?length) WHERE {
  SERVICE gas:service {
    gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.SSSP" ;
                gas:in wd:Q5 ;
                gas:traversalDirection "Forward" ;
                gas:out ?super ;
                gas:out1 ?aLength ;
                gas:maxIterations 10 ;
                gas:linkType wdt:P279 .
  }
  SERVICE gas:service {
    gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.SSSP" ;
                gas:in wd:Q349 ;
                gas:traversalDirection "Forward" ;
                gas:out ?super ;
                gas:out1 ?bLength ;
                gas:maxIterations 10 ;
                gas:linkType wdt:P279 .
  }  
} ORDER BY ?length
LIMIT 1

(try here)

(note: you can query this via json)

How to use the results in your code?

There are a bunch of libraries that are intended for this; for example:

But my preferred way of using the results is through the POST/GET APIs provided by many endpoints. For example, here are GET APIs for WikiData and DBPedia which return JSON results:

  • WikiData: https://query.wikidata.org/sparql?format=json&query=PUT-YOUR-QUERY-HERE for example this.
  • DBPedia: https://dbpedia.org/sparql?format=json&default-graph-uri=http://dbpedia.org&query=PUT-YOUR-QUERY-HERE for example this.
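As a minimal sketch of this approach in Python, the snippet below builds such a GET URL with proper escaping and pulls the values out of a JSON response. The helper names are made up, and SAMPLE_RESPONSE is a hand-written stand-in in the standard SPARQL JSON results layout (head/vars and results/bindings), not real endpoint output; fetching the URL is left out so the example runs offline.

```python
import json
from urllib.parse import urlencode

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query_url(endpoint, query):
    """Build a GET URL; urlencode escapes spaces, '?', braces, etc."""
    return endpoint + "?" + urlencode({"format": "json", "query": query})


def extract_values(results_json, variable):
    """Collect the values bound to one variable in a JSON result document."""
    data = json.loads(results_json)
    return [
        row[variable]["value"]
        for row in data["results"]["bindings"]
        if variable in row  # unbound (OPTIONAL) variables are simply absent
    ]


QUERY = "SELECT ?person WHERE { ?person wdt:P31 wd:Q5 } LIMIT 2"

# Hand-written sample with placeholder URIs, not real endpoint output.
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["person"]},
  "results": {"bindings": [
    {"person": {"type": "uri", "value": "http://example.org/entity/1"}},
    {"person": {"type": "uri", "value": "http://example.org/entity/2"}}
  ]}
}
"""

print(build_query_url(WIKIDATA_ENDPOINT, QUERY))
print(extract_values(SAMPLE_RESPONSE, "person"))
```

To run it against a live endpoint, the URL can be fetched with any HTTP client and the response body passed to extract_values.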

Side notes

The university president, John Jenkins, described his hope that Notre Dame would become "one of the pre–eminent research institutions in the world" in his inaugural address.

If I use only "Notre Dame", it gives me the id of the disambiguation page, while using the right Wikipedia page "University_of_Notre_Dame" gives me the correct id.

Further reading

Typos, Comments, Suggestions?

Send a Pull-Request, or report in the issues! :)
