Ontario: Ontology-based Architecture for Semantic Data Lakes

Ontario is a Semantic Data Lake capable of storing and querying heterogeneous data (e.g., CSV, JSON, RDF) in its original format. Ontario uses the RDF molecules approach as a logical representation of the heterogeneous data. The MULDER federated query engine leverages RDF molecule metadata to efficiently perform query decomposition, source selection, query planning, and query execution.

Setting up a single-container Ontario

One can test Ontario on small datasets using a self-contained Ontario container. The self-contained image includes:

  • MongoDB 3.4
  • Spark 2.1.1
  • Ontario endpoint: http://youraddress:5001/sparql

To test on your local machine, do the following:

  1. Pull Ontario from Docker Hub:
 docker pull kemele/ontario:0.1-spark-2.1.1-hadoop2.7-mongodb_3.4
  2. Run Ontario:

Use the sample data (BSBM Person data): the image contains a sample person.csv in /datasets and a person collection in the bsbm100 database in MongoDB. To run with the sample data:

 docker run -d --name ontario-demo -p 5001:5000 -p 27017:27017 kemele/ontario:0.1-spark-2.1.1-hadoop2.7-mongodb_3.4
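
To check that the container came up correctly (ontario-demo is the container name used in the run command above):

 docker ps --filter name=ontario-demo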

To use your own data:

  • To add raw files, do either of the following:
    • use docker cp to copy your CSV/JSON files into the container:
        docker cp /path/to/yourfile.csv ontario-demo:/datasets/
      
    • mount your data folder to /datasets as:
       -v /path/to/csv/json/filesfolder:/datasets
      
    • use mongoimport to load data into MongoDB, as in the example below:
       docker exec -it ontario-demo mongoimport --type [csv|json] [--headerline] --db [yourdatabase] --collection [collectionname] --file [path-to-csv-or-json-file]
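      For example, to load the sample person.csv into the person collection of the bsbm100 database (this assumes the sample data layout described above):
       docker exec -it ontario-demo mongoimport --type csv --headerline --db bsbm100 --collection person --file /datasets/person.csv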
      
  • Create RDF molecule templates for your dataset. An RDF molecule template file contains the following elements:
    • rootType: RDF type (rdf:type) or an arbitrary name for the molecule
    • predicates: list of predicates with their ranges (if available)
    • linkedTo: list of range values (if available in the predicates element)
    • wrappers: list of wrappers that provide a certain set of predicates of this RDF molecule template. Example: person-template.json
     {
     "rootType": "http://xmlns.com/foaf/0.1/Person",
     "linkedTo": [],
     "predicates": [ { "predicate": "http://xmlns.com/foaf/0.1/mbox_sh1sum", "range": [] },
                    { "predicate": "http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/country", "range": [] },
                    { "predicate": "http://purl.org/dc/elements/1.1/date", "range": [] },
                    { "predicate": "http://purl.org/dc/elements/1.1/publisher", "range": [] },
                    { "predicate": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "range": [] }
                  ],
     "wrappers": [
           {
            "url": "localhost:27017",
            "urlparam": "",
            "wrapperType": "MongoDB",
            "predicates": [ "http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/country" ]
           },
           {
            "url": "local[*]",
            "urlparam": "",
            "wrapperType": "SPARKCSV",
            "predicates": [
                 "http://xmlns.com/foaf/0.1/mbox_sh1sum",
                 "http://purl.org/dc/elements/1.1/date",
                 "http://purl.org/dc/elements/1.1/publisher",
                 "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
             ]
           }
       ]
    }
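    In this template, the MongoDB wrapper serves bsbm:country while the SPARKCSV wrapper serves the remaining predicates, so a single query over foaf:Person can be answered by combining both sources.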
  • Create an RML mapping for each CSV or JSON file or MongoDB collection. Example: sparkcsvmapping.ttl
    @prefix rr:   <http://www.w3.org/ns/r2rml#>.
    @prefix rml:  <http://semweb.mmlab.be/ns/rml#>.
    @prefix ql:   <http://semweb.mmlab.be/ns/ql#>.
    @prefix bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix dc:   <http://purl.org/dc/elements/1.1/> .
    @prefix rev:  <http://purl.org/stuff/rev#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
    @prefix base: <http://eis.iai.uni-bonn.de/ontario/mapping#> .
    
    #PERSON mappings
    <#PersonMappings>
    rml:logicalSource [
      rml:source "file:///datasets/person.csv" ;
      rml:referenceFormulation ql:CSV
    ];
    rr:subjectMap [
      rr:template "{person}";
      rr:class foaf:Person
    ];
    
    rr:predicateObjectMap [
      rr:predicate dc:date;
      rr:objectMap [
        rml:reference "date";
        rr:datatype xsd:date
        ]
      ];
    
    rr:predicateObjectMap [
      rr:predicate foaf:mbox_sha1sum;
      rr:objectMap [
        rml:reference "mbox_sha1sum";
        rr:datatype xsd:string
      ]
    ];
    
    rr:predicateObjectMap [
      rr:predicate dc:publisher ;
      rr:objectMap [
        rml:reference "publisher";
        rr:datatype xsd:anyURI
      ]
    ];
    rr:predicateObjectMap [
          rr:predicate rdf:type ;
          rr:objectMap [
            rml:reference "type";
            rr:datatype xsd:anyURI
          ]
     ].  
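
    The rml:reference values (person, date, mbox_sha1sum, publisher, type) name columns of person.csv; they should line up with the predicates declared for the SPARKCSV wrapper in the molecule template above.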
    
  • Create a configuration file: the configuration file points to the templates and mappings. In addition, you can pass tuning parameters to the Spark context based on your system capacity. Example: config.json
{
"MoleculeTemplates": [
  {
    "type": "filepath",
    "path": "/ontario/templates/person-template.json"
  }
],
"WrappersConfig": {
  "MappingFolder": "/ontario/mappings",
  "MongoDB": {
    "type": "MongoDB",
    "url": "localhost:27017",
    "mappingfile": "mongodbmapping.ttl",
    "params": {
    }
  },
  "SPARKCSV": {
    "type": "SPARK",
    "url": "local[*]",
    "mappingfile": "sparkcsvmapping.ttl",
    "params": {
      "spark.driver.cores": "4",
      "spark.executor.cores": "4",
      "spark.cores.max": "4",
      "spark.default.parallelism": "4",
      "spark.executor.memory": "4g",
      "spark.driver.memory": "4g",
      "spark.driver.maxResultSize": "1g"
    }
  },
  "SPARKJSON": {
    "type": "SPARK",
    "url": "local[*]",
    "mappingfile": "sparkjsonmapping.ttl",
    "params": {
      "spark.driver.cores": "4",
      "spark.executor.cores": "4",
      "spark.cores.max": "4",
      "spark.default.parallelism": "4",
      "spark.executor.memory": "4g",
      "spark.driver.memory": "4g",
      "spark.driver.maxResultSize": "1g"
    }
  }
 }
}
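
A malformed config.json will only surface as an error at container startup, so it can help to validate the JSON syntax before mounting it (a quick check using Python's standard library; adjust the path to your file):

 python -m json.tool /path/to/config.json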

Then, run the following with -v options pointing to the above files:

 docker run -d --name ontario-demo -v /path/to/csv/or/json/filesfolder:/datasets -v /path/to/config.json:/ontario/config/config.json -v /path/to/templatesfolder:/ontario/templates -v /path/to/mappingsfolder:/ontario/mappings  -p 5001:5000 -p 27017:27017 kemele/ontario:0.1-spark-2.1.1-hadoop2.7-mongodb_3.4

Check the status of the MongoDB and Ontario services:

 docker logs -f ontario-demo
  3. Run queries:
  • Use curl:
    curl -G --data-urlencode "query=select ?person where {?person a <http://xmlns.com/foaf/0.1/Person>} limit 10" http://0.0.0.0:5001/sparql
    
  • Use Python code (Python 3; the endpoint returns a JSON document with the result rows and some statistics):
    import json
    import urllib.parse
    import http.client

    query = """
          PREFIX foaf: <http://xmlns.com/foaf/0.1/>
          PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
          PREFIX dc: <http://purl.org/dc/elements/1.1/>

          SELECT DISTINCT ?person ?mbox ?country ?publisher
          WHERE {
              ?person a foaf:Person .
              ?person dc:publisher ?publisher .
              ?person bsbm:country ?country .
              ?person foaf:mbox_sha1sum ?mbox
          } LIMIT 10
          """
    # URL-encode the query and send it to the Ontario SPARQL endpoint
    params = urllib.parse.urlencode({'query': query})
    headers = {"Accept": "*/*"}
    conn = http.client.HTTPConnection('0.0.0.0:5001')
    conn.request("GET", "/sparql?" + params, None, headers)
    response = conn.getresponse()
    if response.status == http.client.OK:
        res = json.loads(response.read())
        print("results", res['result'])
        print("execTime", res['execTime'])
        print("totalRows", res['totalRows'])
        print("firstResult", res['firstResult'])
Setting up an Ontario cluster using docker-compose

(Coming soon ...)
