The Bio-Virtuoso docker container package

The virtuoso-goloso container

Purpose

Docker containers for easy deployment of the Virtuoso database engine, preloaded with multiple biological databases expressed in RDF.

The virtuoso-goloso (gluttonous virtuoso) container runs an instance of Virtuoso. A Sinatra application receives Turtle, RDF/XML, or OWL files via HTTP POST and loads them into Virtuoso quickly using the isql command.

Dataset-feeding containers download data from their sources, convert them into RDF if necessary, and send them to virtuoso-goloso. You can combine multiple feeding containers.

Clone Bio-Virtuoso

To use the sample shell scripts, run

$ git clone https://github.com/misshie/bio-virtuoso.git
$ cd bio-virtuoso

Start a docker container

The misshie/virtuoso-goloso container is available on Docker Hub at https://hub.docker.com/r/misshie/virtuoso-goloso/.

The following sample script starts virtuoso-goloso. See also ./start-virtuoso-goloso.sh and ./start-virtuoso-goloso-largemem.sh, both of which are run with sudo.

#!/bin/bash
docker stop virtuoso-goloso
docker rm virtuoso-goloso
docker run \
    -i -t \
    -p 1111:1111 \
    -p 8890:8890 \
    -p 4567:4567 \
    --name virtuoso-goloso \
    -e MaxQueryExecutionTime="21600" \
    -e NumberOfBuffers="85000" \
    -e MaxDirtyBuffers="65000" \
    -e SQL_PREFETCH_ROWS="10000" \
    -e SQL_PREFETCH_BYTES="160000" \
    misshie/virtuoso-goloso

virtuoso-goloso supports the following environment variables, passed with the '-e' option:

environment variable	default value	comment
MaxQueryCostEstimationTime	undefined
MaxQueryExecutionTime	21600	6 hours
NumberOfBuffers	85000	4000000 is good for machines with 48 GB of RAM
MaxDirtyBuffers	65000	3000000 is good for machines with 48 GB of RAM
SQL_PREFETCH_ROWS	10000
SQL_PREFETCH_BYTES	160000
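
As a rough sizing aid, the two buffer settings can be scaled up from their defaults. The sketch below assumes that the defaults (85000 and 65000) suit about 1 GB of RAM given to Virtuoso and that scaling them linearly is acceptable; suggest_buffers is a hypothetical helper, not part of Bio-Virtuoso. For 48 GB it yields values close to those in the table above.

```shell
#!/bin/bash
# Hypothetical helper: scale the default buffer settings linearly with RAM.
# Assumption: the defaults (85000 / 65000) suit roughly 1 GB of RAM for Virtuoso.
suggest_buffers() {
    local ram_gb="$1"
    echo "-e NumberOfBuffers=\"$((ram_gb * 85000))\" -e MaxDirtyBuffers=\"$((ram_gb * 65000))\""
}

# Print the '-e' options for a machine with 48 GB of RAM.
suggest_buffers 48
```

Treat the result as a starting point and adjust it to the memory that is actually free on the host.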

Dataset-feeding docker containers

Build a container

You have to build the dataset-feeding containers yourself to ensure the datasets are up to date. This step does not download any datasets. Run sudo ./containers/<FEEDING_CONTAINER>/build.sh. The following example builds the bio-virtuoso-hpo container.

$ cd containers/bio-virtuoso-hpo
$ sudo docker build -t misshie/bio-virtuoso-hpo .
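
To build several feeding containers in one go, the same two steps can be repeated per directory. The sketch below only prints the commands. image_tag_for and print_build_commands are hypothetical helpers, and the containers/bio-virtuoso-go directory name is an assumption based on the misshie/bio-virtuoso-* naming pattern.

```shell
#!/bin/bash
# Hypothetical helper: derive the image tag from a container directory,
# following the containers/bio-virtuoso-hpo -> misshie/bio-virtuoso-hpo pattern.
image_tag_for() {
    echo "misshie/$(basename "$1")"
}

# Print the build command for each directory given; nothing is executed.
print_build_commands() {
    local dir
    for dir in "$@"; do
        echo "(cd ${dir} && sudo docker build -t $(image_tag_for "${dir}") .)"
    done
}

# The second directory name is assumed, not confirmed by this README.
print_build_commands containers/bio-virtuoso-hpo containers/bio-virtuoso-go
```

Review the printed lines before running them, for example by pasting them into a shell.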

Run a dataset-feeding container

Run sudo ./containers/<FEEDING_CONTAINER>/feed.sh. Feeding bigger datasets may require more free RAM and larger values of NumberOfBuffers and MaxDirtyBuffers. The time needed to download datasets and convert them to RDF varies. The following command line runs the bio-virtuoso-hpo dataset-feeding container:

$ sudo docker run -it --link virtuoso-goloso:virtuoso-goloso misshie/bio-virtuoso-hpo

These containers exit after uploading their datasets to virtuoso-goloso. To inspect a downloaded dataset, try sudo ./feed.sh /bin/bash and look at the files under /opt/bio-virtuoso.
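
Because each feeding container exits when it is done, several datasets can be fed one after another. The following is a minimal sketch, assuming the images have already been built and follow the misshie/bio-virtuoso-* naming; feed_all is a hypothetical helper. Its first argument is the command placed in front of docker, so echo gives a dry run and sudo runs the containers for real.

```shell
#!/bin/bash
# Hypothetical helper: feed several datasets in turn.
# $1 is the command put in front of "docker" (echo = dry run, sudo = real run);
# the remaining arguments are container names from the tables in this README.
feed_all() {
    local runner="$1"
    shift
    local name
    for name in "$@"; do
        "${runner}" docker run -it --link virtuoso-goloso:virtuoso-goloso \
            "misshie/bio-virtuoso-${name}" || return 1
    done
}

# Dry run: print the commands for the hpo and go containers.
feed_all echo hpo go
```

The loop stops at the first container that fails, so a broken download does not go unnoticed.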

List of dataset-feeding containers (misshie/bio-virtuoso-*)

container	graph URL	description
hpo	http://purl.obolibrary.org/obo/hp.owl	Human Phenotype Ontology (HPO)
hpo-annotation-monarch	http://data.monarchinitiative.org/ttl/hpoa.ttl	HPO annotations RDFied by Monarch Initiative
	http://data.monarchinitiative.org/ttl/hpoa_dataset.ttl	HPO annotations dataset description
omim-monarch	http://data.monarchinitiative.org/ttl/omim.ttl	OMIM data RDFied by Monarch Initiative
	http://data.monarchinitiative.org/ttl/omim_dataset.ttl	OMIM dataset description
orphanet-monarch	http://data.monarchinitiative.org/ttl/orphanet.ttl	Orphanet data RDFied by Monarch Initiative
	http://data.monarchinitiative.org/ttl/orphanet_dataset.ttl	Orphanet dataset description
hgnc-monarch	http://data.monarchinitiative.org/ttl/hgnc.ttl	HUGO Gene Nomenclature Committee (HGNC) data RDFied by Monarch Initiative
	http://data.monarchinitiative.org/ttl/hgnc_dataset.ttl	HGNC dataset description
go	http://purl.obolibrary.org/obo/go.owl	Gene Ontology (GO)
omim-gendoo-ja	http://misshie.jp/rdf/omim2ja.ttl	Gendoo's ja_JP translation of OMIM entries. See also http://gendoo.dbcls.jp/, developed by Takeru Nakazato
mp-jax	http://purl.obolibrary.org/obo/mp.owl	Mammalian Phenotype ontology (MP) of Jax

List of dataset-feeding containers using manually downloaded files

These containers are designed for non-redistributable or proprietary datasets. Edit feed.sh to point to the directory containing the downloaded files.

container	graph URL	description
omim-omimorg	http://misshie.jp/rdf/omim/mim2gene.ttl	see http://omim.org/downloads
	http://misshie.jp/rdf/omim/mimTitles.ttl
	http://misshie.jp/rdf/omim/genemap.ttl
	http://misshie.jp/rdf/omim/morbidmap.ttl
	http://misshie.jp/rdf/omim/genemap2.ttl

Access the SPARQL endpoint

You can access Virtuoso at http://localhost:8890/. The SPARQL endpoint is at http://localhost:8890/sparql. You may need to open port 8890 to allow access to the SPARQL endpoint from the Internet. For instance, on Ubuntu 14.04 LTS, run sudo ufw allow 8890/tcp.

Simple SPARQL sample

Show graphs fed by dataset-feeding containers

SELECT DISTINCT ?g WHERE {GRAPH ?g {?s ?p ?o}}

Accessing the SPARQL endpoint from the command-line

#!/bin/bash
url="http://localhost:8890/sparql"
format="text/tab-separated-values"
#format="text/turtle"

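# Quote the here-document delimiter so the shell leaves the query text untouched.
query=$(cat <<'EOF'
SELECT DISTINCT ?property
FROM <http://purl.obolibrary.org/obo/hp.owl>
WHERE { ?s ?property ?o . }
LIMIT 20
EOF
)
# --form-string sends each value literally, so no eval is needed.
curl --form-string "format=${format}" --form-string "query=${query}" "${url}"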
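
For repeated queries, the curl call can be wrapped in a small function. The sketch below is one possible shape, not part of Bio-Virtuoso: sparql and the SPARQL_URL variable are hypothetical names. It uses URL-encoded POST data instead of a multipart form, on the assumption that the endpoint accepts both. The sample query counts the triples in each graph.

```shell
#!/bin/bash
# Hypothetical helper: send a SPARQL query to the endpoint and print the result.
# --data-urlencode lets curl encode spaces and newlines in the query text.
sparql() {
    local query="$1"
    local format="${2:-text/tab-separated-values}"
    curl --silent --show-error \
         --data-urlencode "format=${format}" \
         --data-urlencode "query=${query}" \
         "${SPARQL_URL:-http://localhost:8890/sparql}"
}

# Count the triples in each graph, largest first.
count_query='SELECT ?g (COUNT(*) AS ?triples)
WHERE { GRAPH ?g { ?s ?p ?o } }
GROUP BY ?g
ORDER BY DESC(?triples)'

# Usage (needs a running virtuoso-goloso):
#   sparql "${count_query}"
```

Set SPARQL_URL to query a Virtuoso instance on another host.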

Inside dataset-feeding containers

Dataset-feeding containers feed RDF files to virtuoso-goloso in the following ways.

For RDF/XML files:

#!/bin/bash
url="http://localhost:4567/rdfxml"
file="rdfxml.rdf"
graph="http://misshie.jp/rdf/test-rdfxml"
curl \
     -X POST \
     -F graph=${graph} \
     -F file=@${file} \
     ${url}

For Turtle/N3 files:

#!/bin/bash
url="http://localhost:4567/turtle"
graph="http://misshie.jp/rdf/test-turtle"
file="turtle.ttl"
curl \
     -X POST \
     -F graph=${graph} \
     -F file=@${file} \
     ${url}

For N-Quads files:

#!/bin/bash
url="http://localhost:4567/n-quad"
file="N-Quad.nq"
curl \
     -X POST \
     -F file=@${file} \
     ${url}
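
The three upload paths above differ only in the file format, so a script that feeds mixed files can choose the path from the file extension. This is a minimal sketch: endpoint_for is a hypothetical helper, and the mapping of .owl and .xml files to /rdfxml is an assumption (it holds when the OWL file is serialized as RDF/XML).

```shell
#!/bin/bash
# Hypothetical helper: choose the virtuoso-goloso upload path from a file extension.
# Assumption: .owl and .xml files are serialized as RDF/XML.
endpoint_for() {
    case "$1" in
        *.ttl|*.n3)        echo "/turtle" ;;
        *.rdf|*.xml|*.owl) echo "/rdfxml" ;;
        *.nq)              echo "/n-quad" ;;
        *) echo "unsupported file type: $1" >&2; return 1 ;;
    esac
}

endpoint_for hp.owl      # prints /rdfxml
endpoint_for omim.ttl    # prints /turtle
```

The returned path can be appended to http://localhost:4567 to form the url variable used in the scripts above.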

License

hmishima at nagasaki-u.ac.jp, twitter:@mishima_eng (en_US), @mishimahryk (ja_JP)

License: The MIT License. See LICENSE.txt for further details.
