atzori / lodcc

This repository contains the code to prepare and analyze Linked Open Data (LOD) datasets for graph-based analysis.

Install dependencies

$ cd lodcc/
$ pip install -r requirements.txt
$ sudo apt-get install dtrx raptor2-utils

Commands

Tests

  • Small dataset

$ python lodcc.py --parse-resource-urls --use-datasets museums-in-italy --log-level-debug

  • Larger dataset

$ python lodcc.py --parse-resource-urls --use-datasets pokepedia-fr --log-level-debug

  • Multithreaded processing

$ python lodcc.py --parse-resource-urls --use-datasets museums-in-italy pokepedia-fr --threads 2
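
As a rough illustration of what processing two datasets with two threads means, the sketch below spreads dataset names over a small thread pool. It is not taken from lodcc.py: `process_dataset` and `process_all` are hypothetical names, and the placeholder body stands in for the real per-dataset work.

```python
from concurrent.futures import ThreadPoolExecutor


def process_dataset(name):
    # Hypothetical placeholder: the real per-dataset work is done by lodcc.py.
    return f"processed {name}"


def process_all(datasets, threads=2):
    # Each dataset is handed to one of `threads` worker threads.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() returns the results in the same order as the input names.
        return list(pool.map(process_dataset, datasets))
```

Calling `process_all(["museums-in-italy", "pokepedia-fr"], threads=2)` would then handle both datasets concurrently, one per worker.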

File system

  • Determine file sizes of all dumps
$ find dumps/ -type f -exec ls -s --block-size=M {} \; > dumps-sizes.txt
$ cat dumps-sizes.txt | sed -e '/edgelist/! s/^.*$/###/' -e '/^###/D' | sort -h -r | less
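
The same overview can be produced from Python, which may be handy where GNU `find`, `sed`, and `sort -h` are not available. This is a minimal sketch, not part of the repository: the function name `edgelist_sizes` is an assumption, and it reports the apparent file size in MiB, whereas `ls -s` reports allocated blocks, so the numbers can differ slightly.

```python
import os


def edgelist_sizes(root="dumps", marker="edgelist"):
    """Return (size_in_MiB, path) pairs for files whose path contains
    `marker`, sorted from largest to smallest."""
    results = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Like the sed filter above, keep only entries mentioning "edgelist".
            if marker in path:
                size_mib = os.path.getsize(path) / (1024 * 1024)
                results.append((size_mib, path))
    # Largest files first, comparable to `sort -h -r`.
    results.sort(reverse=True)
    return results
```

Iterating over `edgelist_sizes()` and printing each pair gives a listing similar to the one viewed with `less` above.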

DBpedia

  • Download the directory listing, which contains the filenames of all datasets, into the file dbpedia-link.txt
$ curl -L http://downloads.dbpedia.org/2016-10/core-i18n/en/ -o dbpedia-link.txt
  • Filter out unneeded datasets, prefix each remaining filename with the download URL, and write the result to dbpedia-links.txt
$ cat dbpedia-link.txt | cut -d '"' -f2 | egrep -i "ttl" | egrep -i -v "wkd|sorted|nested" | sed 's#^\(.*\)#http://downloads.dbpedia.org/2016-10/core-i18n/en/\1#' | sed -n '2,60p' > dbpedia-links.txt
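
The filter pipeline above is dense, so here is a step-by-step Python sketch of the same logic. It is an illustration rather than code from this repository; the name `filter_links` and its parameters are assumptions. It takes the text between the first pair of double quotes on each line, keeps names containing "ttl", drops names containing "wkd", "sorted", or "nested" (all case-insensitive), prefixes the base URL, and keeps entries 2 to 60 of the result.

```python
BASE_URL = "http://downloads.dbpedia.org/2016-10/core-i18n/en/"


def filter_links(listing_text, base_url=BASE_URL, first=2, last=60):
    links = []
    for line in listing_text.splitlines():
        fields = line.split('"')
        # Like `cut -d '"' -f2`: second field, or the whole line if no quote.
        name = fields[1] if len(fields) > 1 else line
        lowered = name.lower()
        if "ttl" not in lowered:
            continue  # egrep -i "ttl"
        if any(word in lowered for word in ("wkd", "sorted", "nested")):
            continue  # egrep -i -v "wkd|sorted|nested"
        links.append(base_url + name)
    # Like `sed -n '2,60p'`: keep entries `first` through `last` (1-based).
    return links[first - 1:last]
```

To reproduce the shell result, read dbpedia-link.txt, pass its content to `filter_links`, and write the returned URLs to dbpedia-links.txt, one per line.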
