Yakonick / CFPQ_Data

Graphs and grammars for Context-Free Path Querying algorithms evaluation.

Home Page:https://jetbrains-research.github.io/CFPQ_Data/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

CFPQ_Data

Graphs and grammars for experimental analysis of context-free path querying algorithms.

Prerequirements

  • GCC
  • Python 3

How to start

Just install requirements and run init.py:

pip3 install -r requirements.txt
python3 init.py

This script downloads data (real-world RDF files) and GTgraph, a suite of graph generators. After that, some synthetic graphs are generated with it.

In order to download/update one specific part of the dataset run:

python3 init.py --update [GroupName]

Options for [GroupName] are rdf, scalefree, full, worstcase, sparse, memoryaliases

Integration with graph DBs

We provide a set of scripts to simplify data loading into some popular graph databases.

RedisGraph

The dataset can be loaded to RedisGraph with tools/redis-rdf, for example:

cd ./tools/redis-rdf
python3 main.py --port [PORT] dir ../../data/Matrices/ScaleFree/
python3 main.py --port [PORT] file ../../data/Matrices/RDF/foaf.rdf foaf

Neo4j

To load RDF data into Neo4j database one can use Neosemantix plugin for Neo4j.

Dataset

Set contains both real-world data and synthetic graphs for several specific cases. All graphs are represented in RDF format to unify loading process.

Query format

Queries are represented as context-free grammars in the following format:

  • Line 0: a set of variable symbols delimited by spaces, the first one is the starting symbol

  • Line 1: a set of terminal symbols delimited by spaces

  • The rest of the lines are productions in the form:

    head -> body | body | ... | body

    where each body can contain basic regular expression, allowed operators:

    • The concatenation, the default operator, which can by represented either by a space or a dot (.)
    • The union, represented by |
    • The ? quantifier
    • The kleene star, represented by *

    Epsilon symbol should be represented by eps

Example query in such format:

S
a b
S -> a S b S
S -> eps

Grammar can be converted to CNF with tools/gramar_to_cnf, which uses pyformlang library to perform context-free grammar modifications.

Dataset structure

Graphs and grammars can be found in data — all graphs are divided into groups, which are placed in different directories. Each data/[GroupName]/Matrices contains graph descriptions and data/[GroupName]/Grammars — descriptions of queries.

RDF — fixed versions of real-world RDF files (links are provided for updating purposes only!):

Grammars list contains the following variants of same-generation query over different relation tipes.

MemoryAliases — real-world data for points-to analysis of C code.

Both grammars g1 and g2 specify the same language which is described in related papers. These two grammars were written in a different way in order to investigate dependencies on query specification format.

WorstCase — graphs with two cylces; the query Brackets is a grammar for the language of correct bracket sequences.

SparseGraph — graphs generated with GTgraph to emulate sparse data. The grammar provided is a variant of the same-generation query.

ScaleFree — graphs generated with GTgraph by using the Barab'asi-Albert model of scale-free networks. Use with grammar an_bm_cm_dn, which is a query for AnBmCmDn language.

FullGraph — cycle graphs, all edges are labeled with the same token. Use with A_star queries, which produce full graph on that dataset.

Reference values

Reference values for algorithms correctnes checking, which can be found in reference_values.csv, are in the following format: graph, grammar, control_sum, where control_sum is the number of paths generated by non-terminals of the grammars.

Example of the FullGraph dataset reference values in such format:

fullgraph_10,A_star0,"{'s': 100}"
fullgraph_100,A_star0,"{'s': 10000}"
...

Papers on CFPQ

List of CFPQ-related works. The list is not full, work in progress.

Graph databases

  • M. Yannakakis. "Graph-theoretic methods in database theory".
  • P. Sevon and L. Eronen. "Subgraph queries by context-free grammars".
  • J. Kuijpers, G. Fletcher, N. Yakovets, and T. Lindaaker. "An experimental study of context-free path query evaluation methods".
  • H. Miao and A. Deshpande. "Understanding Data Science Lifecycle Provenance via Graph Segmentation and Summarization".
  • TBD

Static code analysis

  • T. Reps. "Program analysis via graph reachability".
  • X. Zheng and R. Rugina. "Demand-driven alias analysis for C".
  • TBD

Papers using this dataset

About

Graphs and grammars for Context-Free Path Querying algorithms evaluation.

https://jetbrains-research.github.io/CFPQ_Data/

License:Apache License 2.0


Languages

Language:Python 99.2%Language:Shell 0.8%