RapidsAtHKUST / SubgraphMatching

In-Memory Subgraph Matching: An In-depth Study by Dr. Shixuan Sun and Prof. Qiong Luo

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

SubgraphMatching

Introduction

We study the performance of eight representative in-memory subgraph matching (subgraph query) algorithms. Specifically, we put QuickSI, GraphQL, CFL, CECI, DP-iso, RI and VF2++ in a common framework to compare them on the following four aspects: (1) method of filtering candidate vertices in the data graph; (2) method of ordering query vertices; (3) method of enumer- ating partial results; and (4) other optimization techniques. Then, we compare the overall performance of these algo- rithms with Glasgow, an algorithm based on the constraint programming. Through experiments, we find that (1) the fil- tering method of GraphQL is competitive to that of the latest algorithms CFL, CECI and DP-iso in terms of pruning power; (2) the ordering methods in GraphQL and RI are usually the most effective; (3) the set intersection based local candidate computation in CECI and DP-iso performs the best in the enumeration; and (4) the failing sets pruning in DP-iso can significantly improve the performance when queries become large. Based on these new results, we recommend users to adopt specific techniques depending on the data graph den- sity and query size.

For the details, please refer to our SIGMOD'2020 paper "In-Memory Subgraph Matching: an In-depth Study" by Dr. Shixuan Sun and Prof. Qiong Luo. If you have any further questions, please feel free to contact us.

Please cite our paper, if you use our source code.

  • "Shixuan Sun and Qiong Luo. In-Memory Subgraph Matching: an In-depth Study. SIGMOD 2020."

Compile

Under the root directory of the project, execute the following commands to compile the source code.

mkdir build
cd build
cmake ..
make

Test

Execute the following commands to test the correctness of the binary file.

cd test
python test.py ../build/matching/SubgraphMatching.out

Execute

After compiling the source code, you can find the binary file 'SubgraphMatching.out' under the 'build/matching' directory. Execute the binary with the following command ./SubgraphMatching.out -d data_graphs -q query_graphs -filter method_of_filtering_candidate_vertices -order method_of_ordering_query_vertices -engine method_of_enumerating_partial_results -num number_of_embeddings, in which -d specifies the input of the data graphs and -q specifies the input of the query graphs. The -filter parameter gives the filtering method, the -order specifies the ordering method, and the -engine sets the enumeration method. The -num parameter sets the maximum number of embeddings that you would like to find. If the number of embeddings enumerated reaches the limit or all the results have been found, then the program will terminate. Set -num as 'MAX' to find all results.

Example (Use the filtering and ordering methods of GraphQL to generate the candidate vertex sets and the matching order respectively. Enumerate results with the set-intersection based local candidate computation method):

./SubgraphMatching.out -d ../../test/sample_dataset/test_case_1.graph -q ../../test/sample_dataset/query1_positive.graph -filter GQL -order GQL -engine LFTJ -num MAX

Input

Both the input query graph and data graph are vertex-labeled. Each graph starts with 't N M' where N is the number of vertices and M is the number of edges. A vertex and an edge are formatted as 'v VertexID LabelId Degree' and 'e VertexId VertexId' respectively. Note that we require that the vertex id is started from 0 and the range is [0,N - 1] where V is the vertex set. The following is an input sample. You can also find sample data sets and query sets under the test folder.

Example:

t 5 6
v 0 0 2
v 1 1 3
v 2 2 3
v 3 1 2
v 4 2 2
e 0 1
e 0 2
e 1 2
e 1 3
e 2 4
e 3 4

Techniques Supported

The filtering methods that generate candidate vertex sets. Note that (1) every filtering method enables LDF by default; and (2) We optimize the filtering method of TurboIso with the technique of CFL.

Parameter of Command Line (-filter) Description
LDF the label and degree filter
NLF the neighborhood label frequency filter
GQL the filtering method of GraphQL
TSO the filtering method of TurboIso
CFL the filtering method of CFL
DPiso the filtering method of DP-iso
CECI the filtering method of CECI

The ordering methods that generate matching order.

Parameter of Command Line (-order) Description
QSI the ordering method of QuickSI
GQL the ordering method of GraphQL
TSO the ordering method of TurboIso
CFL the ordering method of CFL
DPiso the ordering method of DP-iso
CECI the ordering method of CECI
RI the ordering method of RI
VF2++ the ordering method of VF2++

The enumeration methods that find all results.

Parameter of Command Line (-engine) Description
QSI the enumeration method of QuickSI and RI (Algorithm 2)
GQL the enumeration method of GraphQL (Algorithm 3)
EXPLORE the enumeration method of CFL (Algorithm 4)
DPiso the enumeration method of DP-iso (Algorithm 5)
CECI the enumeration method of CECI (Algorithm 5)
LFTJ the set-intersection based local candidates computation

We can execute a query by specifying the filtering method, the ordering method, and the enumeration method. For example, we evaluate the query with the GraphQL algorithm in the following command.

./SubgraphMatching.out -d ../../test/sample_dataset/test_case_1.graph -q ../../test/sample_dataset/query1_positive.graph -filter GQL -order GQL -engine GQL -num MAX

Apart from evaluating queries with a specific algorithm, we can also execute queries by integrating techniques from different algorithms. For example, we generate the candidates with the label and degree filter, obtain the matching order with the ordering method of CFL, and finally enumerate the results with the set-intersection based local candidates computation method. Note that the source code cannot support the filtering/ordering/enumeration orders of CECI to integrate with other algorithms.

./SubgraphMatching.out -d ../../test/sample_dataset/test_case_1.graph -q ../../test/sample_dataset/query1_positive.graph -filter LDF -order CFL -engine LFTJ -num MAX

In addition to setting the filtering, ordering and enumeration methods, you can also configure the set intersection algorithms and optimization techniques by defining macros in 'configuration/config.h'. Please see the comments in 'configuration/config.h' for the detail. We briefly introduce these methods in the following.

Macro Description
HYBRID 0 a hybrid method handling the cardinality skew by integrating the merge-based method with the galloping-based method
HYBRID 1 the merge-based set intersection
SI 0 Accelerate the set intersection with AVX2
SI 1 Accelerate the set intersection with AVX512
SI 2 Scalar set intersection
ENABLE_QFLITER the QFilter set intersection algorithm
ENABLE_FAILING_SET the failing set pruning technique in DP-iso

Furthermore, our source code can support the spectrum analysis on a query. Specifically, we evaluate a query by generate a number of matching orders by performing the permutation of query vertices. For each order, we set a time limit for evaluating the query. In the following command line, we generate 1000 matching orders and set the time limit for each order as 60 seconds. Note that you need to first define SPECTRUM in 'configuration/config.h' file before executing the command.

./SubgraphMatching.out -d ../../test/sample_dataset/test_case_1.graph -q ../../test/sample_dataset/query1_positive.graph -filter LDF -order Spectrum -engine Spectrum -order_num 100 -time_limit 60 -num 100000

Recommendation

We recommend you to use GQL, CFL or DPiso as the filtering method and use GQL or RI as the ordering method. We always recommend you to use LFTJ as the enumeration method. For the large queries, we recommend you to enable the failing set pruning. For the dense data graphs, we recommend you to enable QFilter. Otherwise, use the hybrid set intersection method with AVX2.

In summary, you can execute your workloads with the following parameters as your baseline method in your experiments. First, add the definition "#define ENABLE_FAILING_SET" in configuration/config.h. Second, re-compile the source code. Finally, execute the workloads. Note that in our experiments, we set the target number of results as 100000 and terminate a query if it cannot be completed within 5 minutes (300 seconds). You can adjust these settings according to your requirement.

Execute the program with the "RI":

./SubgraphMatching.out -d ../../test/sample_dataset/test_case_1.graph -q ../../test/sample_dataset/query1_positive.graph -filter GQL -order RI -engine LFTJ -num 100000

Execute the program with the "GQL":

./SubgraphMatching.out -d ../../test/sample_dataset/test_case_1.graph -q ../../test/sample_dataset/query1_positive.graph -filter GQL -order GQL -engine LFTJ -num 100000

Experiment Datasets

The real world datasets and the corresponding query sets used in our paper can be downloaded here. As the synthetic datasets are large, we do not upload them. You can easily generate them by following the instruction in our paper.

About

In-Memory Subgraph Matching: An In-depth Study by Dr. Shixuan Sun and Prof. Qiong Luo

License:MIT License


Languages

Language:C++ 96.1%Language:Python 3.1%Language:CMake 0.4%Language:C 0.4%