hyperdrive / Green-Marl

A DSL for efficient Graph Analysis

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

1 Introduction

Green-Marl [1] is a domain-specific language that is specially designed for graph data analysis. For the further information for the Green-Marl language, refer to the language specification draft [2], which can also be found in this directory in the source package.

'gm_comp' is a compiler for Green-Marl. It reads a Green-Marl file and generates an equivalent, efficient and parallelized C++ implementation, i.e. .cc file. More specifically, the compiler produces a C++ function for each Green-Marl procedure. The generated c++ functions can be compiled with gcc and therefore can be merged into any user application that are compilable with gcc.

The C++ codes that are generated by 'gm_comp' assume the following libraries:

  • gcc (with builtin atomic functions)
  • gcc (with OpenMP support)
  • a custom graph library and runtime (gm_graph)

The first two are supported by any recent gcc distributions (version 4.2 or higher); the third one is included in this source package.

'gm_comp' is also able to generate codes for a completely different target environment (See Section 5).

2 The Package

2-1 Obtaining source package

The source code repository of gm_comp, a Green-Marl compiler is as follows:

http://github.com/stanford-ppl/Green-Marl/

From there, you can get the current source code package in two ways.

  • as a zip file (not recommended) : If you're not familiar with git and do not want to learn it for the time being, you can still download the package as a zip file. Go to the above http address, find the button that is labeled as "zip". Pushing the button, zipped source package will be downloaded momentarily. Note that you have to download the zip package and unzip it by yourself, whenever you want to fetch a fresh version. .

  • via git client: If you already know how to fetch via git, please go ahead from the above repository.

    • git is the name of client program, which allows you to download the source code package from the repository and to maintain it up-to-date. The git client program (git) should have been already installed in any modern linux/unix/cygwin system.

    • Go to any local directory and do the following command; It will create a new sub-directory (named Green-Marl) and download the source code there.

         git clone git://github.com/stanford-ppl/Green-Marl.git
      
    • You can always download most up-to-date package by doing the follwing command at the directory.

          git pull
      
    • (Note) Since current Makefile is far from perfect in detecting dependencies, after pulling a new source it is desired to completely re-build the compiler (or gm_graph library) by doing make clean_all and then doing make.

2-2 Directories

The following are the descriptions of the directories in the pacakge. $(top) denotes the top directory where you put your source package

$(top)/src :  Compiler source code
$(top)/obj :  Object (.o) files during compilation (to be generated)
$(top)/test:  Test routines for compiler development
$(top)/bin :  Compiler binary (gm_comp) (to be generated)
$(top)/etc :  Extra files for enhanced your experience
$(top)/doc :  Documents

$(top)/apps:  Sample Green-Marl applications and runtime library
$(top)/apps/src                  : Sample Green-Marl programs
$(top)/apps/output_cpp           : Directory for Green-Marl to C++ 
$(top)/apps/output_cpp/generated : C++ codes generated by the gm_comp (to be generated)
$(top)/apps/output_cpp/src       : C++ 'main' code for sample applications
$(top)/apps/output_cpp/data      : Directory for data files  (to be generated)
$(top)/apps/output_cpp/bin       : Directory for compiled binaries (to be generated)
$(top)/apps/output_cpp/gm_graph  : gm_graph (C++ runtime and graph library)
$(top)/apps/output_cpp/gm_graph/src  : source  files (.cc) for gm_graph
$(top)/apps/output_cpp/gm_graph/inc  : header  files (.h)  for gm_graph
$(top)/apps/output_cpp/gm_graph/lib  : gm_graph library (.a) (to be generated)
$(top)/apps/output_gps           : Directory for Green-Marl to GPS
$(top)/apps/output_giraph        : Directory for Green-Marl to Giraph

3 The compiler

3-1 System Requirements

The compiler(gm_comp) and runtime(gm_graph) requires the following environments:

  • POSIX environment: linux or unix, (not tested with cygwin)
  • x86 instruction-set architecture (i386 or x86_64)
  • gcc (version >= 4.2), which supports OpenMP (tested on Mac OS with gcc 4.7.1 from MacPorts, Apple's gcc 4.2.1 does not work)
  • g++
  • GNU flex and bison

3-2 Build the Compiler

The following steps build up 'gm_comp', a Green-Marl compiler:

cd $(top)
make compiler         %% check if $(top)/bin/gm_comp has been created successfully.
cd $(top)/bin 
./gm_comp -h          %% this command shows available compiler options

More detailed documentation about the gm_comp compiler can be found in doc/gm_comp.md in this package.

3-3 Installing Syntax Highlighting for vi

Vi users can enable syntax highlighting of Green-Marl programs (.gm files) through the following steps:

cp $(top)/etc/vim/syntax/greenmarl.vim ~/.vim/syntax/
cp $(top)/etc/vim/ftdetect/greenmarl.vim ~/.vim/ftdetect/

3-4 Enable code coverage

To enable code coverage information collection set the CODE_COVERAGE variable in the 'setup.mk.in' to 1. This will tell the compiler to add code coverage information into the compiled code (only works with 'original' GCC and specifically NOT with the LLVM version found on newer OSX systems). Code coverage information can be queried through the 'gcov' tool or an html can be generated through the make 'coverage' target (needs lcov to be installed http://ltp.sourceforge.net/coverage/lcov.php). The 'lcov' tool will generate a html report in the coverage directory. Coverage information will be accumulated across multiple runs and be reseted upon a change of the binary files.

4 Compiling and Executing Sample Applications

4-1 The Applications

A few Green-Marl sample programs are included in the package. The codes can be found in the following location: $(top)/apps/src/

All of the programs are classic graph algorithms implemented in Green-Marl. Take a look at each program (*.gm) and see how each algorithm is described intuitively with Green-Marl.

(a) Betweenness Centrality (bc.gm): this computes (estimated) betweenness centrality of each node in the given graph. As a matter of fact, this version of the algorithm gets an estimated value by performing BFS only from the given set of nodes, instead of all the nodes.

(b) Conductance (conductance.gm): this computes the conductance of the given subset of nodes in a graph. Here, the subset is represented with 'membership' property; i.e., each node has the ID of the group where it belongs.

(c) PageRank (pagerank.gm): this computes pagerank of each node in the given graph.

(d) Strongly connected components (kosaraju.gm) : this finds strongly connected components of a given graph, using Kosaraju's algorithm. The original algorithm performs two DFS traversal of a graph -- one on the original graph, and the other on its transposed graph. However, this version replaces the second DFS with BFS, in order to improve performance.

(e) Vertex Cover (v_cover.gm) : this finds an approximate of minimum vertex cover (i.e. set of edges that coveres as many vertex as possible) through a well-known greedy algorithm. The algorithm is non-deterministic due to lack of tie-breaking rule.

(f) Single-Source Shortest Path (sssp.gm): this file computes shortest distance
to every nodes in the graph from a single node.

(g) Adamic-Adar (adamicAdar.gm) : this computes Adamic-Adar value of each edge, which is often used to predict possible future edges in social networks.

4-2 Compiling Green-Marl Programs with gm_comp

The following steps compiles sample Green-Marl programs into C++:

cd $(top)/apps

%%%% The following file contains the names of Green-Marl programs to be compiled (w/o file extension). 
%%%% The actual Green-Marl programs are located at $(top)/apps/srcs.
cat Programs.mk  

%%%% The following invokes gm_comp and generates c++ implementation (.cc, .h file)
%%%% out of Green-Marl programs (.gm file)
make gen

%%%% Source codes are generated in the following location.
%%%% Check if source codes are successfully generated.
cd $(top)/apps/output_cpp/generated/     

4-3 Compiling Graph and Runtime Library

Every code generated by gm_comp assumes a specific runtime/graph library, namely gm_graph. The source codes of gm_graph library is also included in this package at the following locations:

$(top)/apps/output_cpp/gm_graph/src
$(top)/apps/output_cpp/gm_graph/inc

Regarding data-structure size, gm_graph can be configured in one of follwing three options:

  • 32bit node_id, 32bit edge_id
  • 32bit node_id, 64bit edge_id
  • 64bit edge_id, 64bit edge_id

The above options can be configured by setting up compiler variables NODE_SIZE and EDGE_SIZE in $(top)/apps/output_cpp/common.mk

Once you set-up above compiler flags, gm_graph library can be built through the following steps:

cd $(top)/apps/
make lib

%%%% Check libgmgraph.a has been successfully created.
ls $(top)/apps/output_cpp/gm_graph/lib  

Note that this step is required only once for each package release. Also note that whenever you change the node_id/edge_id size, you have to re-built the gm_graph libaray as well as your applications. A more detailed documentation about gm_graph library can be found in doc_gm_graph.txt [todo]

4-4 Compiling the Generated C++ Files

Before we go, let us remind you that all the codes generated by gm_comp are C++ 'functions' that can be invoked by 'main' application. Thus, this package provides examplar 'main functions' for the sample Green-Marl programs. Each main file of programs is named as ${program}_main.cc

All the 'main functions' do the same things: (1) it reads out the graph (custom binary format) from the given filename, (2) prepares other input data, and (3) invokes the function generated by gm_comp; internally, the function is executed using given number of threads.

Note that the user is free to create his/her own main functions in any case.

Now we can compile the generated code through the following steps:

cd $(top)/apps/
make bin 
cd $(top)/apps/output_cpp/bin   %% check if executables are successfully generated

4-5 Short Cut

In fact, there is a short cut that does all of 4.2~4.4 at one command:

cd $(top)
make apps     %% this does all of 'make lib', 'make gen', and 'make bin' in order. It also compiles the compiler if it has not been compiled before.

4-6 Executing sample programs

To execute sample programs, we need at least one graph instance. For this purpose, let us first synthesis a graph using an external tool named graph_gen. graph_gen is a simple tool that generates syntactic graphs and stores it using binary files. One can get graph_gen from (TODO:). Here, we assume that the graph_gen executable has been successfully built and copied into $(top)/apps/output_cpp/bin/.

 (TODO)
 * Currently graph_gen is created together when gm_graph is built
 * We will make graph_gen a separate package; one routine is polluted by GPL

Now, do the following steps and execute sample programs:

cd $(top)/apps/output_cpp

%%%% Synthesize a uniform random graph (1m nodes, 8m edges).
bin/graph_gen 1000000 8000000 data/u1m_8m.bin 0

%%%% Run conductance algorithm on the uniform graph with one thread.
bin/conduct data/u1m_8m.bin 1

%%%% Run conductance algorithm on the uniorm graph with 8 threads.
bin/conduct data/u1m_8m.bin 8

%%%% Try remaining sample applications (pagerank, bc, kosaraju)
%%%% All of them have the same following command-line arguments
%%%%   <program name> <graph name> <# of threads>

4-7 Adding your own Green-Marl Program

Refer to tutorial document which provides a walk-through to write your own Green-Marl program, to compile it, and to execute it.

5 Generating output for distributed graph analysis frameworks

The usage model of the Green-Marl language is not restricted to the cache-coherent shared-memory multi-processor environment. A program written in Green-Marl can be compiled into another program that runs on a completely different target environment.

'gm_comp' can translate (a subset of) Green-Marl programs into Java codes that run on a Pregel[3]-like framework. Pregel[3] is a framework that enables distributed execution of graph algorithms that are written on its API. However, the way Pregel API is designed is far from the way graph algorithms are normally designed; often the user should re-design his/her algorithm completely, in order to comply with the API.

Green-Marl aims to solve this problem: the user can write his/her algorithm in a natural way, while the compiler automatically re-write the algorithm into the Pregel API.

Currently, gm_comp can translate a certain subset of Green-Marl programs into GPS or Giraph applications. GPS is a custom replicate of Pregel[3], developed at Stanford, with a few enhancements to the original Pregel model. Giraph is an open-source clone of Pregel running on Hadoop. This GPS and Giraph back-ends of gm_comp are still in early development stage.

To see how gm_comp-GPS works, try following steps:

 cd $(top)/apps
 make env=gps PROGS="pagerank avg_teen_cnt"
 
 %%%% Now check the generated (java) files, and see that how differently they look
 %%%% to the C++ implementation.
 cd $(top)/apps/output_gps/generated/
 ls *.java

And for gm_comp-Giraph:

 cd $(top)/apps
 make env=giraph PROGS="pagerank avg_teen_cnt"

 cd $(top)/apps/output_giraph/generated/
 ls *.java

To run the generated Giraph code, follow these instructions.

Note that simple Green-Marl code (e.g. avg_teen_cnt.gm) is turned into a very complicated java class; all these methods are required to comply with the Pregel API. The compiler automatically generates all these methods.

6 License

Copyright (c) 2011-2012 Stanford University, unless otherwise specified. All rights reserved.

This software was developed by the Pervasive Parallelism Laboratory of Stanford University, California, USA.

Permission to use, copy, modify, and distribute this software in source or binary form for any purpose with or without fee is hereby granted, provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  3. Neither the name of Stanford University nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

7 References

[1] S. Hong et al, "Green-Marl: A DSL for easy and efficient graph analysis", ASPLOS 2012
[2] Green-Marl language specification (draft ver. 0.3)
[3] G. Malewicz et al, "Pregel: A System for Large Scale Graph Processing", SIGMOD 2010.

About

A DSL for efficient Graph Analysis

License:Other


Languages

Language:C++ 45.5%Language:C 31.6%Language:Shell 7.5%Language:HTML 6.6%Language:Makefile 3.2%Language:Standard ML 3.1%Language:Java 0.6%Language:Yacc 0.6%Language:CMake 0.4%Language:Python 0.4%Language:Pascal 0.3%Language:Lex 0.2%Language:Vim Script 0.1%Language:Batchfile 0.0%Language:M4 0.0%