mishidemudong / InfoFlow-GraphFrame

GraphFrame implementation of InfoFlow, an Apache Spark community detection algorithm


InfoFlow

An Apache Spark implementation of the InfoMap community detection algorithm

This is now abandoned.

DataFrames do not perform well with iteration/recursion at all. The execution plan grows rapidly with each iteration, so that query planning on the driver becomes the time-limiting factor, resulting in impractically long runtimes.
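A common mitigation for this kind of lineage growth is to checkpoint between iterations; below is a minimal sketch under that assumption (hypothetical helper, not this repository's code, and not necessarily enough to fix the problem described above):

```scala
import org.apache.spark.sql.DataFrame

// Minimal sketch (illustrative only): run an iterative transformation while
// truncating the query plan each round with localCheckpoint, so the plan does
// not accumulate the whole iteration history.
object IterationSketch {
  def iterate(initial: DataFrame, maxIter: Int)(step: DataFrame => DataFrame): DataFrame = {
    var current = initial
    for (_ <- 1 to maxIter) {
      // eager = true materializes the partitions immediately and drops lineage
      current = step(current).localCheckpoint(eager = true)
    }
    current
  }
}
```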

Theory

This section provides the discrete mathematics that allows the InfoMap algorithm to be adapted to Apache Spark, and develops the parallel version, InfoFlow.

Fundamentals

These are the fundamental maths found in the original paper [Martin Rosvall and Carl T. Bergstrom, PNAS 105 (4) 1118–1123, January 29, 2008; https://doi.org/10.1073/pnas.0706851105]:

Nodes

Each node is indexed, with the index denoted by a Greek letter α, β or γ. Each node α is associated with an ergodic frequency p_α. Between nodes there may be a directed edge with weight ω_αβ from node α to node β. The directed edge weights are normalized with respect to the outgoing node, so that

Σ_β ω_αβ = 1
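As a rough illustration of this normalization in Spark, here is a sketch assuming a hypothetical edge schema (src, dst, weight), not this repository's actual schema:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

// Sketch: given directed edges (src, dst, weight), compute ω_αβ by dividing
// each weight by the total outgoing weight of its source node.
def normalizeOutgoing(edges: DataFrame): DataFrame = {
  val bySource = Window.partitionBy("src")
  edges.withColumn("omega", col("weight") / sum(col("weight")).over(bySource))
}
```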

The nodes are unchanged for all partitioning schemes.

Modules

The nodes are partitioned into modules. Each module is indexed with Latin letters i, j, or k.

Each module has ergodic frequency

p_i = Σ_{α∈i} p_α

and the probability of exiting the module is

q_i = τ (n − n_i)/(n − 1) p_i + (1 − τ) Σ_{α∈i} Σ_{β∉i} p_α ω_αβ

where n is the total number of nodes, n_i is the number of nodes in module i, and τ is the teleportation probability. With these, we try to minimize the code length

L = plogp( Σ_i q_i ) − 2 Σ_i plogp( q_i ) − Σ_α plogp( p_α ) + Σ_i plogp( p_i + q_i )

where

plogp(x) = x log₂ x
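For concreteness, here is a small Scala sketch of the plogp form of the code length above (illustrative helpers, not this repository's API):

```scala
// plogp(x) = x log2 x, with the convention plogp(0) = 0
def plogp(x: Double): Double =
  if (x > 0.0) x * math.log(x) / math.log(2.0) else 0.0

// Code length L from node frequencies p_α, module frequencies p_i and
// module exit probabilities q_i (same index order for the module sequences).
def codeLength(pNodes: Seq[Double], pMod: Seq[Double], qMod: Seq[Double]): Double =
  plogp(qMod.sum) -
    2.0 * qMod.map(plogp).sum -
    pNodes.map(plogp).sum +
    pMod.zip(qMod).map { case (p, q) => plogp(p + q) }.sum
```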

Simplifying calculations

We develop maths to reduce the computational complexity for merging calculations. Specifically, we find recursive relations, so that when we merge modules j and k into i, we calculate the properties of i from those of j and k.

Calculating merging quantities

We can rewrite

q_i = τ (n − n_i)/(n − 1) p_i + (1 − τ) Σ_{α∈i} Σ_{β∉i} p_α ω_αβ

as

q_i = τ (n − n_i)/(n − 1) p_i + (1 − τ) w_i

with

w_i = Σ_{α∈i} Σ_{β∉i} p_α ω_αβ

being the exit probability without teleportation.

We can define a similar quantity, the transition probability without teleportation from module j to module k:

w_jk = Σ_{α∈j} Σ_{β∈k} p_α ω_αβ

Now, if we merge modules j and k into a new module with index i, the exit probability follows

q_i = τ (n − n_i)/(n − 1) p_i + (1 − τ) w_i

with

n_i = n_j + n_k ,  p_i = p_j + p_k

and the exit probability without teleportation can be calculated via

w_i = Σ_{α∈i} Σ_{β∉i} p_α ω_αβ

Since we are looking at the exit probability of a module, and there are no self-connections within modules, the specification of p_α ω_αβ with α ∈ i, β ∉ i is redundant. Then we have

w_i = Σ_{α∈j} Σ_{β∉i} p_α ω_αβ + Σ_{α∈k} Σ_{β∉i} p_α ω_αβ

which conforms with intuition: the exit probability without teleportation of the new module is equal to the exit probability of all its nodes, without counting the connections from j to k or from k to j.

We can further simplify the maths by expanding the non-inclusive set specification:

Σ_{β∉i} = Σ_{β∉j} − Σ_{β∈k} for α ∈ j , and Σ_{β∉i} = Σ_{β∉k} − Σ_{β∈j} for α ∈ k

Expanding gives

w_i = Σ_{α∈j} Σ_{β∉j} p_α ω_αβ − Σ_{α∈j} Σ_{β∈k} p_α ω_αβ + Σ_{α∈k} Σ_{β∉k} p_α ω_αβ − Σ_{α∈k} Σ_{β∈j} p_α ω_αβ

which by definition is

w_i = w_j + w_k − w_jk − w_kj

This allows economical calculation of the merged exit probability from already-computed modular quantities.

We can do the same for w_il: if we merge modules j and k into i, and l is some other module, then

w_il = Σ_{α∈i} Σ_{β∈l} p_α ω_αβ = w_jl + w_kl

and similarly for w_li:

w_li = w_lj + w_lk

We can simplify further. The directionality of the connections is not needed, since w_ij and w_ji always appear together in the expressions above. Then, we can define

w_{i↔j} = w_ij + w_ji

and we can verify that

w_il = w_jl + w_kl  and  w_li = w_lj + w_lk

combine to give

w_{i↔l} = w_{j↔l} + w_{k↔l}

and this quantity is applied via

w_i = w_j + w_k − w_{j↔k}
The calculations above carry a key, central message: for the purpose of community detection, we can forget about the actual nodal properties; after each merge, we only need to keep track of modular properties, one record per module/community.

Calculating code length reduction

Given an initial code length calculated according to

L = plogp( Σ_i q_i ) − 2 Σ_i plogp( q_i ) − Σ_α plogp( p_α ) + Σ_i plogp( p_i + q_i )

further iterations of the code length calculation can be computed via dynamic programming. Whenever we merge two modules j and k into i, with new module frequency p_i and exit probability q_i, the change in code length is

∆L = plogp( Σq + q_i − q_j − q_k ) − plogp( Σq ) − 2 plogp( q_i ) + 2 plogp( q_j ) + 2 plogp( q_k ) + plogp( p_i + q_i ) − plogp( p_j + q_j ) − plogp( p_k + q_k )

where Σq = Σ_i q_i, so that if we keep track of Σ_i q_i, we can calculate ∆L quickly by plugging in p_i, p_j, p_k, q_i, q_j, q_k.
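A sketch of the corresponding ∆L update, reusing the plogp helper from the earlier sketch (names are illustrative):

```scala
// Change in code length when modules j and k merge into i, given the current
// running sum qSum = Σ_i q_i. The term Σ_α plogp(p_α) cancels and never appears.
// plogp is the helper from the earlier sketch: x * log2(x), with plogp(0) = 0.
def deltaL(qSum: Double,
           pj: Double, qj: Double,
           pk: Double, qk: Double,
           pi: Double, qi: Double): Double =
  plogp(qSum + qi - qj - qk) - plogp(qSum) -
    2.0 * plogp(qi) + 2.0 * plogp(qj) + 2.0 * plogp(qk) +
    plogp(pi + qi) - plogp(pj + qj) - plogp(pk + qk)
```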

InfoMap Algorithm

The algorithm consists of two stages, the initial condition and the loop:

Initial condition

Each node is its own module, so that the modular quantities n_i, p_i, w_i and q_i reduce to their single-node values, and ∆L is calculated for all possible merging pairs according to the expression for ∆L above.

Loop

Find the merging pair that would most reduce the code length; if the code length cannot be reduced, terminate the loop. Otherwise, merge the pair to form a module with the following quantities, so that if we merge modules j and k into i (these equations were presented in previous sections, and are repeated here for ease of reference):

n_i = n_j + n_k

p_i = p_j + p_k

w_i = w_j + w_k − w_{j↔k}

q_i = τ (n − n_i)/(n − 1) p_i + (1 − τ) w_i

and

∆L = plogp( Σq + q_i − q_j − q_k ) − plogp( Σq ) − 2 plogp( q_i ) + 2 plogp( q_j ) + 2 plogp( q_k ) + plogp( p_i + q_i ) − plogp( p_j + q_j ) − plogp( p_k + q_k )

is recalculated for all merging pairs that involve module i, i.e., for each w_{i↔l}. The sum Σ_i q_i is updated in each loop by adding q_i − q_j − q_k.

Algorithm

Given the above math, the pseudocode is:

Initiation:

  • Construct a table where each row is an undirected edge between modules in the graph. Each row has the format ( (vertex1, vertex2), ( n1, n2, p1, p2, w1, w2, w12, q1, q2, ∆L ) ). The quantities n, p, w, q are properties of the two modules: n is the nodal size of the module, p is the ergodic frequency of the module, w is the exit probability of the module without teleportation, q is the exit probability of the module with teleportation, and ∆L is the change in code length if the two modules are merged. Vertex1 and vertex2 are arranged such that vertex1 is always smaller than vertex2. (A schematic of this row layout and of the greedy loop is sketched after the loop steps below.)

Loop:

  • Pick the row that has the smallest ∆L. If it is non-negative, terminate the algorithm.
  • Calculate the newly merged module's size, ergodic frequency, and exit probabilities.
  • Calculate the new RDD of edges by deleting the edges between the merging modules, and then aggregating all edges associated with module 2 into those of module 1.
  • Update the table by aggregating all rows associated with module 2 into those of module 1. Join the table with the RDD of edges. Since the RDD of edges contains w_{1↔k}, we can now calculate ∆L and store it in the table for all rows associated with module 1.
  • Repeat from the first step. There are O(e) merges, e being the number of edges in the graph.
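Here is a schematic of the row layout and the greedy loop above, in plain Scala for readability (illustrative types only; the actual implementation would operate on Spark RDDs/DataFrames, and applyMerge stands in for the update steps of the loop):

```scala
// One row per undirected module pair, as described in the initiation step.
final case class EdgeRow(
  v1: Long, v2: Long,                    // vertex1 < vertex2
  n1: Long, n2: Long,                    // nodal sizes of the two modules
  p1: Double, p2: Double,                // ergodic frequencies
  w1: Double, w2: Double, w12: Double,   // exit / inter-module probabilities without teleportation
  q1: Double, q2: Double,                // exit probabilities with teleportation
  deltaL: Double                         // code-length change if the two modules merge
)

// Greedy skeleton: keep picking the most negative ∆L until none remains.
// applyMerge is a placeholder for recomputing module and edge properties.
@annotation.tailrec
def greedyMerge(rows: List[EdgeRow],
                applyMerge: (List[EdgeRow], EdgeRow) => List[EdgeRow]): List[EdgeRow] =
  if (rows.isEmpty) rows
  else {
    val best = rows.minBy(_.deltaL)
    if (best.deltaL >= 0.0) rows         // no merge reduces the code length: stop
    else greedyMerge(applyMerge(rows, best), applyMerge)
  }
```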

Merging multiple modules at once

In the previous sections, we have developed the discrete mathematics and algorithm that performs community detection with O(e) loops, based on the key mathematical finding that we only need to remember modular properties, not nodal ones.

The algorithm above does not take advantage of the parallel processing capabilities of Spark. One obvious improvement is to perform multiple merges per loop. However, the algorithm so far merges exactly two modules per iteration, which is not compatible with performing multiple merges unless we can make sure no module is involved in more than one merge at a time.

Here, rather than making sure that no module is involved in more than one merge at a time, we explore the idea of merging multiple modules at once. Thus, we can perform parallel merges in the same loop iteration, where possibly all modules are involved in some merge.

Mathematics

Here we develop the mathematics to keep track of merging multiple modules at once.

We consider multiple modules M_i merging into a module M. Equivalently, the module M is partitioned into non-overlapping subsets M_i:

M = ∪_i M_i ,  with M_i ∩ M_j = ∅ for i ≠ j

Then we can expand the nodal sum over module M into the sum over all nodes in all submodules M_i, so that the exit probability of the merged module M becomes:

where we expand the second term with respect to the Mj’s to give

Combining the first and third terms,

which we can recognize as

which we can immediately see as linear generalizations of the previous equations, and which may be calculated iteratively as in the previous algorithm. We can calculate w_{Mi Mj} by expanding on the partitioning:

so that when we merge a number of modules together, we can calculate the merged module's connections with other modules by aggregating the existing modular connections. This is directly analogous to the pairwise relation w_{i↔l} = w_{j↔l} + w_{k↔l} above.

Thus, the mathematical properties of merging multiple modules into one are identical to those of merging two modules. This is key to developing the multi-merge algorithm.
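Here is a sketch of the multi-merge aggregation, reusing the Module case class from the earlier merge sketch (labels, wInternal and the other names are illustrative):

```scala
// A pre-merge module together with the label of the module it will merge into.
final case class Labeled(label: Long, n: Long, p: Double, w: Double)

// Merge all modules sharing a label. wInternal(label) is the total undirected
// connection weight that becomes internal to the merged module (the sum of
// w_{Mi↔Mj} over pairs inside the group), subtracted from the summed w's.
def mergeByLabel(mods: Seq[Labeled], wInternal: Map[Long, Double],
                 nTotal: Long, tau: Double): Map[Long, Module] =
  mods.groupBy(_.label).map { case (label, group) =>
    val n = group.map(_.n).sum
    val p = group.map(_.p).sum
    val w = group.map(_.w).sum - wInternal.getOrElse(label, 0.0)
    val q = tau * (nTotal - n).toDouble / (nTotal - 1).toDouble * p + (1.0 - tau) * w
    label -> Module(n, p, w, q)
  }
```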

Algorithm

Initiation is similar to the InfoMap algorithm, so that we have an RDD of modular properties in the format (index, n, p, w, q), and edge properties ((vertex1, vertex2), summed connection weight) and ((vertex1, vertex2), ∆L).

Loop:

  • For each module, seek to merge with a connected module that would offer the greatest reduction in code length. If no such merge exists, the module does not seek to merge. This yields a set of seed edges, at most one per module, and hence O(e) of them (see the sketch after this list).

  • For each of these edges, label it with the module index it is to be merged into, so that each connected component of the seed-edge graph carries the same label. This has O(k) complexity, k being the size of the connected component. The precise algorithm is described below.

  • Recalculate the modular and edge property values via aggregations over the merge labels, using the multi-merge relations above.
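Here is a sketch of the first step, where each module picks its single best merge candidate (local collections for clarity; the actual implementation would express this as an RDD/DataFrame aggregation):

```scala
// One candidate per directed pair: module `from` considering merging into `to`.
final case class Candidate(from: Long, to: Long, deltaL: Double)

// Keep, for every module, the neighbouring merge with the most negative ∆L;
// modules with no code-length-reducing neighbour propose nothing.
def bestMergePerModule(candidates: Seq[Candidate]): Map[Long, Candidate] =
  candidates
    .filter(_.deltaL < 0.0)
    .groupBy(_.from)
    .map { case (module, cs) => module -> cs.minBy(_.deltaL) }
```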

Labeling connected components

This algorithm labels a graph so that all edges within a connected component carry the same module index, chosen as the index that occurs most frequently among the linked edges.

Initiation:

  • Given the edges, count the occurrences of the vertices.
  • Label each edge with whichever of its two vertices has the higher occurrence count.

Loop:

  • Count the label occurrences for each label.
  • For each vertex, find the label with the maximum occurrence associated with it.
  • For each edge, label it according to the vertex with a higher label count.
  • If, for every edge, the new label is identical to the old, terminate; otherwise, repeat. (A sketch of this labeling loop follows.)
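One possible reading of this labeling loop in plain Scala over local collections (illustrative only; the actual implementation would express it as Spark joins and aggregations):

```scala
final case class LabeledEdge(v1: Long, v2: Long, label: Long)

// Iteratively relabel each edge by whichever endpoint carries the more frequent
// label, stopping once the labeling no longer changes.
@annotation.tailrec
def propagateLabels(edges: Vector[LabeledEdge]): Vector[LabeledEdge] = {
  // 1. count occurrences of each label
  val labelCount: Map[Long, Int] = edges.groupBy(_.label).map { case (l, es) => l -> es.size }
  // 2. for each vertex, the most frequent label among its incident edges
  val vertexLabel: Map[Long, Long] =
    edges.flatMap(e => Seq(e.v1 -> e.label, e.v2 -> e.label))
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).maxBy(labelCount) }
  // 3. relabel each edge according to the endpoint with the higher label count
  val relabeled = edges.map { e =>
    val (l1, l2) = (vertexLabel(e.v1), vertexLabel(e.v2))
    e.copy(label = if (labelCount(l1) >= labelCount(l2)) l1 else l2)
  }
  // 4. terminate when stable
  if (relabeled == edges) edges else propagateLabels(relabeled)
}
```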

Performance Improvement

InfoMap performs greedy merges of two modules at a time until no more merges can be performed. If there are n nodes in the graph and m modules in the end, then there are n − m merges, and since each loop performs one merge, there are n − m loops. Assuming the graph is sparse, so that the number of edges is proportional to the number of nodes, InfoMap thus has time complexity linear in the number of nodes/edges.

In the multiple-merging scheme, within each loop each module merges with one other module. Let us assume that in each loop, k modules merge into one on average. Then, let there be l loops, and as before, n nodes are merged into m modules. Since each loop reduces the number of modules by a factor of k, we have

n / k^l = m ,  i.e.  l = log_k( n / m )
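A worked example with illustrative numbers:

```scala
// l = log_k(n / m): loops needed to go from n singleton modules to m modules
// when each loop shrinks the module count by a factor of k on average.
def loops(n: Double, m: Double, k: Double): Double =
  math.log(n / m) / math.log(k)

// e.g. n = 1,048,576 nodes, m = 1,024 final modules, k = 2  =>  about 10 loops
// loops(1048576, 1024, 2) ≈ 10
```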

Within each merge, there are O(k) operations to aggregate the indices appropriately for the merges.

Thus, the overall time complexity is O(k log n). For example, if k = 2, i.e., every pair of modules is merged at every step, then we have O(2 log n) = O(log n) complexity, i.e., logarithmic in the number of nodes/edges.

In the worst case, we have linear time complexity: if l = 1 and k = n/m, the overall complexity is O(k) = O(n/m). Another possibility is that InfoFlow degenerates into InfoMap, merging one pair of modules per step, which of course recovers linear complexity.

The central reason behind the logarithmic time complexity is the constant average merging of k modules into one. The constancy of this factor likely depends on the network structure. Given a sparse network, the number of edges is similar to the number of nodes. If we make the assumption that k is proportional to the number of edges, then after a loop, the number of modules is reduced by a factor of k, and so is the number of edges. Thus, the network sparsity remains unchanged, and k is unchanged also. Of course, ultimately, actual performance benchmarks are required.

The logarithmic complexity and the constant k suggest that enforced binary merging, i.e., either a pair-wise merge or no merge at all for some modules, might achieve the best runtime complexity. A possible catch-22 is that, to enforce pair-wise merges, O(k) explorations would be needed, so that the runtime complexity remains the same and actual performance is even penalized. Further mathematical ideas, simulations and benchmarks would be required to explore this.

Author

License

This project is licensed under the MIT License - see the LICENSE.md file for details
