graspologic-org / graspologic

Python package for graph statistics

Home Page: https://graspologic-org.github.io/graspologic/

[Question] Running Leiden community detection on random graph generates multiple communities

johandahlberg opened this issue · comments

Hi!

I'm trying to run Leiden community detection on a randomly generated graph (as part of a test suite for a larger project). I would expect community detection on this graph to return a single community, but that does not seem to be the case. My expectation is based on having previously done the same thing with igraph and getting a single community back. I can get a single community by adjusting the resolution parameter, but I would prefer not to change that, since I need this to work automatically later for the use case I'm working on.

My question is whether this is a reasonable expectation. And are there parameters other than resolution that I should look into in order to get a single community back in this case?
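
For reference, this is roughly what I mean by adjusting the resolution, using the graph from the snippet further down (the value here is just an example to illustrate the idea, not something I want to hard-code):

from graspologic.partition import leiden

# Lowering the resolution below its default of 1.0 makes larger communities
# cheaper under the modularity objective, so Leiden merges more aggressively.
node_mappings = leiden(graph, use_modularity=True, resolution=0.5)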

I have included the code I've written to explore this below, to make it clearer what I mean.

import networkx as nx
from graspologic.partition import leiden
from collections import defaultdict
from pprint import pprint

graph = nx.fast_gnp_random_graph(100, p=0.1, seed=10)
# Pick the largest component to get rid of any unconnected nodes
graph = max([graph.subgraph(c).copy() for c in nx.connected_components(graph)],
            key=lambda g: len(g.nodes()))

node_mappings = leiden(
    graph,
    use_modularity=True,
)
def node_mappings_to_communities(node_mappings):
    communities = defaultdict(set)
    for node, community in node_mappings.items():
        communities[community].add(node)
    for _, nodes in communities.items():
        yield nodes

communities = list(node_mappings_to_communities(node_mappings))
print("nbr of communities", len(communities))
print("communities:")
pprint(communities, width=200)

The result of this is:

nbr of communities 6
communities:
[{10, 16, 24, 26, 34, 35, 43, 52, 59, 65, 66, 67, 68, 70, 75, 79, 80, 84, 86, 87},
 {0, 5, 6, 8, 13, 18, 19, 20, 29, 30, 39, 44, 45, 46, 48, 53, 55, 57, 62, 69, 71, 78, 85, 90, 94, 97},
 {1, 9, 14, 21, 23, 25, 27, 33, 36, 38, 40, 42, 47, 61, 63, 73, 76, 77, 83, 89, 91, 92, 95, 98, 99},
 {64, 2, 37, 7, 41, 60, 54, 56, 88, 28, 31},
 {96, 32, 4, 11, 12, 81, 51, 93},
 {3, 72, 74, 15, 49, 50, 82, 17, 22, 58}]

To try to make sense of this, I also plotted the graph with the resulting communities, and it's difficult for me to understand why the communities were assigned the way they were in this graph.

import matplotlib.pyplot as plt
import networkx as nx
import matplotlib.colors as mcolors

pos = nx.spring_layout(graph, seed=3113794652)

options = {"node_size": 40}
for idx, community in enumerate(communities):
    # Draw each community's nodes in a distinct Tableau color.
    nx.draw_networkx_nodes(graph,
                           pos,
                           nodelist=list(community),
                           node_color=list(mcolors.TABLEAU_COLORS.keys())[idx],
                           **options)

nx.draw_networkx_edges(graph, pos, width=1.0, alpha=0.5)

plt.tight_layout()
plt.axis("off")
plt.show()

[image: the graph drawn with the spring layout, nodes colored by their assigned Leiden community]

Hi @johandahlberg - unfortunately, this is a known issue with Leiden as well as other modularity-maximization methods. These methods aren't doing model selection, and aren't good at "knowing" whether the community structure they find is significant or whether, as you point out, it would be expected under a random model. See for instance this image of an adjacency matrix after sorting by the partition inferred by Leiden (from https://bdpedigo.github.io/networks-course/community_detection.html#overfitting):

[image: adjacency matrix sorted by the partition inferred by Leiden]

You'll notice that those "communities" actually do have more within-community connections than between-community connections, so structures like this can pop up just by chance.

One option is to explicitly form a null distribution (say, by sampling a few hundred ER graphs, running Leiden on each, and computing the modularity) and then compare your observed modularity to that distribution.
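
Something like the sketch below is the kind of thing I have in mind (using the graph from your snippet above; the 200 null samples and the p-value convention are just arbitrary choices on my part, and I'm scoring partitions with networkx's modularity function rather than anything graspologic-specific):

import networkx as nx
import numpy as np
from graspologic.partition import leiden

def partition_to_communities(node_mappings):
    # Convert leiden's {node: community_id} dict into a list of node sets,
    # which is the format networkx's modularity function expects.
    communities = {}
    for node, community in node_mappings.items():
        communities.setdefault(community, set()).add(node)
    return list(communities.values())

def leiden_modularity(g):
    # Run Leiden and score the resulting partition with networkx's modularity.
    # Restricting to the nodes Leiden returned guards against isolated nodes,
    # which would otherwise not be covered by the partition.
    mapping = leiden(g, use_modularity=True)
    return nx.algorithms.community.modularity(
        g.subgraph(mapping.keys()), partition_to_communities(mapping)
    )

# Observed modularity on the graph from your snippet.
observed = leiden_modularity(graph)

# Null distribution: ER graphs matched on node count and edge density.
n, p = graph.number_of_nodes(), nx.density(graph)
null = [leiden_modularity(nx.fast_gnp_random_graph(n, p, seed=i)) for i in range(200)]

# Simple empirical p-value: how often a random graph scores at least as high.
p_value = (1 + sum(q >= observed for q in null)) / (1 + len(null))
print(f"observed Q = {observed:.3f}, null mean Q = {np.mean(null):.3f}, p ~ {p_value:.3f}")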

Zhang and Peixoto also wrote a nice article about this problem, https://journals.aps.org/prresearch/abstract/10.1103/PhysRevResearch.2.043271, and provide an alternative formulation that relies on a variant of the stochastic block model. Their implementation is available in graph-tool. The tradeoff is that I'm guessing it's slower and may be harder to optimize, but it could be helpful for your application, so I thought I'd mention it.
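
I haven't tried it on your example, but the usage looks roughly like the sketch below (graph_tool.inference.PPBlockState is the "planted partition" state from their docs; please double-check against the graph-tool inference how-to, since the API may differ between versions):

import graph_tool.all as gt
import numpy as np

# Build a graph-tool graph from the networkx edge list; hashed=True stores the
# original node labels in a vertex property map instead of assuming 0..n-1 ids.
g = gt.Graph(directed=False)
labels = g.add_edge_list(graph.edges(), hashed=True)

# Fit the assortative (planted partition) block model with merge-split MCMC at
# zero temperature, i.e. greedy maximization of the posterior.
state = gt.PPBlockState(g)
state.multiflip_mcmc_sweep(beta=np.inf, niter=1000)

blocks = state.get_blocks()
print("nbr of communities", len({blocks[v] for v in g.vertices()}))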

Thank you for your quick and useful answer @bdpedigo! I'll look further into the methods you suggest.

Absolutely - if it's alright with you I'm going to close this issue since I don't think there's anything to be done in terms of graspologic, but please feel free to post any further questions here.