abaxi / bespoke-icdm18

An implementation of Bespoke, a semi-supervised community detection algorithm

Author email: bakshi.11@osu.edu

#############################################
#####Dependencies
#############################################
python 2.7
numpy 1.11.0
scipy 0.17.0
scikit-learn 0.16.0
matplotlib 1.5.1 (provides pylab)

#############################################
#####Cite
#############################################
Bakshi, Arjun, Srinivasan Parthasarathy, and Kannan Srinivasan. 
"Semi-Supervised Community Detection Using Structure and Size." 
2018 IEEE International Conference on Data Mining (ICDM). IEEE, 2018.


#############################################
#####Input file and formats
#############################################
Community file:
One community per line. Each line should contain a TAB (\t) separated list of the node IDs in that community.
Node IDs must be integers. Example:
------------------
1   22  34  102 45
243 43  12  55  324
...
..
.
------------------
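As an illustration of the community file format, here is a minimal parsing sketch. The function name parse_communities is hypothetical and is not part of this repository. It splits on any whitespace, which also accepts the TAB-separated layout described above.

```python
def parse_communities(lines):
    """Return a list of communities, each a list of integer node IDs.

    `lines` is any iterable of strings, e.g. an open file handle.
    Hypothetical helper for illustration; not part of this repository.
    """
    communities = []
    for line in lines:
        fields = line.split()  # any whitespace, including TAB
        if fields:  # skip blank lines
            communities.append([int(tok) for tok in fields])
    return communities


# Usage (file name taken from the example section of this README):
# with open("train.txt") as fh:
#     comms = parse_communities(fh)
```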

Graph file:
One edge per line. Each line should contain 2 node IDs separated by a TAB (\t). Node IDs must be integers.
Example:
------------------
1 2
44  4
12  43
...
...
.
------------------
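The graph file can be read into an adjacency structure along these lines. This is only a sketch: the name parse_edges is hypothetical, the graph is assumed to be undirected, and lines starting with "#" are assumed to be comments (neither assumption is stated in this README).

```python
from collections import defaultdict


def parse_edges(lines):
    """Build {node: set(neighbours)} from an iterable of edge lines.

    Assumptions (not confirmed by this README): edges are undirected,
    and lines beginning with '#' are comments to be skipped.
    """
    adj = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        u, v = (int(tok) for tok in line.split()[:2])
        adj[u].add(v)
        adj[v].add(u)
    return adj


# Usage:
# with open("com-dblp.ungraph.txt") as fh:
#     adj = parse_edges(fh)
```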

Node labels file:
One node-label pair per line: a node ID followed by an integer label, TAB separated.
If there are N distinct labels, the labels should range from 0 to N-1.
Node IDs and labels must be integers. Example:
------------------
1   0
22   4
13   4
4   4
11  3
14  0
...
..
.
------------------
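A quick way to check a node labels file against the rules above is sketched here. The function name parse_node_labels is hypothetical. The check that the labels are exactly 0 to N-1 follows the description in this section.

```python
def parse_node_labels(lines):
    """Return {node_id: label} and verify that labels are exactly 0..N-1.

    Hypothetical helper for illustration; not part of this repository.
    """
    labels = {}
    for line in lines:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        labels[int(fields[0])] = int(fields[1])
    distinct = set(labels.values())
    if distinct != set(range(len(distinct))):
        raise ValueError("labels must range from 0 to N-1, got %s"
                         % sorted(distinct))
    return labels


# Usage:
# with open("dblp_labels.txt") as fh:
#     labels = parse_node_labels(fh)
```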

#############################################
#####Generating node labels file
#############################################
If you do not have a node label file, one can be generated using the "label_nodes.py" script.
Usage:  python label_nodes.py <graph_src> <#node labels> <output filename> [--verbose]
Generally, setting #node labels to 4 or 5 works best.
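The verbose output shown in the example section mentions Jaccard percentiles, node features, and clustering. The sketch below illustrates only the first idea: describing each node by percentiles of its Jaccard similarity with its neighbours. It is inferred from those log messages, not taken from label_nodes.py, so the function names, the percentile rule, and the chosen percentiles are all assumptions. The clustering step is omitted.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = len(a | b)
    return len(a & b) / float(union) if union else 0.0


def simple_percentile(sorted_vals, q):
    """Pick the element nearest to the q-th percentile position (q in 0..100)."""
    idx = int(q / 100.0 * (len(sorted_vals) - 1) + 0.5)
    return sorted_vals[idx]


def jaccard_features(adj, qs=(0, 25, 50, 75, 100)):
    """Map each node to percentiles of its Jaccard similarity with neighbours.

    `adj` is {node: set(neighbours)}. Assumed feature definition; the
    actual script may compute its features differently.
    """
    feats = {}
    for u, nbrs in adj.items():
        sims = sorted(jaccard(nbrs, adj[v]) for v in nbrs)
        if sims:
            feats[u] = [simple_percentile(sims, q) for q in qs]
        else:
            feats[u] = [0.0] * len(qs)  # isolated node
    return feats
```

Feature vectors like these could then be grouped into the requested number of labels with any clustering method, for example k-means.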

#############################################
#####Running Bespoke
#############################################
Bespoke can be run using the "run_bespoke.py" python file.
usage: python run_bespoke.py nw_src tr_src num_find ls op [--np NP] [--eval_src EVAL_SRC]

positional arguments:
  nw_src               Graph file. One edge per line.
  tr_src               File containing training comms.
  num_find             Number communities to extract.
  ls                   File containing node labels.
  op                   Write discovered communities to this file.

optional arguments:
  --np NP              Number of subgraph patterns to extract. Default=5
  --eval_src EVAL_SRC  File containing comms. to evaluate against.
  -h, --help           show this help message and exit
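For reference, a command line interface with the arguments listed above could be declared with argparse roughly as follows. This is a reconstruction from the usage text only. The actual run_bespoke.py may differ, and the int types for num_find and --np are assumptions.

```python
import argparse


def build_parser():
    """Reconstruct the documented interface (illustrative, not the real script)."""
    parser = argparse.ArgumentParser(prog="run_bespoke.py")
    parser.add_argument("nw_src", help="Graph file. One edge per line.")
    parser.add_argument("tr_src", help="File containing training comms.")
    # type=int is an assumption; the usage text does not state the type.
    parser.add_argument("num_find", type=int,
                        help="Number communities to extract.")
    parser.add_argument("ls", help="File containing node labels.")
    parser.add_argument("op",
                        help="Write discovered communities to this file.")
    parser.add_argument("--np", type=int, default=5,
                        help="Number of subgraph patterns to extract. Default=5")
    parser.add_argument("--eval_src", default=None,
                        help="File containing comms. to evaluate against.")
    return parser


# Usage:
# args = build_parser().parse_args()
# print(args.nw_src, args.num_find, args.np)
```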


#############################################
#####Example: Running Bespoke
#############################################

Download the DBLP graph and top5000 communities dataset from Stanford's SNAP website.
On Ubuntu/macOS/Linux, run the following commands:

1. python label_nodes.py com-dblp.ungraph.txt 4 dblp_labels.txt --verbose
This generates a file called dblp_labels.txt containing the node-to-label mapping.
Output:
### Generating node labels ###
processing  com-dblp.ungraph.txt
graph loaded in  3.55 seconds
-----------------
jaccard percentiles computed in  42.5 seconds
Jaccard node features extracted/written in  0.27 seconds
-----------------
node features loaded
nodes clustered
Node labels written to file: dblp_labels.txt
###

2. tail -n 100 com-dblp.top5000.cmty.txt > train.txt
Use the last 100 communities as training data.

3. head -n -100 com-dblp.top5000.cmty.txt > test.txt
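Use everything except the last 100 communities as the test/evaluation dataset.
Note: a negative count for "head -n" is a GNU head extension and may not work with the BSD head shipped on macOS.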

4. python run_bespoke.py com-dblp.ungraph.txt train.txt 50000 dblp_labels.txt op.txt --eval_src test.txt
Run Bespoke: generate 50,000 candidate communities and evaluate them against the communities in test.txt.
Output:
#### Input arguments ###
eval_src --> test.txt
tr_src --> train.txt
num_find --> 50000
ls --> dblp_labels.txt
np --> 5
nw_src --> com-dblp.ungraph.txt
op --> op.txt
####


###  Beginning Bespoke ###
Training...
Training complete...
Beginning extraction...
Extraction complete.
total_time(s): 116.09
training_time(s): 103.69
Writing extracted communities to file: op.txt ... done
###  End of Bespoke ###

Scoring extracted communities (may take a lot of time)...
F1 Score: 0.541

######
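The train/test split in steps 2 and 3 can also be done in Python, which avoids depending on a particular version of head and tail. This is a sketch of an equivalent split. The function name split_train_test is hypothetical.

```python
def split_train_test(lines, n_train=100):
    """Return (train, test): the last n_train lines and everything before them.

    Mirrors `tail -n 100` and `head -n -100` from steps 2 and 3.
    """
    lines = list(lines)
    if n_train <= 0:
        return [], lines
    return lines[-n_train:], lines[:-n_train]


# Usage (file names taken from the steps above):
# with open("com-dblp.top5000.cmty.txt") as fh:
#     train, test = split_train_test(fh, 100)
# with open("train.txt", "w") as fh:
#     fh.writelines(train)
# with open("test.txt", "w") as fh:
#     fh.writelines(test)
```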


I would also recommend looking at another user's implementation of Bespoke. It is supposed to be faster overall.
Link: https://github.com/yzhang1918/bespoke-sscd

License:BSD 3-Clause "New" or "Revised" License