
PecanPy Benchmark

This repository provides scripts for reproducing the benchmarking results of several implementations of node2vec, as presented in PecanPy.

Note: all test scripts provided use the SLURM workload manager and will NOT run on a personal computer. This package uses Anaconda to manage environments through conda, so that software packages with conflicting dependency requirements can be run side by side.

1. Quick start

To get started and perform all tests, run the following commands, which first set up the working environment and then submit all test job scripts.

WARNING: the whole dataset takes up 11GB of space; make sure you have enough disk space before setting up.

sh setup.sh # set up the working environment
sh SLURM/submit_all.sh # submit all test job scripts

Note: it might take up to 30 minutes to fully set up, mainly due to downloading large data files.

After all jobs are done (one might encounter memory errors due to limitations of some implementations), two files will be created under the result/ directory:

  • stat_summary.tsv - summary table consisting of runtime statistics of different implementations for different networks. More details about the measurements can be found in section 3
  • evaluation_summary.tsv - summary table for the classification evaluation of the resulting embeddings from different implementations. More details about the evaluation setup can be found in section 4

To generate the runtime stats and evaluation figures from the collected test data, open and run the two Jupyter notebooks (runtime_stats.ipynb and classification_evaluation.ipynb) under the jup/ directory. The figures are produced at the end of each notebook.

2. Setting up

This section describes in more detail the steps that are performed when a user runs setup.sh, which include

  • setting up directory structure for saving results and history files
  • building environments for different implementations
  • downloading and processing data

2.1 Directory

The script/init_setup/setup_dir.sh script will set up the SLURM_history/ directory for holding SLURM job history files, and the result/ directory with the following structure:

result
|-- emb
|    |-- nodevectors
|    |-- orig-cpp
|    |-- orig-py
|    |-- pecanpy-DenseOTF
|    |-- pecanpy-PreComp
|    `-- pecanpy-SparseOTF
`-- stat
     |-- nodevectors
     |-- orig-cpp
     |-- orig-py
     |-- pecanpy-DenseOTF
     |-- pecanpy-PreComp
     `-- pecanpy-SparseOTF

where emb/ holds the embedding files generated by different implementations, and stat/ holds the runtime stat data (execution time and physical memory usage) profiled by the GNU time command. Each of the directories above is further subdivided into one folder per implementation.
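
For reference, the "maximum resident size" recorded by GNU time is the peak physical memory of the process; the same quantity can be read from within Python via the standard resource module (Unix only), as in this minimal sketch:

import resource

# Peak physical memory (maximum resident set size) of the current process.
# On Linux, ru_maxrss is reported in kilobytes (in bytes on macOS).
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("Maximum resident size: %.1f MB" % (peak_kb / 1024))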

2.2 Environments

We tested 6 different implementations of the node2vec algorithm: 3 implementations from PecanPy and 3 from alternative software packages.

Since different libraries require dependencies that are not compatible with each other, we set up a conda environment for each of the libraries (except for orig-cpp, which requires building the C++ source code instead). Information about the conda environments is provided in the env/ directory as .yml files. The script/init_setup/setup_envs.sh script uses these files to set up the three conda environments used later by the different libraries:

  • pecanpy-bench_node2vec
  • pecanpy-bench_nodevectors
  • pecanpy-bench_pecanpy

2.3 Data

Various networks with a wide range of sizes and densities are used for benchmarking the different implementations of the node2vec algorithm. The relatively small networks (BlogCatalog, PPI, Wikipedia) are provided in this repository along with node labels. They were originally downloaded from the node2vec webpage and converted from .mat files to .txt files so that they are easier to load with Python. The rest of the networks are downloaded from other repositories, which is done automatically by the script/init_setup/setup_data.sh script. The following table summarizes the networks tested.

Network        Weighted   # nodes   # edges       Density (unweighted)   File size
BioGRID        No         20,558    238,474       1.13E-03               2.5M
STRING         Yes        17,352    3,640,737     2.42E-02               60M
GIANT-TN-c01   Yes        25,689    38,904,929    1.18E-01               1.1G
GIANT-TN       Yes        25,825    333,452,400   1.00E+00               7.2G
SSN200         Yes        814,731   72,618,574    2.19E-04               2.0G
BlogCatalog    No         10,312    333,983       6.28E-03               3.2M
PPI            No         3,852     38,273        5.16E-03               707K
Wikipedia      Yes        4,777     92,406        8.10E-03               2.0M
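
For illustration, the .mat conversion mentioned above can be sketched as follows, assuming the node2vec-style .mat layout where the adjacency matrix is stored as a scipy sparse matrix under the 'network' key (file names here are hypothetical):

from scipy.io import loadmat

# Load the .mat file and extract the sparse adjacency matrix.
mat = loadmat("BlogCatalog.mat")
adj = mat["network"].tocoo()

# Write each undirected edge once as a tab-separated line.
with open("BlogCatalog.txt", "w") as f:
    for i, j, w in zip(adj.row, adj.col, adj.data):
        if i < j:
            f.write("%d\t%d\t%g\n" % (i, j, w))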

3. Submitting benchmark jobs

Each implementation is tested using two different resource configurations (multi and single). The multi setup aims to test the capability of an implementation to make use of large computational resources, while the single setup tests its ability to run on a smaller amount of computational resources (e.g. what might be found on a laptop). The configuration details are as follows:

Configuration   Core count   Memory (GB)   Time limit (hr)
Multi           28           200           24
Single          1            32            8

As mentioned earlier, SLURM/submit_all.sh will submit all test jobs at once. Each implementation and configuration has its own test script, e.g. SLURM/test_pecanpy-PreComp_single.sb is the batch script for testing PecanPy-PreComp with the single-core resource configuration.

A note on testing the nodevectors implementation: for all implementations besides nodevectors, the p and q parameters are set to 1 by default, but for nodevectors they are set to 1.001. If p and q are set to 1, nodevectors automatically performs 1st order random walks as a shortcut (link to source) instead of the 2nd order walks required by node2vec. Performing 1st order walks is significantly faster than performing 2nd order random walks as all other implementations do, and would therefore bias the test results. Setting p and q to 1.001 disables the automatic fallback to 1st order random walks, while keeping the results close to those obtained with p and q set to 1.
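
To see why the shortcut exists, consider a simplified sketch of the node2vec transition bias (not nodevectors' actual code; graph.weight and graph.has_edge are hypothetical accessors). With p = q = 1 every branch reduces to the plain edge weight, i.e. a 1st order walk, while p = q = 1.001 forces a genuine 2nd order walk with nearly identical biases:

def transition_bias(prev, cur, nxt, graph, p, q):
    # Unnormalized probability of stepping cur -> nxt, given that the
    # walk arrived at cur from prev (2nd order bias from node2vec).
    weight = graph.weight(cur, nxt)
    if nxt == prev:
        return weight / p   # returning to the previous node
    elif graph.has_edge(prev, nxt):
        return weight       # staying within prev's neighborhood
    else:
        return weight / q   # moving outward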

After all test jobs are finished, the Python script script/get_test_info.py will be executed to extract test information from the logging files and summarize it into a single table at result/stat_summary/summary.txt. The summary file can be renamed to reflect particular benchmark conditions if needed. The table consists of the following columns:

  • Network - name of the network embedded
  • Method - name of the implementation of node2vec
  • Setup - computational resource setting of the test
  • Loading time - (stage1) time used to load the network into memory in the desired format
  • Preprocessing time - (stage2) time used to pre-compute transition probability table
  • Walking time - (stage3) time used to generate random walks
  • Training time - (stage4) time used to train word2vec model using the random walks
  • Total time - total run time of the program (from starting Python, includes time to load packages, etc.)
  • Total time in second - same as total time, but converted to seconds
  • Maximum resident size - maximum physical memory used
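
A minimal sketch of loading a summary table of this layout with pandas, assuming the column names listed above and the tab-separated stat_summary.tsv from section 1:

import pandas as pd

# Load the tab-separated summary table.
stats = pd.read_csv("result/stat_summary.tsv", sep="\t")

# For example, compare total runtimes of all implementations on one network.
blogcatalog = stats[stats["Network"] == "BlogCatalog"]
print(blogcatalog.sort_values("Total time in second")[
    ["Method", "Setup", "Total time in second"]])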

4. Classification Evaluation

To ensure the quality of the embeddings, we perform node classification tasks as presented in node2vec for BlogCatalog, PPI, and Wikipedia. The labels for PPI were slightly modified to remove labelsets with fewer than 10 positives, to ensure meaningful evaluation metrics. There are 38 node classes in BlogCatalog, 50 in PPI, and 21 in Wikipedia. For each node class in a network, a one-vs-rest L2-regularized logistic regression model is trained and evaluated through 5-fold cross validation. Each test fold is scored using auROC separately, and the mean auROC score across the 5 folds is reported. This evaluation is repeated 10 times, and the mean of the reported scores is taken as the final evaluation score for the class.
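
A minimal sketch of this protocol with scikit-learn, where emb is the node embedding matrix and y is the binary label vector for one node class (both hypothetical inputs here):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def evaluate_class(emb, y, n_repeats=10, n_splits=5, seed=0):
    repeat_scores = []
    for rep in range(n_repeats):
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True,
                             random_state=seed + rep)
        fold_scores = []
        for train_idx, test_idx in cv.split(emb, y):
            # One-vs-rest L2-regularized logistic regression for this class.
            clf = LogisticRegression(penalty="l2", max_iter=1000)
            clf.fit(emb[train_idx], y[train_idx])
            pred = clf.predict_proba(emb[test_idx])[:, 1]
            fold_scores.append(roc_auc_score(y[test_idx], pred))
        # Mean auROC across the 5 folds for this repetition.
        repeat_scores.append(np.mean(fold_scores))
    # Final score: mean over the 10 repetitions.
    return np.mean(repeat_scores)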

After the evaluation, for each network there is a list of scores per implementation, where each entry corresponds to the evaluation score for a particular node class. A paired Wilcoxon test is then applied for each implementation against the original Python implementation to check whether there is any significant performance drop with the new implementation.
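
A minimal sketch of this test with scipy, using hypothetical per-class auROC scores for one network:

from scipy.stats import wilcoxon

# Hypothetical per-class auROC scores on one network.
scores_orig_py = [0.91, 0.85, 0.78, 0.88, 0.80]
scores_new = [0.90, 0.86, 0.77, 0.89, 0.79]

# One-sided paired test: are the original scores significantly greater,
# i.e. does the new implementation show a performance drop?
stat, pval = wilcoxon(scores_orig_py, scores_new, alternative="greater")
print("Wilcoxon statistic = %.3f, p-value = %.3f" % (stat, pval))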

5. Extending Benchmarks

We welcome researchers to use this repository to benchmark their own networks of interest, or even their own embedding programs. This section provides guidelines for creating new tests, along with requirements for contributing additional implementations to the repository.

5.1 Procedure for adding new implementation

  1. Add the new implementation name to data/implementation_list.txt (see section 5.3 for more requirements on implementations)
  2. Create the test job script using the templates provided (SLURM/_test_template.sb and SLURM/_test_template_single.sb), following the ### MODIFY tags to modify the lines as needed:
  • ### MODIFY1 - sbatch job name (ends with _s for single-core configuration setup)
  • ### MODIFY2 - name of the implementation; must match the name added to data/implementation_list.txt
  • ### MODIFY3 - change directory to the source code of the new implementation if needed (not required if set up through a conda env)
  • ### MODIFY4 - activate the virtual environment for the new implementation if needed
  • ### MODIFY5 - modify command for calling the program to embed network

5.2 Procedure for adding new network

  1. Add the new network as an edgelist file (.edg) and add the network name (without file extension) to data/networks.txt.
  2. Add true (or false) to data/weighted.txt depending on whether the new network is weighted (or not). The order of network names in data/networks.txt and the corresponding weighted information in data/weighted.txt must match.
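
For reference, an .edg file is assumed here to be a plain tab-separated edge list (node, node, and weight if the network is weighted); a toy network can be generated with networkx, for example (file path hypothetical):

import networkx as nx

# Build a small weighted toy network and save it in .edg format.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 0.5), (1, 2, 1.0), (0, 2, 2.0)])
nx.write_weighted_edgelist(G, "toy.edg", delimiter="\t")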

5.3 Contributing

  • Prepare the .yml environment file following the naming convention pecanpy-bench_new-method, where new-method is replaced by the method's name, and modify script/init_setup/setup_envs.sh to set up the environment for the new implementation.
  • Provide an interfacing script for bash to communicate with Python if needed (see the sketch after this list); the following inputs are required:
    • --input - input graph path
    • --output - output embedding path
    • --dimension - embedding dimension
    • --walk-length - length of each walk
    • --num-walks - number of walks per node
    • --workers - number of workers
    • --p - return parameter
    • --q - in-out parameter
  • For proper runtime stat retrieval, add the following print statements to report the execution time of each corresponding stage (the sketch after this list shows one way to produce them):
    print("Took %02d:%02d:%05.2f to load graph"%(hrs, mins, secs))
    print("Took %02d:%02d:%05.2f to pre-compute transition probabilities"%(hrs, mins, secs))
    print("Took %02d:%02d:%05.2f to generate walks"%(hrs, mins, secs))
    print("Took %02d:%02d:%05.2f to train embeddings"%(hrs, mins, secs))
  • Add data retrieval scripts and any necessary preprocessing scripts for new networks (as edgelist files, with the .edg extension)
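
Putting the pieces together, here is a minimal sketch of what such an interfacing script could look like; load_graph, preprocess, walk, and train are hypothetical placeholders for the new implementation's actual stages:

import argparse
import time

def parse_args():
    parser = argparse.ArgumentParser(description="node2vec interfacing script")
    parser.add_argument("--input", required=True, help="input graph path")
    parser.add_argument("--output", required=True, help="output embedding path")
    parser.add_argument("--dimension", type=int, default=128, help="embedding dimension")
    parser.add_argument("--walk-length", type=int, default=80, help="length of each walk")
    parser.add_argument("--num-walks", type=int, default=10, help="number of walks per node")
    parser.add_argument("--workers", type=int, default=1, help="number of workers")
    parser.add_argument("--p", type=float, default=1.0, help="return parameter")
    parser.add_argument("--q", type=float, default=1.0, help="in-out parameter")
    return parser.parse_args()

def report(stage_msg, t_start):
    # Print the elapsed time in the HH:MM:SS.ss format expected by
    # the summary extraction script.
    hrs, rem = divmod(time.time() - t_start, 3600)
    mins, secs = divmod(rem, 60)
    print("Took %02d:%02d:%05.2f to %s" % (hrs, mins, secs, stage_msg))

def main():
    args = parse_args()

    t = time.time()
    graph = load_graph(args.input)  # hypothetical stage 1
    report("load graph", t)

    t = time.time()
    preprocess(graph, args.p, args.q)  # hypothetical stage 2
    report("pre-compute transition probabilities", t)

    t = time.time()
    walks = walk(graph, args.num_walks, args.walk_length)  # hypothetical stage 3
    report("generate walks", t)

    t = time.time()
    emb = train(walks, args.dimension, args.workers)  # hypothetical stage 4
    report("train embeddings", t)

    emb.save(args.output)

if __name__ == "__main__":
    main()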

About

License: BSD 3-Clause "New" or "Revised" License

