MANDO-Project / ge-sc

MANDO is a new heterogeneous graph representation that learns the structure of heterogeneous contract graphs in order to accurately detect vulnerabilities in smart contract source code at both the coarse-grained contract level and the fine-grained line level.

TypeError: __init__() missing 1 required positional argument: 'source_path'

Astronaut-diode opened this issue · comments

Hello, I am not sure whether this is caused by my environment or by your latest code update not having been uploaded. I followed the requirements in the README and installed the environment, but running the following commands produces errors. Here is an example. My environment is called MANDO, and the commands are copied verbatim.

=======================================================================
(MANDO) astronaut@dell-PowerEdge-T640:/data/space_station/ge-sc$ python node_classifier.py -ld ./logs/node_classification/cfg/gae/access_control --output_models ./models/node_classification/cfg/gae/access_control --dataset ./experiments/ge-sc-data/source_code/access_control/buggy_curated/ --compressed_graph ./experiments/ge-sc-data/source_code/access_control/buggy_curated/cfg_compressed_graphs.gpickle --node_feature gae --feature_extractor ./experiments/ge-sc-data/source_code/gesc_matrices_node_embedding/matrix_gae_dim128_of_core_graph_of_access_control_cfg_buggy_curated.pkl --testset ./experiments/ge-sc-data/source_code/access_control/curated --seed 1
Using backend: pytorch
Training phase
Getting features
Traceback (most recent call last):
  File "node_classifier.py", line 240, in <module>
    train_results, val_results = main(args)
  File "node_classifier.py", line 56, in main
    model = MANDONodeClassifier(args['compressed_graph'], feature_extractor=feature_extractor, node_feature=args['node_feature'], device=device)
TypeError: __init__() missing 1 required positional argument: 'source_path'

commented

Hello. The source_path parameter was removed in this commit. Please make sure you pull the latest version. By the way, this command still works well on my side.
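
For readers hitting the same error: Python raises this TypeError when the installed version of a class still requires a constructor parameter that the calling script no longer passes. The sketch below reproduces the pattern with two hypothetical classes (`OldClassifier` and `NewClassifier` are illustrative stand-ins, not the repo's actual code) and uses `inspect.signature` to list which parameters a constructor still requires.

```python
import inspect


class OldClassifier:
    """Hypothetical stale version: the constructor still requires source_path."""

    def __init__(self, compressed_graph, source_path, node_feature="nodetype"):
        self.compressed_graph = compressed_graph
        self.source_path = source_path
        self.node_feature = node_feature


class NewClassifier:
    """Hypothetical updated version: source_path has been removed."""

    def __init__(self, compressed_graph, node_feature="nodetype"):
        self.compressed_graph = compressed_graph
        self.node_feature = node_feature


def required_params(cls):
    """Return the names of constructor parameters that have no default value."""
    sig = inspect.signature(cls.__init__)
    return [
        name
        for name, p in sig.parameters.items()
        if name != "self"
        and p.default is inspect.Parameter.empty
        and p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD)
    ]


def try_build(cls):
    """Call cls the way an up-to-date script would; return the error text or None."""
    try:
        cls("graph.gpickle", node_feature="gae")
    except TypeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(required_params(OldClassifier))
    print(try_build(OldClassifier))
    print(try_build(NewClassifier))
```

If the class in a local checkout still lists `source_path` as required, the checkout is probably older than the script being run, and pulling the latest version should bring the two back in sync.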

Hmm, I did not really understand that issue, but now I have a new problem: how do I import data from other papers? I already have labels and source files, but I do not know how to convert them into your format, and I could not find this in your commit history or README. 👍

commented

First, thank you for your comments. We are missing some input preprocessing and will add it.
The currently required input is a compressed_graph, which can be generated by the scripts in the process_graphs folder. In addition, when node_feature is "GAE", "LINE", or "Node2vec", you have to refer to those papers and generate the node features from the compressed_graph yourself. We did not include the "GAE", "LINE", or "Node2vec" tools in this repo; we only dumped their output for our current dataset to .pkl files in the gesc_matrices_node_embedding folder.

Yes, I had already noticed that when creating a new dataset with "GAE", "LINE", or "Node2vec", the file that needs to be read does not actually exist. Also, it seems that your paper describes GCN, not GAE.

So I am very curious about how to reproduce the GAE, LINE, and Node2vec features.

Hi, we reused the following GitHub repository with minor modifications to generate the node embeddings of the LINE and Node2vec models.
https://github.com/shenweichen/GraphEmbedding

And the authors' repository with the GAE (or GCN) model.
https://github.com/tkipf/gae

However, please note that the above GitHub repositories are quite old. Running them requires setting up a dedicated environment with some old package versions.

Do you mean to feed those cfg_cg_compressed_graphs.gpickle files into these two libraries to generate the corresponding pre-trained embedded files? Which is the model that you read in when the option is "GAE" or "LINE" or "Node2vec"?

Do you mean to feed those cfg_cg_compressed_graphs.gpickle files into these two libraries to generate the corresponding pre-trained embedded files?

Yes. You feed the cfg_cg_compressed_graphs.gpickle files (using NetworkX format) to the two libraries to generate the corresponding embedded files.
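
As a starting point for that step, here is a minimal sketch of loading a .gpickle file. It assumes the file is a plain pickle of a NetworkX graph, which is what `nx.write_gpickle` produced. `nx.read_gpickle` was removed in NetworkX 3.0, so the sketch uses the standard `pickle` module instead. The function name is illustrative and not part of the MANDO repo.

```python
import pickle


def load_compressed_graph(path):
    """Unpickle a graph stored as .gpickle.

    networkx must be installed for the unpickling to succeed, because the
    pickle references NetworkX classes. Only unpickle files you trust:
    pickle can execute arbitrary code while loading.
    """
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

The returned object can then be inspected with `graph.number_of_nodes()` and `graph.number_of_edges()` before it is passed to an embedding library that accepts NetworkX graphs.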

Thank you. I'll give it a try and hope for the best. :)

I tried it and it worked for LINE and Node2vec, but I couldn't find a suitable interface for the conversion with GAE and GCN.

The authors seem to have modified their repository a bit since I forked it. You can check our train function in the code below. Note that you need to change the import paths of the GCN models and use a suitable TensorFlow version if you want to reuse the code.

from __future__ import division
from __future__ import print_function

import time
import os
import sys

# find path to root directory of the project so as to import from other packages
tokens = os.path.abspath(__file__).split('/')
# print('tokens = ', tokens)
path2root = '/'.join(tokens[:-4])
# print('gae', 'path2root = ', path2root)
if path2root not in sys.path:
    sys.path.append(path2root)

# Train on CPU (hide GPU) due to memory constraints
# os.environ['CUDA_VISIBLE_DEVICES'] = ""

import tensorflow.compat.v1 as tf
import numpy as np
import scipy.sparse as sp

from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score
import networkx as nx

# from gae.optimizer import OptimizerAE, OptimizerVAE
# from gae.input_data import load_data
# from gae.model import GCNModelAE, GCNModelVAE
# from gae.preprocessing import preprocess_graph, construct_feed_dict, sparse_to_tuple, mask_test_edges

from auto_encoders.vgae.gae.optimizer import OptimizerAE, OptimizerVAE
from auto_encoders.vgae.gae.model import GCNModelAE, GCNModelVAE
from auto_encoders.vgae.gae.preprocessing import preprocess_graph, construct_feed_dict, sparse_to_tuple, mask_test_edges

tf.disable_eager_execution()

def train(input_network, model_name='gcn_ae', emb_dim=16):
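    """
    Train a graph auto-encoder on a graph and return its node embeddings.

    :param input_network: networkx network
    :param model_name: 'gcn_vae' or 'gcn_ae'
    :param emb_dim: dimension of the output embeddings (size of hidden layer 2)
    :return: embedding matrix with one row per node (model.z_mean)
    """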
    adj = nx.adjacency_matrix(input_network)

    # Settings
    flags = tf.app.flags
    FLAGS = flags.FLAGS
    FLAGS.remove_flag_values(FLAGS.flag_values_dict())

    flags.DEFINE_float('learning_rate', 0.01, 'Initial learning rate.')
    flags.DEFINE_integer('epochs', 500, 'Number of epochs to train.')
    # flags.DEFINE_integer('epochs', 2000, 'Number of epochs to train.')
    flags.DEFINE_integer('hidden1', 32, 'Number of units in hidden layer 1.')
    flags.DEFINE_integer('hidden2', emb_dim, 'Number of units in hidden layer 2.')
    flags.DEFINE_float('weight_decay', 0., 'Weight for L2 loss on embedding matrix.')
    flags.DEFINE_float('dropout', 0., 'Dropout rate (1 - keep probability).')

    flags.DEFINE_string('model', model_name, 'Model string.')
    # flags.DEFINE_string('dataset', 'cora', 'Dataset string.')
    # flags.DEFINE_integer('features', 1, 'Whether to use features (1) or not (0).')

    model_str = FLAGS.model
    # dataset_str = FLAGS.dataset

    # Load data
    # adj, features = load_data(dataset_str)

    # Store original adjacency matrix (without diagonal entries) for later
    adj_orig = adj
    adj_orig = adj_orig - sp.dia_matrix((adj_orig.diagonal()[np.newaxis, :], [0]), shape=adj_orig.shape)
    adj_orig.eliminate_zeros()

    adj_train, train_edges, val_edges, val_edges_false, test_edges, test_edges_false = mask_test_edges(adj)
    adj = adj_train

    # if FLAGS.features == 0:
    #    features = sp.identity(features.shape[0])  # featureless

    features = sp.identity(adj.shape[0])  # featureless

    # Some preprocessing
    adj_norm = preprocess_graph(adj)

    # Define placeholders
    placeholders = {
        'features': tf.sparse_placeholder(tf.float32),
        'adj': tf.sparse_placeholder(tf.float32),
        'adj_orig': tf.sparse_placeholder(tf.float32),
        'dropout': tf.placeholder_with_default(0., shape=())
    }

    num_nodes = adj.shape[0]

    features = sparse_to_tuple(features.tocoo())
    num_features = features[2][1]
    features_nonzero = features[1].shape[0]

    # Create model
    model = None
    if model_str == 'gcn_ae':
        model = GCNModelAE(placeholders, num_features, features_nonzero)
    elif model_str == 'gcn_vae':
        model = GCNModelVAE(placeholders, num_features, num_nodes, features_nonzero)

    pos_weight = float(adj.shape[0] * adj.shape[0] - adj.sum()) / adj.sum()
    norm = adj.shape[0] * adj.shape[0] / float((adj.shape[0] * adj.shape[0] - adj.sum()) * 2)

    # Optimizer
    with tf.name_scope('optimizer'):
        if model_str == 'gcn_ae':
            opt = OptimizerAE(preds=model.reconstructions,
                              labels=tf.reshape(tf.sparse_tensor_to_dense(placeholders['adj_orig'],
                                                                          validate_indices=False), [-1]),
                              pos_weight=pos_weight,
                              norm=norm)
        elif model_str == 'gcn_vae':
            opt = OptimizerVAE(preds=model.reconstructions,
                               labels=tf.reshape(tf.sparse_tensor_to_dense(placeholders['adj_orig'],
                                                                           validate_indices=False), [-1]),
                               model=model, num_nodes=num_nodes,
                               pos_weight=pos_weight,
                               norm=norm)

    # Initialize session
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    cost_val = []
    acc_val = []

    def get_roc_score(edges_pos, edges_neg, emb=None):
        if emb is None:
            feed_dict.update({placeholders['dropout']: 0})
            emb = sess.run(model.z_mean, feed_dict=feed_dict)

        def sigmoid(x):
            return 1 / (1 + np.exp(-x))

        # Predict on test set of edges
        adj_rec = np.dot(emb, emb.T)
        preds = []
        pos = []
        for e in edges_pos:
            preds.append(sigmoid(adj_rec[e[0], e[1]]))
            pos.append(adj_orig[e[0], e[1]])

        preds_neg = []
        neg = []
        for e in edges_neg:
            preds_neg.append(sigmoid(adj_rec[e[0], e[1]]))
            neg.append(adj_orig[e[0], e[1]])

        preds_all = np.hstack([preds, preds_neg])
        labels_all = np.hstack([np.ones(len(preds)), np.zeros(len(preds_neg))])
        roc_score = roc_auc_score(labels_all, preds_all)
        ap_score = average_precision_score(labels_all, preds_all)

        return roc_score, ap_score

    # cost_val and acc_val are already initialized above
    val_roc_score = []

    adj_label = adj_train + sp.eye(adj_train.shape[0])
    adj_label = sparse_to_tuple(adj_label)

    # Train model
    for epoch in range(FLAGS.epochs):
        t = time.time()
        # Construct feed dictionary
        feed_dict = construct_feed_dict(adj_norm, adj_label, features, placeholders)
        feed_dict.update({placeholders['dropout']: FLAGS.dropout})
        # Run single weight update
        outs = sess.run([opt.opt_op, opt.cost, opt.accuracy], feed_dict=feed_dict)

        # Compute average loss
        avg_cost = outs[1]
        avg_accuracy = outs[2]

        roc_curr, ap_curr = get_roc_score(val_edges, val_edges_false)
        val_roc_score.append(roc_curr)

        print("Epoch:", '%04d' % (epoch + 1), "train_loss=", "{:.5f}".format(avg_cost),
              "train_acc=", "{:.5f}".format(avg_accuracy), "val_roc=", "{:.5f}".format(val_roc_score[-1]),
              "val_ap=", "{:.5f}".format(ap_curr),
              "time=", "{:.5f}".format(time.time() - t))

    print("Optimization Finished!")

    roc_score, ap_score = get_roc_score(test_edges, test_edges_false)
    print('Test ROC score: ' + str(roc_score))
    print('Test AP score: ' + str(ap_score))

    feed_dict.update({placeholders['dropout']: 0})
    emb = sess.run(model.z_mean, feed_dict=feed_dict)
    # print('type(emb) = ', type(emb))
    # print('emb.shape = ', emb.shape)
    return emb

# emb = train(nx.karate_club_graph(), emb_dim=32)
# emb = np.asmatrix(emb)
# print('type(emb) = ', type(emb))
# print('emb.shape = ', emb.shape)
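
Because `train` builds its adjacency matrix with `nx.adjacency_matrix(input_network)`, row i of the returned `emb` corresponds to the i-th node in `input_network.nodes()` order. The sketch below pairs rows with node labels under that assumption. The helper name and the toy data are illustrative, and the layout MANDO expects inside its .pkl files is not specified in this thread.

```python
def rows_by_node(node_order, emb_rows):
    """Pair each embedding row with its node label, keeping the graph's node order.

    node_order: iterable of node labels, e.g. list(input_network.nodes())
    emb_rows:   iterable of rows, e.g. the 2-D array returned by train()
    """
    node_order = list(node_order)
    emb_rows = list(emb_rows)
    if len(node_order) != len(emb_rows):
        raise ValueError(
            f"{len(node_order)} nodes but {len(emb_rows)} embedding rows"
        )
    return dict(zip(node_order, emb_rows))


if __name__ == "__main__":
    # Toy data standing in for a graph with three nodes and 2-d embeddings.
    toy = rows_by_node(["n0", "n1", "n2"], [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
    print(toy["n1"])
```

Keeping this mapping next to the saved matrix makes it possible to check later that features are attached to the intended nodes.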

OK, I ran your source code, but it exceeds my memory limit. I tried many ways to reduce the consumption, but they all failed, so I have to give up for now.
It is worth mentioning that your model also consumes quite a lot of memory. No criticism, hah! :)
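
One likely contributor, judging from the train function above, is that it materializes dense N × N arrays for a graph with N nodes. The optimizer labels come from densifying and flattening `adj_orig`, and `get_roc_score` computes `np.dot(emb, emb.T)`. The sketch below is a rough back-of-the-envelope estimate assuming 4-byte float32 entries. The node counts are illustrative and not measured from the MANDO datasets.

```python
def dense_matrix_bytes(num_nodes, bytes_per_entry=4):
    """Memory needed for one dense num_nodes x num_nodes matrix."""
    return num_nodes * num_nodes * bytes_per_entry


def as_gib(num_bytes):
    """Format a byte count in GiB with two decimals."""
    return f"{num_bytes / 1024 ** 3:.2f} GiB"


if __name__ == "__main__":
    for n in (10_000, 50_000, 100_000):
        print(n, "nodes ->", as_gib(dense_matrix_bytes(n)), "per dense matrix")
```

Since the cost grows quadratically, doubling the number of nodes roughly quadruples the memory for each such matrix.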

Yes, running the GCN model on our contract graphs requires a powerful GPU (we used an Nvidia A100-16GB). However, if you only want quick results, I suggest focusing on node features generated from node-type one-hot vectors, LINE, and Node2vec. In our experiments, the results from these settings were often better than those from the GCN model. Besides, these settings can run with limited GPU resources, especially LINE, which is designed for very large graphs.
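
To illustrate the simplest of these options, here is a minimal sketch of node-type one-hot features: each node gets a vector with a single 1 in the position of its type. The type names below are hypothetical placeholders, and this is not the repo's own implementation.

```python
def one_hot_node_types(node_types):
    """Build one-hot feature rows from a list of node-type strings.

    Returns (vocab, features), where vocab is the sorted list of distinct
    types and features[i] is the one-hot row for node_types[i].
    """
    vocab = sorted(set(node_types))
    index = {node_type: i for i, node_type in enumerate(vocab)}
    features = []
    for node_type in node_types:
        row = [0] * len(vocab)
        row[index[node_type]] = 1
        features.append(row)
    return vocab, features


if __name__ == "__main__":
    # Hypothetical node types for four nodes of a contract graph.
    vocab, feats = one_hot_node_types(["FUNCTION", "EXPRESSION", "FUNCTION", "IF"])
    print(vocab)
    print(feats)
```

These features need no pre-training step, which is why they are cheap compared with the GAE embeddings.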

Thank you, I've successfully reproduced everything except the gae part, and it's great.