TypeError: __init__() missing 1 required positional argument: 'source_path'

Question

Astronaut-diode opened this issue a year ago

Hello, I am not sure whether this is caused by my environment or by your latest code update not having been uploaded. I installed the environment following the requirements in the README, but some commands fail when I run them. Here is an example. My environment is called MANDO, and the command below was copied as-is.

=======================================================================
(MANDO) astronaut@dell-PowerEdge-T640:/data/space_station/ge-sc$ python node_classifier.py -ld ./logs/node_classification/cfg/gae/access_control --output_models ./models/node_classification/cfg/gae/access_control --dataset ./experiments/ge-sc-data/source_code/access_control/buggy_curated/ --compressed_graph ./experiments/ge-sc-data/source_code/access_control/buggy_curated/cfg_compressed_graphs.gpickle --node_feature gae --feature_extractor ./experiments/ge-sc-data/source_code/gesc_matrices_node_embedding/matrix_gae_dim128_of_core_graph_of_access_control_cfg_buggy_curated.pkl --testset ./experiments/ge-sc-data/source_code/access_control/curated --seed 1
Using backend: pytorch
Training phase
Getting features
Traceback (most recent call last):
  File "node_classifier.py", line 240, in <module>
    train_results, val_results = main(args)
  File "node_classifier.py", line 56, in main
    model = MANDONodeClassifier(args['compressed_graph'], feature_extractor=feature_extractor, node_feature=args['node_feature'], device=device)
TypeError: __init__() missing 1 required positional argument: 'source_path'

minhnn · Answer 1 · Wed Mar 01 2023 11:05:15 GMT+0800 (China Standard Time)

Hello, the source_path parameter was removed at this commit. Please make sure you have pulled the latest version. By the way, this command still works well on my side.
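
For anyone hitting the same traceback, the sketch below reproduces the failure mode with a stand-in class and shows one way to check which arguments a constructor still requires. `OldClassifier`, `required_args`, and `try_construct` are hypothetical names used only for illustration; `OldClassifier` merely mimics a stale checkout whose constructor still needs `source_path` and is not the MANDO code.

```python
import inspect


class OldClassifier:
    """Hypothetical stand-in for a stale checkout that still requires source_path."""

    def __init__(self, compressed_graph, source_path, feature_extractor=None,
                 node_feature="nodetype", device="cpu"):
        self.compressed_graph = compressed_graph
        self.source_path = source_path


def required_args(cls):
    """Return the constructor parameters that have no default value."""
    params = inspect.signature(cls.__init__).parameters
    positional = (inspect.Parameter.POSITIONAL_ONLY,
                  inspect.Parameter.POSITIONAL_OR_KEYWORD)
    return [name for name, p in params.items()
            if name != "self"
            and p.default is inspect.Parameter.empty
            and p.kind in positional]


def try_construct(cls, *args, **kwargs):
    """Call the constructor; return the TypeError message, or None on success."""
    try:
        cls(*args, **kwargs)
    except TypeError as exc:
        return str(exc)
    return None


# The call below omits source_path, just like the call in node_classifier.py.
print(required_args(OldClassifier))
print(try_construct(OldClassifier, "graph.gpickle", node_feature="gae"))
```

If the class imported in your environment still lists `source_path` as required, the checkout (or an installed copy of the package) is older than the calling script.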

徐敬杰 · Answer 2 · Wed Mar 01 2023 13:37:40 GMT+0800 (China Standard Time)

Hmm, I don't really understand this problem, but I now have a new one: how do I import data from other papers? I already have the labels and source files, but I don't know how to convert them into your format, and I could not find this in your commit history or the README. 👍

minhnn · Answer 3 · Wed Mar 01 2023 14:52:44 GMT+0800 (China Standard Time)

First of all, thank you for your comments. We are missing some input preprocessing and will update it.
The currently required input is a compressed_graph, which can be generated by the scripts in the process_graphs folder. In addition, when node_feature is "GAE", "LINE", or "Node2vec", you have to refer to those papers and generate the node features from the compressed_graph yourself. We did not include the "GAE", "LINE", or "Node2vec" tools in this repo; we only dumped their output for our current dataset to .pkl files in the gesc_matrices_node_embedding folder.

徐敬杰 · Answer 4 · Wed Mar 01 2023 14:59:41 GMT+0800 (China Standard Time)

Yes, I've already discovered that when creating a new dataset with "GAE", "LINE", or "Node2vec", the file that needs to be read doesn't actually exist. Also, as far as I can see, your paper describes GCN, not GAE.

徐敬杰 · Answer 5 · Wed Mar 01 2023 15:06:15 GMT+0800 (China Standard Time)

So I was very curious about how to recreate GAE, LINE, and Node2vec.

Hoang H. Nguyen · Answer 6 · Wed Mar 01 2023 16:28:36 GMT+0800 (China Standard Time)

Hi, we reused the following GitHub repository with minor modifications to generate the node embeddings of the LINE and Node2vec models.
https://github.com/shenweichen/GraphEmbedding

For the GAE (or GCN) model, we used the authors' repository:
https://github.com/tkipf/gae

However, please note that the above GitHub repositories are quite old. You will need to set up a dedicated environment with older dependency versions to run them.

徐敬杰 · Answer 7 · Wed Mar 01 2023 16:45:05 GMT+0800 (China Standard Time)

Do you mean to feed those cfg_cg_compressed_graphs.gpickle files into these two libraries to generate the corresponding pre-trained embedded files? Which is the model that you read in when the option is "GAE" or "LINE" or "Node2vec"?

Hoang H. Nguyen · Answer 8 · Wed Mar 01 2023 16:57:05 GMT+0800 (China Standard Time)

"Do you mean to feed those cfg_cg_compressed_graphs.gpickle files into these two libraries to generate the corresponding pre-trained embedded files?"

Yes. You feed the cfg_cg_compressed_graphs.gpickle files (using NetworkX format) to the two libraries to generate the corresponding embedded files.
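
One detail worth checking when doing this is that the rows of the saved embedding file follow the same node order as the graph. The sketch below assumes the embedding tool returns a mapping from node ID to vector and that the result is pickled as a list of rows. Both are assumptions for illustration, since the exact layout of the .pkl files is not described in this thread, and `embeddings_to_rows` and `save_rows` are hypothetical helper names.

```python
import os
import pickle
import tempfile


def embeddings_to_rows(node_order, embeddings):
    """Arrange per-node vectors into rows that follow node_order.

    node_order: list of node IDs in the order the graph stores them
                (for a NetworkX graph this would be list(graph.nodes())).
    embeddings: dict mapping node ID -> list of floats (assumed tool output).
    """
    missing = [n for n in node_order if n not in embeddings]
    if missing:
        raise KeyError(f"no embedding for nodes: {missing}")
    return [list(embeddings[n]) for n in node_order]


def save_rows(rows, path):
    """Pickle the row list; the real repo may expect a different container."""
    with open(path, "wb") as fh:
        pickle.dump(rows, fh)


# Toy data: three nodes with 2-dimensional vectors, deliberately out of order.
node_order = [0, 1, 2]
embeddings = {2: [0.5, 0.6], 0: [0.1, 0.2], 1: [0.3, 0.4]}

rows = embeddings_to_rows(node_order, embeddings)
path = os.path.join(tempfile.mkdtemp(), "toy_embedding.pkl")
save_rows(rows, path)

# Read the file back to confirm that the round trip preserves the row order.
with open(path, "rb") as fh:
    restored = pickle.load(fh)
print(restored)
```

Raising on missing nodes is a deliberate choice here: a silently shorter matrix would shift every later row onto the wrong node.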

徐敬杰 · Answer 9 · Wed Mar 01 2023 16:59:38 GMT+0800 (China Standard Time)

Thank you. I'll give it a try and hope for the best. :)

徐敬杰 · Answer 10 · Thu Mar 02 2023 14:34:44 GMT+0800 (China Standard Time)

I tried it and it worked for LINE and Node2vec, but I couldn't find a suitable interface for the conversion with GAE and GCN.

Hoang H. Nguyen · Answer 11 · Thu Mar 02 2023 22:15:32 GMT+0800 (China Standard Time)

The authors seem to have modified their repository a bit since I forked it. You can check our train function in the code below. Note that you will need to adjust the import paths of the GCN model modules and use a suitable TensorFlow version if you want to reuse the code.

from __future__ import division
from __future__ import print_function

import time
import os
import sys

# find path to root directory of the project so as to import from other packages
tokens = os.path.abspath(__file__).split('/')
# print('tokens = ', tokens)
path2root = '/'.join(tokens[:-4])
# print('gae', 'path2root = ', path2root)
if path2root not in sys.path:
    sys.path.append(path2root)

# Train on CPU (hide GPU) due to memory constraints
# os.environ['CUDA_VISIBLE_DEVICES'] = ""

import tensorflow.compat.v1 as tf
import numpy as np
import scipy.sparse as sp

from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score
import networkx as nx

# from gae.optimizer import OptimizerAE, OptimizerVAE
# from gae.input_data import load_data
# from gae.model import GCNModelAE, GCNModelVAE
# from gae.preprocessing import preprocess_graph, construct_feed_dict, sparse_to_tuple, mask_test_edges

from auto_encoders.vgae.gae.optimizer import OptimizerAE, OptimizerVAE
from auto_encoders.vgae.gae.model import GCNModelAE, GCNModelVAE
from auto_encoders.vgae.gae.preprocessing import preprocess_graph, construct_feed_dict, sparse_to_tuple, mask_test_edges

tf.disable_eager_execution()

def train(input_network, model_name='gcn_ae', emb_dim=16):
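    """
    Train a graph auto-encoder on a graph and return the learned node embeddings.

    :param input_network: networkx network
    :param model_name: 'gcn_vae' or 'gcn_ae'
    :param emb_dim: embedding size (number of units in hidden layer 2)
    :return: node embeddings (the value of model.z_mean after training)
    """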
    adj = nx.adjacency_matrix(input_network)

    # Settings
    flags = tf.app.flags
    FLAGS = flags.FLAGS
    FLAGS.remove_flag_values(FLAGS.flag_values_dict())

    flags.DEFINE_float('learning_rate', 0.01, 'Initial learning rate.')
    flags.DEFINE_integer('epochs', 500, 'Number of epochs to train.')
    # flags.DEFINE_integer('epochs', 2000, 'Number of epochs to train.')
    flags.DEFINE_integer('hidden1', 32, 'Number of units in hidden layer 1.')
    flags.DEFINE_integer('hidden2', emb_dim, 'Number of units in hidden layer 2.')
    flags.DEFINE_float('weight_decay', 0., 'Weight for L2 loss on embedding matrix.')
    flags.DEFINE_float('dropout', 0., 'Dropout rate (1 - keep probability).')

    flags.DEFINE_string('model', model_name, 'Model string.')
    # flags.DEFINE_string('dataset', 'cora', 'Dataset string.')
    # flags.DEFINE_integer('features', 1, 'Whether to use features (1) or not (0).')

    model_str = FLAGS.model
    # dataset_str = FLAGS.dataset

    # Load data
    # adj, features = load_data(dataset_str)

    # Store original adjacency matrix (without diagonal entries) for later
    adj_orig = adj
    adj_orig = adj_orig - sp.dia_matrix((adj_orig.diagonal()[np.newaxis, :], [0]), shape=adj_orig.shape)
    adj_orig.eliminate_zeros()

    adj_train, train_edges, val_edges, val_edges_false, test_edges, test_edges_false = mask_test_edges(adj)
    adj = adj_train

    # if FLAGS.features == 0:
    #    features = sp.identity(features.shape[0])  # featureless

    features = sp.identity(adj.shape[0])  # featureless

    # Some preprocessing
    adj_norm = preprocess_graph(adj)

    # Define placeholders
    placeholders = {
        'features': tf.sparse_placeholder(tf.float32),
        'adj': tf.sparse_placeholder(tf.float32),
        'adj_orig': tf.sparse_placeholder(tf.float32),
        'dropout': tf.placeholder_with_default(0., shape=())
    }

    num_nodes = adj.shape[0]

    features = sparse_to_tuple(features.tocoo())
    num_features = features[2][1]
    features_nonzero = features[1].shape[0]

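    # Create model
    if model_str == 'gcn_ae':
        model = GCNModelAE(placeholders, num_features, features_nonzero)
    elif model_str == 'gcn_vae':
        model = GCNModelVAE(placeholders, num_features, num_nodes, features_nonzero)
    else:
        # Fail early instead of hitting an undefined optimizer further down
        raise ValueError("model_name must be 'gcn_ae' or 'gcn_vae', got %r" % model_str)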

    pos_weight = float(adj.shape[0] * adj.shape[0] - adj.sum()) / adj.sum()
    norm = adj.shape[0] * adj.shape[0] / float((adj.shape[0] * adj.shape[0] - adj.sum()) * 2)

    # Optimizer
    with tf.name_scope('optimizer'):
        if model_str == 'gcn_ae':
            opt = OptimizerAE(preds=model.reconstructions,
                              labels=tf.reshape(tf.sparse_tensor_to_dense(placeholders['adj_orig'],
                                                                          validate_indices=False), [-1]),
                              pos_weight=pos_weight,
                              norm=norm)
        elif model_str == 'gcn_vae':
            opt = OptimizerVAE(preds=model.reconstructions,
                               labels=tf.reshape(tf.sparse_tensor_to_dense(placeholders['adj_orig'],
                                                                           validate_indices=False), [-1]),
                               model=model, num_nodes=num_nodes,
                               pos_weight=pos_weight,
                               norm=norm)

    # Initialize session
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    def get_roc_score(edges_pos, edges_neg, emb=None):
        if emb is None:
            feed_dict.update({placeholders['dropout']: 0})
            emb = sess.run(model.z_mean, feed_dict=feed_dict)

        def sigmoid(x):
            return 1 / (1 + np.exp(-x))

        # Predict on test set of edges
        adj_rec = np.dot(emb, emb.T)
        preds = []
        pos = []
        for e in edges_pos:
            preds.append(sigmoid(adj_rec[e[0], e[1]]))
            pos.append(adj_orig[e[0], e[1]])

        preds_neg = []
        neg = []
        for e in edges_neg:
            preds_neg.append(sigmoid(adj_rec[e[0], e[1]]))
            neg.append(adj_orig[e[0], e[1]])

        preds_all = np.hstack([preds, preds_neg])
        labels_all = np.hstack([np.ones(len(preds)), np.zeros(len(preds_neg))])
        roc_score = roc_auc_score(labels_all, preds_all)
        ap_score = average_precision_score(labels_all, preds_all)

        return roc_score, ap_score

    cost_val = []
    acc_val = []
    val_roc_score = []

    adj_label = adj_train + sp.eye(adj_train.shape[0])
    adj_label = sparse_to_tuple(adj_label)

    # Train model
    for epoch in range(FLAGS.epochs):
        t = time.time()
        # Construct feed dictionary
        feed_dict = construct_feed_dict(adj_norm, adj_label, features, placeholders)
        feed_dict.update({placeholders['dropout']: FLAGS.dropout})
        # Run single weight update
        outs = sess.run([opt.opt_op, opt.cost, opt.accuracy], feed_dict=feed_dict)

        # Compute average loss
        avg_cost = outs[1]
        avg_accuracy = outs[2]

        roc_curr, ap_curr = get_roc_score(val_edges, val_edges_false)
        val_roc_score.append(roc_curr)

        print("Epoch:", '%04d' % (epoch + 1), "train_loss=", "{:.5f}".format(avg_cost),
              "train_acc=", "{:.5f}".format(avg_accuracy), "val_roc=", "{:.5f}".format(val_roc_score[-1]),
              "val_ap=", "{:.5f}".format(ap_curr),
              "time=", "{:.5f}".format(time.time() - t))

    print("Optimization Finished!")

    roc_score, ap_score = get_roc_score(test_edges, test_edges_false)
    print('Test ROC score: ' + str(roc_score))
    print('Test AP score: ' + str(ap_score))

    feed_dict.update({placeholders['dropout']: 0})
    emb = sess.run(model.z_mean, feed_dict=feed_dict)
    # print('type(emb) = ', type(emb))
    # print('emb.shape = ', emb.shape)
    return emb

# emb = train(nx.karate_club_graph(), emb_dim=32)
# emb = np.asmatrix(emb)
# print('type(emb) = ', type(emb))
# print('emb.shape = ', emb.shape)
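
The two weighting terms computed in `train` (`pos_weight` and `norm`) can be checked by hand on a tiny graph. The sketch below redoes the same arithmetic with plain Python lists instead of SciPy matrices. The helper name `class_weights` and the 4-node path graph are made up for illustration.

```python
def class_weights(adj):
    """Mirror the pos_weight / norm arithmetic from train() for a dense 0/1 matrix."""
    n = len(adj)
    total = n * n                          # all entries of the adjacency matrix
    ones = sum(sum(row) for row in adj)    # non-zero entries (each edge counted twice)
    pos_weight = float(total - ones) / ones
    norm = total / float((total - ones) * 2)
    return pos_weight, norm


# Path graph 0-1-2-3 as a symmetric adjacency matrix.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

pos_weight, norm = class_weights(adj)
print(pos_weight, norm)
```

Here 6 of the 16 entries are ones, so `pos_weight` is 10/6 and `norm` is 16/20. The sparser the graph, the larger `pos_weight` becomes, which is what lets the rare positive entries count for more in the reconstruction loss.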

徐敬杰 · Answer 12 · Fri Mar 03 2023 13:12:22 GMT+0800 (China Standard Time)

OK, I ran your source code, but it exceeds my memory limit. I tried many ways to reduce the consumption and all of them failed, so I have to give up on it for now.
It is worth mentioning that your model also consumes quite a lot of memory. No criticism, haha! :)

Hoang H. Nguyen · Answer 13 · Fri Mar 03 2023 19:09:02 GMT+0800 (China Standard Time)

Yes, running the GCN model on our contract graphs requires a powerful GPU (we used an Nvidia A100-16GB). However, if you only want quick results, I suggest focusing on the node features generated from node-type one-hot vectors, LINE, and node2vec. In our experiments, the results from these settings were often better than those from the GCN model. These settings can also run with limited GPU resources, especially LINE, which is designed for very large graphs.
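
As a rough illustration of the node-type one-hot option, the sketch below turns a list of node types into one-hot vectors. The type names and the helper `one_hot_node_types` are invented for the example, and the repository's own encoding may differ.

```python
def one_hot_node_types(node_types):
    """Return (vocabulary, vectors) with one one-hot vector per node.

    The vocabulary is the sorted set of type names, so the column order is
    deterministic for a given dataset.
    """
    vocabulary = sorted(set(node_types))
    index = {name: i for i, name in enumerate(vocabulary)}
    vectors = []
    for name in node_types:
        vec = [0] * len(vocabulary)
        vec[index[name]] = 1
        vectors.append(vec)
    return vocabulary, vectors


# Hypothetical node types for four graph nodes.
node_types = ["FUNCTION", "EXPRESSION", "IF", "EXPRESSION"]
vocabulary, vectors = one_hot_node_types(node_types)
print(vocabulary)
print(vectors)
```

Features like these need no pre-training step, which is why this setting is cheap compared with the auto-encoder.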

徐敬杰 · Answer 14 · Fri Mar 03 2023 20:35:19 GMT+0800 (China Standard Time)

Thank you, I've successfully reproduced everything except the GAE part, and it's great.
