deezer / gravity_graph_autoencoders

Source code from the CIKM 2019 article "Gravity-Inspired Graph Autoencoders for Directed Link Prediction" by G. Salha, S. Limnios, R. Hennequin, V.A. Tran and M. Vazirgiannis

Using fully populated feature matrix

binonteji opened this issue · comments

Thanks, this is a great paper.

As far as I understand, the feature matrix used in the paper is an identity matrix (sparse in nature).

Can I use my own graph (an N x N adjacency matrix) together with its corresponding feature matrix (N x F) with the same code?
My feature matrix is fully populated.

Do I have to make any changes?

Dear @binonteji,

First of all, thank you for your nice message.

Yes! You can load your own graph dataset in edgelist format by adding a condition at the top of the load_data function in the input_data.py file, e.g. as follows:

def load_data(dataset):

    if dataset == 'YOUR_OWN_DATASET':
        # Read a directed graph from an edgelist file
        adj = nx.adjacency_matrix(nx.read_edgelist("../data/YOUR_OWN_DATASET_FILE",
                                                   delimiter = "YOUR_DELIMITER",
                                                   create_using = nx.DiGraph()))
        # Featureless case: use a (sparse) identity matrix as node features
        features = sp.identity(adj.shape[0])

    elif dataset == 'cora':
        [...]

Using your own fully populated feature matrix is also possible: replacing the above features = sp.identity(adj.shape[0]) by code that reads your file should be enough. The exact line will depend on your data format but, assuming it is a simple CSV file, the following command should work (with an import numpy as np at the top of the input_data.py file):

features = sp.csr_matrix(np.genfromtxt("../data/YOUR_OWN_FEATURES_FILE.csv", delimiter = "YOUR_DELIMITER"))

Please let me know if you encounter any difficulties/errors.

Guillaume

P.S.: On purpose and for simplicity, I keep the sp.csr_matrix even though your features are technically not sparse.
Without it, I suspect we would run into one or two errors to fix later in the code; I'd need to double-check.

P.S. 2: I'd advise you to double-check that your feature vectors are ordered consistently with how the nodes of the graph are read/ordered.

Dear @GuillaumeSalhaGalvan

Thanks for the great suggestion; I tried it!

  1. I loaded my network (adjacency matrix) with the populated feature matrix, but it gives a very high loss and low accuracy.
     [training plot: 1000_nodes]

  2. Moreover, I feel that loading the highly populated feature matrix in sp.csr_matrix(...) is not an optimized approach.

Could you advise me on both of these points?

**Info:** Feature matrix used: 1000 x 210
Thanks in advance!

Hi @binonteji,

Thank you for your message.

I'm afraid it's impossible for me to draw a conclusion by only looking at this plot.

First and foremost, instead of the opt.cost and opt.accuracy from training, I would suggest checking the actual performance of your model on link prediction tasks (for instance, the "general directed link prediction" task from the paper, a.k.a. task_1 in the code), i.e. the evolution of the AUC and AP scores on the validation (and, ultimately, test) sets. As this code automatically constructs balanced validation and test sets for each task, the AUC and AP should be larger than 0.5, with a score of 0.5 corresponding to a random link predictor.

Then, these models involve several hyperparameters that can strongly impact learning. In particular, I would recommend carefully tuning the learning_rate (crucial!), the lamb value, and the dimensions of the GCN layers.

Besides, to properly assess whether a score is really "low", I would also compare the AUC/AP from the gravity AE or VAE to scores obtained with other baselines, such as the standard graph AE/VAE (already re-implemented in this repo) or another node embedding method such as node2vec or APP:

  • if scores are "low" for all these methods: then it might be due to some intrinsic difficulty in your task/data. As I don't know what your graph and your features look like, it's hard for me to confirm it at this stage. Low scores might also be caused by a bug when processing data (see the "P.S. 2" above).
  • if scores are only "low" for gravity graph AE/VAE but "high" for all methods: then I can try to further investigate with you. Just send me an email if you don't want to publicly talk about your data or your project on GitHub.

> Moreover, I feel loading the highly populated feature matrix in sp.csr_matrix(...) is not an optimized way.

Indeed, I agree. :) This was the quickest solution to make your code run without further modification, but it could/should definitely be optimized.
I would need to take a closer look and see how to properly adapt the call to sparse_to_tuple and then the first GCN layers (all based on GraphConvolutionSparse, as in the tkipf/gae repo) in the most efficient way.