deezer / gravity_graph_autoencoders

Source code from the CIKM 2019 article "Gravity-Inspired Graph Autoencoders for Directed Link Prediction" by G. Salha, S. Limnios, R. Hennequin, V.A. Tran and M. Vazirgiannis

Training on False Edges

NobleKennamer opened this issue · comments

Hi,

First thank you for releasing your code and congrats on the interesting model!

I was hoping you could clarify your training process. From my reading of the code, it doesn't appear that the model ever sees false edges during training. Is this correct? If not, could you point me to where these false edges are being sampled? Thank you!

Dear @NobleKennamer,

First of all, thank you very much for your message and your interest!

Yes, our model does see “false edges” during the training process. Since we perform full-batch gradient descent in this implementation, the AE and VAE losses are computed from all node pairs at each training iteration, i.e. all edges and all pairs of unconnected nodes (of the incomplete training graph).

The ground-truth value for each node pair, i.e. A_{i,j} for each pair (i,j), and the model's reconstruction \hat{A}_{i,j} correspond to the labels and preds entries in optimizer.py, respectively.

In practice, as graphs are usually sparse, full-batch learning leads to unbalanced losses in which negative terms are far more numerous. As a consequence, we also re-weight the positive terms (the true edges) in the loss, multiplying them by a pos_weight factor in tf.nn.weighted_cross_entropy_with_logits (see TensorFlow's documentation for details). This pos_weight is inversely proportional to the graph's edge density, i.e. it grows as the graph gets sparser (see line 187 in train.py).
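To make this concrete, here is a minimal sketch of such a full-batch weighted loss, assuming a dense adjacency matrix and TensorFlow 2. The function and variable names (full_batch_loss, adj_dense, logits) are illustrative, not the repository's exact code:

```python
import tensorflow as tf

# Minimal sketch of a full-batch weighted reconstruction loss (illustrative).
# adj_dense: (n, n) ground-truth adjacency of the incomplete training graph.
# logits:    (n, n) reconstructed scores (the \hat{A}_{i,j}, pre-sigmoid).
def full_batch_loss(adj_dense, logits):
    labels = tf.reshape(adj_dense, [-1])  # plays the role of "labels"
    preds = tf.reshape(logits, [-1])      # plays the role of "preds"
    n = tf.cast(tf.shape(adj_dense)[0], tf.float32)
    num_pos = tf.reduce_sum(labels)       # number of true edges
    # Up-weight the rare positive terms: ratio of unconnected pairs to edges.
    pos_weight = (n * n - num_pos) / num_pos
    return tf.reduce_mean(
        tf.nn.weighted_cross_entropy_with_logits(
            labels=labels, logits=preds, pos_weight=pos_weight))
```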

This training strategy (full-batch with positive term re-weighting) corresponds to the one followed by Thomas Kipf in his original TensorFlow implementation of graph AE and VAE, upon which we built our models.

[The following message is more of a discussion on full-batch learning vs. negative sampling]

As an extension of my previous message, I would like to underline that, while full-batch gradient descent permits learning from the entire actual graph instead of sampled approximations, it also suffers from a quadratic O(n^2) complexity, which prevents applications to very large graphs.

To speed up computations, you could resort to negative sampling, i.e. learning from balanced losses in which you would reconstruct all edges, or a subset of them (say, m edges), but only m randomly sampled unconnected node pairs. This is a simple and quite popular strategy, implemented for instance in pytorch_geometric, and I guess this is what you had in mind in your initial question; a minimal sketch is given below.
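For illustration, here is a minimal negative-sampling sketch in plain NumPy, using rejection sampling. The function name sample_negative_pairs and the uniform sampling strategy are my own choices, not the repository's or pytorch_geometric's exact implementation:

```python
import numpy as np

def sample_negative_pairs(edge_set, num_nodes, num_samples, seed=None):
    """Sample `num_samples` directed pairs (i, j) that are not true edges."""
    rng = np.random.default_rng(seed)
    negatives = []
    while len(negatives) < num_samples:
        i, j = rng.integers(0, num_nodes, size=2)
        # Reject self-loops and true edges; cheap for sparse graphs, where
        # almost every uniformly drawn pair is unconnected.
        if i != j and (i, j) not in edge_set:
            negatives.append((int(i), int(j)))
    return negatives
```

The loss is then computed over the m true edges plus these m sampled pairs only, i.e. O(m) terms instead of all n^2 entries.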

From my experience, negative sampling, while speeding up training, also sometimes lowers the model's final performance w.r.t. full-batch learning. I assume this is because we only sample a few random unconnected node pairs and ignore the others, while reconstructing some of these others might actually be crucial. Imagine that you have two nodes with very high degree or centrality: intuitively, knowing that these two important nodes are not connected in the graph can be quite important for learning the embedding. This is what motivated us to instead approximate losses by reconstructing random subgraphs of "important" nodes in this recent paper.