THUDM / GCC

GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training @ KDD 2020


Questions about pretraining subgraphs

Kqiii opened this issue · comments


Hi,

May I ask some questions about the pretraining subgraphs?

  1. Why do you apply a (** 0.75) operation to the individual node degrees? What is the benefit of this? My current guess is sketched in the first snippet after this list.

    degrees = torch.cat([g.in_degrees().double() ** 0.75 for g in self.graphs])

  2. Here, the "replace" option is set to True:

    self.length, size=self.num_samples, replace=True, p=prob.numpy()

    I believe it is likely that some nodes will be sampled two or more times, which might harm the contrastive training process. For example, if node v is sampled twice, it gets two query-key pairs, (g_1, g_2) and (g_3, g_4). In contrastive training, (g_1, g_2) is treated as a positive pair while (g_1, g_3) is treated as a negative pair, even though all four subgraphs, i.e., g_1 to g_4, are sampled from the ego-graph of the same node v. Would it be better to set this option to False? Or did I misunderstand something about the contrastive training process? (A small sketch of the duplicate sampling is included after this list.)

  3. Why is there a max(self.rw_hops, ...) operation? What is the disadvantage of just using the preset self.rw_hops for every node? Moreover, why is there also a (** 0.75) operation here? (My reading of the expression is sketched in the last snippet after this list.)

    max_nodes_per_seed = max(
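
About question 1: my guess (not from the paper or the code, just my reading) is that raising the degrees to the power 0.75 flattens the seed-sampling distribution, so hub nodes are not chosen overwhelmingly often, similar to the unigram ** 0.75 smoothing used for negative sampling in word2vec. A tiny sketch of the effect:

    import torch

    # Toy degree vector: one hub node and several low-degree nodes.
    degrees = torch.tensor([1000.0, 10.0, 10.0, 10.0, 10.0])

    raw_prob = degrees / degrees.sum()
    smoothed_prob = degrees ** 0.75 / (degrees ** 0.75).sum()

    print(raw_prob)       # the hub gets ~0.96 of the probability mass
    print(smoothed_prob)  # the hub drops to ~0.89; tail nodes are sampled more often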
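
About question 2: assuming the call above is numpy-style weighted sampling (the sizes and probabilities below are made up for illustration), sampling with replace=True routinely returns duplicate seeds, which is what my concern is based on:

    import numpy as np

    rng = np.random.default_rng(0)
    num_nodes, num_samples = 100, 64

    # Uniform weights just for illustration; the real code uses the smoothed degrees.
    prob = np.full(num_nodes, 1.0 / num_nodes)

    idx = rng.choice(num_nodes, size=num_samples, replace=True, p=prob)
    duplicates = num_samples - len(np.unique(idx))
    print(f"{duplicates} of the {num_samples} sampled seeds are repeats")

Each repeated seed contributes two query-key pairs to the batch, and pairs across the two copies are then treated as negatives even though all of their subgraphs come from the same ego-graph.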
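
About question 3: the helper and the scale factor below are hypothetical, not a quote of the repo code; they only sketch how I currently read the shape of the expression, namely a per-seed budget that grows with degree ** 0.75 but never drops below the preset self.rw_hops.

    def max_nodes_per_seed(out_degree, rw_hops=64, scale=2.0):
        # Hypothetical sketch: a walk budget that scales with the seed's
        # degree (smoothed by ** 0.75) and is floored at rw_hops.
        # `scale` is a placeholder for whatever factor the implementation uses.
        return max(rw_hops, int(scale * out_degree ** 0.75 + 0.5))

    print(max_nodes_per_seed(5))     # low-degree seed: the rw_hops floor applies -> 64
    print(max_nodes_per_seed(2000))  # high-degree hub: budget scales with degree -> 598

If that reading is correct, my question is whether the degree-dependent part matters much in practice compared to a fixed self.rw_hops.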

Thank you very much!