WSDM_DGL_Challenge

[WSDM Cup 2022: Temporal Link Prediction Task] | [Team: MegaTron]

[WSDM Cup Website link] | [Link to this challenge]

Environment

dgl-cu102==0.7.2
pytorch==1.7.0
sklearn
pandas
numpy
tqdm
...

GPU

Tesla V100 (32GB) * 1

Key paths

See the Tree section for the full directory tree of the project.

Usage

There is no need to download the dataset manually; just run the program.

Training is mini-batch on the GPU; inference is full-batch on the CPU.

Convert csv file to DGL graph objects.

python3 csv2DGLgraph.py --dataset A
python3 csv2DGLgraph.py --dataset B

Training.

cd scripts/
bash trainA.sh 
bash trainB.sh

Result.

# middle test
cd outputs/middle/
zip output_middle.zip output_A.csv output_B.csv
# final test
cd outputs/
zip output_final.zip output_A.csv output_B.csv

Result

Date	Method	middle test AUC of A	middle test AUC of B
2022.01.15	R-GAT (final submission)	0.494439	0.497759
2021.12.16	R-GAT (mid-term submission)	0.498004853	0.505898455

The model appears to have overfit the initial test set, so one of the runs was simply submitted as is.

Date	Method	Best initial test AUC of A	Best initial test AUC of B
2022.01.15	R-GAT (final submission)	0.6428	0.67784
2021.12.16	R-GAT (mid-term submission)	0.6357	0.61544
2021.12.08	R-GAT	0.62721	0.60426
2021.12.03	minibatch	0.6113	0.58478
2021.11.31	new time encoding	0.57364	0.57479
2021.11.29	data preprocessing	0.52814	0.53116
----	raw baseline	0.511	0.510

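Some explorations and notes (can be skipped)

Some issues

The model is sensitive to hyperparameters. It was fitted only against the initial test set, so it may not fit the final test set. For dataset B, results differ slightly between runs.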

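Data distribution of the initial test and middle test sets

Whether the two distributions match determines how much the initial test scores can be trusted as a reference, so explore.ipynb examines the etype, timestamp, src, and dst of every query.

A: the initial (9999 queries) and middle (50000 queries) distributions are basically the same.
B: the initial (5704 queries) and middle (50000 queries) distributions basically match on etype and timestamp. src and dst differ somewhat, but both are fairly evenly distributed (no node appears especially often).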
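
A comparison of this kind can be sketched by turning one query column into a normalized histogram per split and measuring how far apart the two histograms are. The snippet below is only an illustration: the total variation distance, the function names, and the toy columns are assumptions, not necessarily what explore.ipynb does.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def normalized_hist(values: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Return the relative frequency of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def total_variation(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Total variation distance between two empirical distributions.

    0.0 means identical histograms, 1.0 means no shared values.
    """
    p, q = normalized_hist(a), normalized_hist(b)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Toy etype columns standing in for the initial and middle query files.
initial_etypes = [1, 1, 2, 3]
middle_etypes = [1, 2, 2, 3]
print(total_variation(initial_etypes, middle_etypes))
```

The same function can be applied to the src, dst, or bucketed timestamp columns; a value near 0 suggests that the initial split is a usable proxy for the middle split on that column.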

Missing data in the training set

A: 76% of node_feat is missing!
B: 57% of edge_feat is missing!

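Data preprocessing

Time encoding:

A: the timestamp is a 10-digit decimal number. Each digit is mapped through its own nn.Embedding, and the results are concatenated from left to right to form time_emb.
B: the timestamp is a 10-digit decimal number whose digits decrease in priority from left to right. Column i is repeated 10-i times, and the results are concatenated from left to right to form time_emb.
Both variants drop the leading digit 1 of the raw timestamp (timestamps only begin with 16 by 2021, so the first digit never changes).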
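
The digit handling can be sketched in plain Python. This is an illustration under assumptions rather than the repository's code: the function names are hypothetical, and variant B is read as "the i-th remaining digit (1-based) is repeated 10 - i times", which yields a 45-element vector from the 9 digits left after the leading 1 is dropped.

```python
from typing import List


def timestamp_digits(ts: int) -> List[int]:
    """Split a 10-digit Unix timestamp into digits, dropping the leading 1."""
    text = str(ts)
    if len(text) != 10 or not text.isdigit():
        raise ValueError("expected a 10-digit timestamp, got %r" % ts)
    return [int(ch) for ch in text[1:]]


def repeat_by_priority(digits: List[int]) -> List[int]:
    """Variant B (assumed reading): earlier digits are repeated more often."""
    n = len(digits)
    encoded: List[int] = []
    for pos, digit in enumerate(digits):  # pos is 0-based, so i = pos + 1
        encoded.extend([digit] * (n - pos))  # n - pos == 10 - i when n == 9
    return encoded


digits = timestamp_digits(1420079360)
print(digits)
print(len(repeat_by_priority(digits)))
```

For variant A, the list returned by `timestamp_digits` would serve as the index input: each position looks up its own embedding table with 10 rows, and the resulting vectors are concatenated.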

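All missing values in node_feat are filled with max+1 (417), including rows that are missing from the csv entirely.
train: each of the 8 columns of ndata['feat'] is encoded separately into a learnable embedding, then the embeddings are combined with stack+sum.
g.edata['feat'] is broadcast manually so that it matches the edges one to one.
~~def emb_conccat() --> cat(src['emb'], edge_feat_emb, dst['emb'])~~

Missing values in g.edata['feat'] are filled with 0.
g.ndata['feat'] aggregates the edata['feat'] of the adjacent edges in the heterogeneous graph.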
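
The max+1 fill for node features can be sketched as follows. The function name, the input layout (a dict from node id to a row with `None` for missing cells), and the use of one shared fill value across all columns are assumptions for illustration; the source only states that the fill value is max+1 (417) and that nodes absent from the csv are filled as well.

```python
from typing import Dict, List, Optional


def fill_node_feat(
    rows: Dict[int, List[Optional[int]]], num_nodes: int, num_cols: int
) -> List[List[int]]:
    """Fill missing cells and missing rows with (largest observed value + 1)."""
    observed = [v for row in rows.values() for v in row if v is not None]
    fill = max(observed, default=-1) + 1  # one shared "missing" category id
    table = []
    for node_id in range(num_nodes):
        # A node that never appears in the csv gets a row of missing cells.
        row = rows.get(node_id, [None] * num_cols)
        table.append([fill if v is None else v for v in row])
    return table


# Node 1 is absent from the (toy) csv; node 0 has one missing cell.
toy_rows = {0: [3, None], 2: [416, 7]}
print(fill_node_feat(toy_rows, num_nodes=3, num_cols=2))
```

Using a dedicated id rather than 0 keeps "missing" distinguishable from a real category once the columns are passed through embedding tables.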

Heterogeneous graph construction

A:
edge: {('Node', 'e_type', 'Node'): (src, dst)}
edata['ts']: {('Node', 'e_type', 'Node'): (time)}
ndata['feat']: {'Node': some nodes have features; nodes without features are all zeros}
etype_feat: edge-type features, currently unused

B:
edge: {('User', 'e_type', 'Item'): (src, dst)}
edata['ts']: {('User', 'e_type', 'Item'): (time)}
edata['feat']: {('User', 'e_type', 'Item'): plain edge features}
etype_feat: None

# etype = ('User', '1', 'Item')
>>> g.edata['ts'][('User', '1', 'Item')].shape
torch.Size([29457])
>>> g.edata['feat'][('User', '1', 'Item')].shape
torch.Size([29457, 768])
>>> g.edges[('User', '1', 'Item')].data['feat'].shape
torch.Size([29457, 768])
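
A minimal sketch of how such a per-relation dictionary could be assembled from raw edge rows, using the User/Item layout shown above. The row layout `(src, dst, etype, ts)` and the function name are assumptions; csv2DGLgraph.py may do this differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

CanonicalEtype = Tuple[str, str, str]


def group_edges_by_etype(
    edges: Iterable[Tuple[int, int, int, int]],
    src_type: str = "User",
    dst_type: str = "Item",
) -> Tuple[Dict[CanonicalEtype, Tuple[List[int], List[int]]], Dict[CanonicalEtype, List[int]]]:
    """Group (src, dst, etype, ts) rows into per-relation edge lists and timestamps."""
    src_ids = defaultdict(list)
    dst_ids = defaultdict(list)
    ts_dict = defaultdict(list)
    for src, dst, etype, ts in edges:
        key = (src_type, str(etype), dst_type)  # canonical edge type
        src_ids[key].append(src)
        dst_ids[key].append(dst)
        ts_dict[key].append(ts)
    data_dict = {key: (src_ids[key], dst_ids[key]) for key in src_ids}
    return data_dict, dict(ts_dict)


toy_edges = [(0, 5, 1, 100), (1, 5, 1, 101), (0, 6, 2, 102)]
data_dict, ts_dict = group_edges_by_etype(toy_edges)
print(data_dict)
```

With DGL installed, a dictionary of this shape can be passed to `dgl.heterograph(data_dict)`, after which the timestamps can be attached per relation as tensors, for example through `g.edges[etype].data['ts']`.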

Heterogeneous graph GNN

A separate GNN operator is defined for each edge type:

{
   ('User', '1', 'Item'): dgl.nn.SAGEConv(...),
   ...
   ('User', '1_reversed', 'Item'): dgl.nn.SAGEConv(...),
}

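Time encoding:

~~The timestamp is a 10-digit decimal number; each digit is multiplied by 0.1 to form a 10-dimensional vector.~~

~~For example, the timestamp 1420079360 is encoded as the 10-dimensional vector [0.1, 0.4, 0.2, 0.0, 0.0, 0.7, 0.9, 0.3, 0.6, 0.0]~~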

Negative sampling of timestamps (random index) --> `t'`

t <= t', label = 1
t > t', label = 0
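
The labeling rule can be sketched as below. It is assumed here that `t'` is drawn by picking a random index into the same pool of edge timestamps; the function name and the seed parameter are hypothetical.

```python
import random
from typing import List, Tuple


def sample_time_labels(timestamps: List[int], seed: int = 0) -> List[Tuple[int, int, int]]:
    """For each edge timestamp t, draw t' by random index and label the pair."""
    rng = random.Random(seed)
    samples = []
    for t in timestamps:
        # Random index into the pool of observed timestamps gives t'.
        t_prime = timestamps[rng.randrange(len(timestamps))]
        label = 1 if t <= t_prime else 0
        samples.append((t, t_prime, label))
    return samples


for t, t_prime, label in sample_time_labels([10, 20, 30, 40], seed=1):
    print(t, t_prime, label)
```

Each edge therefore contributes a training triple whose label says whether the edge had already occurred by the sampled time.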

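The model estimates P(t <= t' | s, d, r): the probability that an edge of type r from source node s to destination node d exists before time t'. At inference, the target is the probability that an edge of type r from s to d exists between t_start and t_end: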

P(t_start <= t <= t_end | s, d, r) = P(t <= t_end | s, d, r) - P(t <= t_start | s, d, r)
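
The subtraction can be written as a small helper. This is a sketch only: `cdf` stands for any callable returning P(t <= t' | s, d, r), `toy_cdf` is a made-up stand-in for the trained model, and clipping the result to [0, 1] is an added safeguard (a learned model is not guaranteed to be monotone in t'), not something the source specifies.

```python
import math
from typing import Callable


def interval_probability(
    cdf: Callable[[int, int, int, float], float],
    s: int, d: int, r: int, t_start: float, t_end: float,
) -> float:
    """P(t_start <= t <= t_end | s, d, r) as a difference of two cumulative values."""
    p = cdf(s, d, r, t_end) - cdf(s, d, r, t_start)
    return min(1.0, max(0.0, p))  # assumption: clip in case the difference leaves [0, 1]


def toy_cdf(s: int, d: int, r: int, t: float) -> float:
    """Hypothetical model output: a logistic curve in t, ignoring s, d, r."""
    return 1.0 / (1.0 + math.exp(-(t - 100.0) / 10.0))


print(interval_probability(toy_cdf, 0, 1, 2, 90.0, 110.0))
```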

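Training procedure

From the heterogeneous graph structure and ndata['feat'], the HGNN is trained to produce the node embeddings node_emb.
For each edge, edge_emb = cat([src_node_emb, dst_node_emb]).
Positive/negative sampling yields the timestamp and label of each positive and negative sample.
Time encoding yields time_emb.
For each edge, cat([edge_emb, time_emb]) is passed through a Linear layer to obtain the probability of occurrence, probs.
BCEWithLogitsLoss() + backward() update the parameters.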
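
The last two steps can be traced for a single sample in plain Python. This is a worked illustration, not the code in model.py: the embeddings, weights, and bias are made-up numbers, and the function names are hypothetical. Because the loss is BCEWithLogitsLoss, the Linear output is treated as a raw score (logit), and the sigmoid of that score is the probability.

```python
import math
from typing import List


def score_edge(
    src_emb: List[float], dst_emb: List[float], time_emb: List[float],
    weight: List[float], bias: float,
) -> float:
    """Concatenate the embeddings and apply one linear layer with a scalar output."""
    edge_emb = src_emb + dst_emb    # list concatenation plays the role of cat([src, dst])
    features = edge_emb + time_emb  # cat([edge_emb, time_emb])
    if len(features) != len(weight):
        raise ValueError("weight length must match the concatenated feature length")
    return sum(w * x for w, x in zip(weight, features)) + bias


def bce_with_logits(logit: float, label: float) -> float:
    """Numerically stable binary cross-entropy on a raw score, for one sample."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


logit = score_edge([1.0, 0.0], [0.0, 1.0], [0.5],
                   weight=[1.0, 2.0, 3.0, 4.0, 2.0], bias=-4.0)
prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid turns the score into a probability
print(logit, prob, bce_with_logits(logit, 1.0))
```

In the actual pipeline these operations run on batched tensors, and backward() propagates the loss through the Linear layer, the time encoding, and the HGNN.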

Some details

In the baseline, A does not use etype_feat and B does not use edata['feat'].

Tree

.
├── 12.13.png
├── csv2DGLgraph.py
├── data
│   ├── DGLgraphs
│   │   └── Dataset_A.bin
│   ├── test_csvs
│   │   ├── input_A_initial.csv
│   │   └── input_B_initial.csv
│   └── train_csvs
│       ├── edges_train_A.csv
│       ├── edges_train_B.csv
│       ├── edge_type_features.csv
│       └── node_features.csv
├── DGLgraphs
│   ├── Dataset_A.bin
│   └── Dataset_B.bin
├── explore.ipynb
├── LICENSE
├── main.py
├── model.py
├── outputs
│   ├── a.log
│   ├── best_auc_A.pkl
│   ├── best_auc_B.pkl
│   ├── b.log
│   ├── middle
│   │   ├── output_A.csv
│   │   ├── output_B.csv
│   │   └── output_middle.zip
│   ├── output_A.csv
│   ├── output_B.csv
│   ├── output_middle.zip
│   └── output.zip
├── README 2.md
├── README.md
├── scripts
│   ├── trainA.sh
│   └── trainB.sh
├── test_csvs
│   ├── input_A.csv
│   ├── input_A_initial.csv
│   ├── input_A_middle.csv
│   ├── input_B.csv
│   ├── input_B_initial.csv
│   └── input_B_middle.csv
├── train_csvs
│   ├── edges_train_A.csv
│   ├── edges_train_B.csv
│   ├── edge_type_features.csv
│   └── node_features.csv
└── tt.ipynb
