Datathon2022
Classification accuracy on both train and val, and the trade-off between true positive and true negative rates, are shown in result.png.
- Data preparation: node_preprocessing.py
- Runnable training process: vis.ipynb or dgl_test.py
- EDA: eda.ipynb
- Trained weights: model_final.pth
- Dataset
  - Nodes (numeric ID): web pages (22470 linked, 1655 isolated)
  - Edges: an edge exists if two pages link to each other (132039 edges)
  - Page’s text description
  - Page type (label)
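The linked/isolated split above can be reproduced from an edge list. The following is a minimal sketch, not the code in node_preprocessing.py: the function name `split_linked_isolated` is hypothetical, node IDs are assumed to be integers from 0 to `num_nodes - 1`, and treating a self-loop as "not linked" is an assumption.

```python
def split_linked_isolated(num_nodes, edges):
    """Return (linked, isolated) sets of node IDs for an undirected edge list."""
    linked = set()
    for u, v in edges:
        if u == v:
            # Assumption: a page that only links to itself is still isolated.
            continue
        linked.add(u)
        linked.add(v)
    isolated = set(range(num_nodes)) - linked
    return linked, isolated


# Toy graph with 6 pages; node 4 only has a self-loop, nodes 3 and 5 have no edges.
toy_edges = [(0, 1), (1, 2), (4, 4)]
linked, isolated = split_linked_isolated(6, toy_edges)
print(sorted(linked), sorted(isolated))
```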
- Node features
  - Labels: provided, 4 types
  - Embedding text one-hot vectors
  - Use Doc2Vec; choose the output feature dimension based on the raw sentence length
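One way to combine these into a single node feature vector is to one-hot encode the 4-class label and concatenate it with the text embedding. This is a sketch under assumptions: the helper names `one_hot` and `node_feature` are hypothetical, labels are assumed to be integers 0 to 3, and the short `text_vec` stands in for a real Doc2Vec output.

```python
def one_hot(label, num_classes=4):
    """Encode an integer class label as a one-hot list of floats."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} outside [0, {num_classes})")
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


def node_feature(label, text_vec, num_classes=4):
    """Concatenate the one-hot label with the (already embedded) text vector."""
    return one_hot(label, num_classes) + list(text_vec)


# Placeholder embedding; in practice this would come from a trained Doc2Vec model.
text_vec = [0.12, -0.40, 0.07]
feat = node_feature(2, text_vec)
print(len(feat), feat)
```

The resulting dimension is the number of classes plus the text embedding size, so a longer text embedding directly widens the model input.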
- Problem abstraction: link prediction in a graph
  - Nodes: pages
  - Edges: connectivity of pages
  - Node features: label + (embedded) text
- Future improvement: the model is small, so there is room to increase complexity
  - Deeper GraphSAGE
  - Higher number of channels
  - Longer text embedding
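The link-prediction formulation can be illustrated with a small scoring sketch. It assumes that each node already has an embedding (for example, the output of a GraphSAGE encoder), that a candidate edge is scored by the sigmoid of the dot product of its endpoint embeddings, and that a threshold turns scores into link / no-link decisions. The function names, toy embeddings, and thresholds are all hypothetical and are not taken from dgl_test.py.

```python
import math


def link_score(h_u, h_v):
    """Score a candidate edge as sigmoid(h_u . h_v)."""
    dot = sum(a * b for a, b in zip(h_u, h_v))
    return 1.0 / (1.0 + math.exp(-dot))


def tpr_tnr(pos_scores, neg_scores, threshold):
    """True positive rate on real edges, true negative rate on sampled non-edges."""
    tp = sum(s >= threshold for s in pos_scores)
    tn = sum(s < threshold for s in neg_scores)
    return tp / len(pos_scores), tn / len(neg_scores)


# Toy embeddings standing in for encoder outputs.
emb = {0: [1.0, 0.5], 1: [0.8, 0.7], 2: [-1.0, 0.2], 3: [0.9, -0.1]}
pos_edges = [(0, 1), (0, 3)]  # edges that exist in the graph
neg_edges = [(0, 2), (1, 2)]  # sampled node pairs with no edge

pos = [link_score(emb[u], emb[v]) for u, v in pos_edges]
neg = [link_score(emb[u], emb[v]) for u, v in neg_edges]

# Sweeping the threshold exposes the TPR/TNR trade-off.
for t in (0.3, 0.5, 0.72):
    print(t, tpr_tnr(pos, neg, t))
```

Raising the threshold rejects more non-edges (higher true negative rate) at the cost of missing more real edges (lower true positive rate).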