weboftruth is a project that uses deep representation learning to learn fact embeddings, with applications in social science and disinformation research.
Formally, a Knowledge Graph consists of a set of facts, where each fact is a relation edge connecting a head entity and a tail entity. Existing packages like `torchkge` let you learn vector representations for entities and relationships given their context in the training data.
weboftruth adds value by:
- letting you easily corrupt your training data to examine the effects on the resulting embeddings
- evaluating a 'truth prediction task' on unseen facts
This repo contains Python and bash scripts for training Knowledge Graph embeddings on head-relation-tail triples. Two embedding spaces are created: one for entities (subjects/objects) and one for relationships (verbs). Extensive use is made of the package `torchkge`, which implements KGE algorithms like TransE and is built on PyTorch.
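For intuition, TransE scores a triple by how closely the head embedding, translated by the relation embedding, lands on the tail embedding. A minimal sketch with hand-picked toy vectors (this illustrates the scoring idea only, not the torchkge API):

```python
import numpy as np

def transe_score(e_h, e_r, e_t):
    """TransE scores a triple by the negative L2 distance ||e_h + e_r - e_t||.
    A perfectly-fitting fact scores 0; worse fits score more negative."""
    return -np.linalg.norm(e_h + e_r - e_t)

# Toy 2-d embeddings chosen so that head + relation == tail exactly.
e_india = np.array([1.0, 0.0])
e_locatedIn = np.array([0.0, 1.0])
e_asia = np.array([1.0, 1.0])

good = transe_score(e_india, e_locatedIn, e_asia)   # == 0.0, perfect fit
bad = transe_score(e_india, e_locatedIn, e_india)   # negative, wrong tail
```

Training nudges the embeddings so that true triples score near zero while corrupted ones score lower, which is what makes the learned vectors useful as features later.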
- Clone `datasets-knowledge-embedding`, a useful set of standard Knowledge Graph datasets compiled by GitHub user simonepri (many thanks), OR
- If using your own dataset, organize it as follows:
  - Ensure your Knowledge Graph dataset has a finite set of discrete entities and a finite set of discrete relationships, stored as a Tab-Separated-Value (TSV) file.
  - Format: `head\trelation\ttail`
  - Example line: `India\tlocatedIn\tAsia`
  - DO NOT include a header line
  - Create a folder `{dataset_name}` at a location `{datapath}`
  - Split your KG into `train`, `test`, and `validation` sets
  - Write them to `{datapath}/{dataset_name}/edges_as_text_train.tsv`, `edges_as_text_test.tsv`, and `edges_as_text_valid.tsv` respectively
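The splitting step above can be sketched as follows. The file and folder names follow the convention above; the 80/10/10 split ratio is an assumption, not something the project mandates:

```python
import os
import random
import tempfile

def split_and_write(triples, datapath, dataset_name, seed=0):
    """Shuffle (head, relation, tail) triples and write 80/10/10
    train/test/validation splits in the expected TSV layout."""
    random.Random(seed).shuffle(triples)
    n = len(triples)
    n_train, n_test = int(0.8 * n), int(0.1 * n)
    splits = {
        "edges_as_text_train.tsv": triples[:n_train],
        "edges_as_text_test.tsv": triples[n_train:n_train + n_test],
        "edges_as_text_valid.tsv": triples[n_train + n_test:],
    }
    outdir = os.path.join(datapath, dataset_name)
    os.makedirs(outdir, exist_ok=True)
    for fname, rows in splits.items():
        with open(os.path.join(outdir, fname), "w") as f:
            for h, r, t in rows:
                f.write(f"{h}\t{r}\t{t}\n")  # no header line

# Tiny demo on ten synthetic triples written to a temp directory.
demo = [(f"e{i}", "rel", f"e{i+1}") for i in range(10)]
counts = {}
with tempfile.TemporaryDirectory() as dp:
    split_and_write(demo, dp, "DEMO")
    for fname in ("edges_as_text_train.tsv",
                  "edges_as_text_test.tsv",
                  "edges_as_text_valid.tsv"):
        with open(os.path.join(dp, "DEMO", fname)) as f:
            counts[fname] = len(f.readlines())
```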
- Run a command such as the one below, customizing as required:

```
python ./weboftruth/weboftruth/wotmodels.py \
    -e 200 \
    -m 'TransE' \
    -lr 0.00005 \
    -dp ./datasets-knowledge-embedding \
    -ds 'KINSHIP' \
    -mp ./weboftruth/models \
    -ts 80
```
Flag meanings:
- `-e`: number of epochs
- `-m`: model name, as 'TransE', 'DistMult', or any of the others provided by torchkge
- `-lr`: learning rate
- `-dp`: datapath, path to the directory containing your datasets
- `-ds`: dataset name; this should be a subdirectory of `dp`
- `-mp`: modelpath, path to the directory containing your saved models
- `-ts`: truth-share, a parameter which causes `(100-ts)`% of the training set to be corrupted before learning begins
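To make the truth-share concrete: with `ts=80`, 20% of training triples are falsified before training starts. The project's actual corruption logic lives in `corrupt.py`; the tail-replacement strategy and function name below are assumptions for illustration:

```python
import random

def corrupt_triples(triples, truth_share, seed=0):
    """Return a copy of the triples where (100 - truth_share)% have their
    tail replaced by a random entity, plus a parallel true/false label list.
    (Hypothetical sketch; weboftruth's real logic is in corrupt.py.)"""
    rng = random.Random(seed)
    entities = [h for h, _, _ in triples] + [t for _, _, t in triples]
    n_corrupt = round(len(triples) * (100 - truth_share) / 100)
    corrupt_idx = set(rng.sample(range(len(triples)), n_corrupt))
    out, labels = [], []
    for i, (h, r, t) in enumerate(triples):
        if i in corrupt_idx:
            out.append((h, r, rng.choice(entities)))  # falsified tail
            labels.append(False)
        else:
            out.append((h, r, t))  # left intact
            labels.append(True)
    return out, labels

# With truth_share=80, exactly 2 of these 10 triples are corrupted.
demo = [(f"e{i}", "rel", f"e{i+1}") for i in range(10)]
corrupted, labels = corrupt_triples(demo, truth_share=80)
```

Keeping the labels alongside the corrupted set is what lets the downstream truth-prediction task be evaluated in a supervised way.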
- We trained Entity and Relationship embeddings for a cleaned dataset of SVO triples constructed from Wikipedia sentences.
- We trained three configurations - embeddings of dimension 50, 100 and 200.
- A binary classifier (logistic regression) on concatenated SVO embeddings recorded 79% accuracy at telling true triples apart from false (negatively sampled) ones.
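The truth-prediction setup can be sketched as: concatenate the learned subject, verb, and object embeddings into one feature vector, then fit a binary classifier on true-vs-false labels. A minimal numpy logistic-regression sketch on stand-in random embeddings (the 79% figure above comes from the real SVO embeddings, not this toy):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50  # per-embedding dimension, as in the smallest trained configuration

# Stand-ins for learned embeddings: each row is [e_s | e_v | e_o].
X = rng.normal(size=(200, 3 * dim))
# Toy labels made linearly dependent on the features, so they are learnable.
w_true = rng.normal(size=3 * dim)
y = (X @ w_true > 0).astype(float)

# Plain logistic regression fit by gradient descent.
w = np.zeros(3 * dim)
for _ in range(500):
    z = np.clip(X @ w, -30, 30)          # clip to avoid exp overflow
    p = 1.0 / (1.0 + np.exp(-z))         # predicted probability of "true"
    w -= 0.1 * X.T @ (p - y) / len(y)    # gradient step on log-loss

accuracy = float(((X @ w > 0) == (y > 0)).mean())
```

In the real pipeline the features come from the trained entity and relationship embedding tables rather than random vectors, but the classifier side is this simple.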
This was part of the coursework for CAPP 30255 at the University of Chicago, by Aabir Abubaker Kar and Adarsh Mathew.
- `data`: data sources used, specifically the SVO dataset of subject-verb-object triples from Wikipedia, and our constructed 'partially true' datasets
- `logs`: execution logs for RCC and AWS runs
- `notebooks`: Jupyter notebooks used for early prototyping
- `shscripts`: shell scripts used for RCC
- `weboftruth`: Python code organized into a package; borrows heavily from torch and torchkge
  - `utils.py`: miscellaneous useful functions for loading models, data, etc.
  - `corrupt.py`: code for generating the partially true datasets and writing them to disk
  - `svofunctions.py`: tools to translate triples to indices and vice versa
  - `wotmodels.py`: tools to build and train models from torchkge
  - `evaluator.py`: functions to evaluate the performance of embeddings as features in truth prediction