A graph convolutional neural network transformer model to predict HOMO-LUMO gap for the OGB Large Scale Challenge
The project natively supports distributed training on single- or multi-GPU machines, with simultaneous experiment logging via wandb.
Install the base requirements by running
pip install -r requirements.txt
The project also needs PyTorch (preferably the GPU version). To find the version that suits your system, visit the official PyTorch website.
Furthermore, this project supports multi-GPU, multi-machine training by means of the accelerate library. To set it up, install accelerate with pip and then run
accelerate config
This prompts a series of questions used to set up multi-GPU training. It is important to answer the question 'How many processes in total will you use?' with the number of GPUs that will be used during training; otherwise the library fails to configure the environment properly for multi-GPU training (this holds for accelerate 0.6.2; other versions may behave differently).
As shown on the website of the OGB challenge, the dataset can be downloaded automatically by means of the ogb library. By default, this code downloads the data to, or looks for it at, the path given by the --root_dir command-line argument.
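As a rough illustration of what the automatic download looks like, the sketch below wraps the ogb loader in a small function. It assumes the PCQM4Mv2 PyTorch Geometric wrapper (`PygPCQM4Mv2Dataset`); the class actually used by train.py may differ. The helper names `resolve_root_dir` and `load_pcqm4m` are hypothetical and not part of this project. Calling `load_pcqm4m` requires ogb, torch_geometric and rdkit to be installed.

```python
from pathlib import Path


def resolve_root_dir(root_dir: str) -> Path:
    """Expand ~ and return an absolute path for the dataset root."""
    return Path(root_dir).expanduser().resolve()


def load_pcqm4m(root_dir: str):
    """Download the dataset on first use, or reuse the copy found under root_dir."""
    # Imported lazily so that this module can be imported without ogb installed.
    # Assumption: the PCQM4Mv2 PyG wrapper; the project may use another class.
    from ogb.lsc import PygPCQM4Mv2Dataset

    root = resolve_root_dir(root_dir)
    dataset = PygPCQM4Mv2Dataset(root=str(root))
    split_idx = dataset.get_idx_split()  # dict of index tensors, one per split
    return dataset, split_idx


# Example call (triggers the download the first time it runs):
# dataset, split_idx = load_pcqm4m("./dataset")
```

The first call downloads and preprocesses the data under the given directory; later calls with the same directory reuse the processed files.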
To reap the benefits of distributed training, run the following after accelerate config has completed:
accelerate launch train.py command_line_args
All command-line arguments for train.py are passed as usual in place of the command_line_args string, e.g.
accelerate launch train.py --num_epochs 200 --n_heads 3 --node_emb 128
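When launching many configurations, it can help to assemble this command programmatically. The following is a minimal sketch, not part of the project: `build_launch_command` is a hypothetical helper, and it assumes that on/off options are plain flags that take no value.

```python
import shlex


def build_launch_command(script="train.py", **options):
    """Build the argument list for `accelerate launch <script> ...`."""
    cmd = ["accelerate", "launch", script]
    for name, value in options.items():
        flag = f"--{name}"
        if isinstance(value, bool):
            # Assumption: boolean options are flags that are present or absent.
            if value:
                cmd.append(flag)
        else:
            cmd.extend([flag, str(value)])
    return cmd


cmd = build_launch_command(num_epochs=200, n_heads=3, node_emb=128)
print(shlex.join(cmd))
# accelerate launch train.py --num_epochs 200 --n_heads 3 --node_emb 128
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)` to start a run.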
The following parameters can be set to run model training on a variety of model configurations:
hidden_channels: input dimension of the Q, K, V matrices of each TransformerConv module. The output size is set equal to hidden_channels for simplicity, although each Q, K, V output is of size n_heads*hidden_channels as per the layer definition.
node_emb: dimension of the embedding of node features
edge_emb: dimension of the embedding of edge features
n_heads: number of heads of the TransformerConv module employed in the model architecture. It influences the output size of each Q, K, V matrix (e.g. n_heads=2 doubles the output size of each of those matrices).
num_epochs: number of full iterations to perform over the whole dataset
root_dir: path to where the dataset is located (or the place to which it will be automatically downloaded)
batch_size: number of graphs to batch together
criterion: MAE or MSE, the objective function used to optimize the model parameters
lr: learning rate
beta1: first-order coefficient that weighs the history of past gradients
beta2: second-order coefficient that weighs the history of past gradients
num_workers: number of processes to instantiate for multi-process data loading
fp16: whether to enable fp16 training, which is faster but less accurate
mixed_precision: balanced tradeoff between fp16-only and fp32-only training
cpu: whether to run on CPU (by default, training runs as configured by the call to accelerate config)
wandb: whether to store and automatically visualize training logs on wandb
wandb_entity: the account used to store the wandb logs if wandb is specified
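To make the option list concrete, here is a hypothetical argparse sketch that mirrors the names above. It is not the parser from train.py: the types are inferred from the descriptions, defaults are deliberately left unset, and treating fp16, mixed_precision, cpu and wandb as boolean flags is an assumption. The last lines work through the Q, K, V sizing rule from the hidden_channels entry.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options (no defaults set)."""
    p = argparse.ArgumentParser(description="Sketch of the train.py options")
    # Model size
    for name in ("hidden_channels", "node_emb", "edge_emb", "n_heads"):
        p.add_argument(f"--{name}", type=int)
    # Training loop and data
    for name in ("num_epochs", "batch_size", "num_workers"):
        p.add_argument(f"--{name}", type=int)
    p.add_argument("--root_dir", type=str)
    p.add_argument("--criterion", choices=["MAE", "MSE"])
    # Optimizer
    for name in ("lr", "beta1", "beta2"):
        p.add_argument(f"--{name}", type=float)
    # Switches (assumed to be boolean flags)
    for name in ("fp16", "mixed_precision", "cpu", "wandb"):
        p.add_argument(f"--{name}", action="store_true")
    p.add_argument("--wandb_entity", type=str)
    return p


args = build_parser().parse_args(
    ["--hidden_channels", "64", "--n_heads", "3", "--criterion", "MAE", "--wandb"]
)
# Width of each Q, K, V projection: n_heads * hidden_channels
qkv_width = args.n_heads * args.hidden_channels
print(qkv_width)  # 192
```

With hidden_channels=64, going from one head to three heads triples each projection from 64 to 192 columns, which is the growth described in the n_heads entry.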