kj2431 / NGFP

PyTorch-based Neural Graph Fingerprint for Organic Molecule Representations

Convolutional Neural Graph Fingerprint

This repository is a PyTorch implementation of the paper "Convolutional Networks on Graphs for Learning Molecular Fingerprints".

It includes a preprocessing function that converts molecules given as SMILES strings into molecule tensors.

Related work

There are several implementations of this paper publicly available:

The closest implementation is the Keras implementation by GUR9000 and keiserlab. However, this repository represents molecules in a fundamentally different way. The consequences are described in the sections below.

Molecule Representation

Atom, bond and edge tensors

This codebase uses tensors to represent molecules. Each molecule is described by a combination of the following three tensors:

  • atom matrix, size: (max_atoms, num_atom_features) This matrix defines the atom features.

    Each row in the atom matrix is the feature vector for the atom at the index of that row.

  • edge matrix, size: (max_atoms, max_degree) This matrix defines the connectivity between atoms.

    Each row in the edge matrix lists the neighbours of one atom. The neighbours are encoded by integers giving the index of their feature vector in the atom matrix.

    As atoms can have a variable number of neighbours, not every entry in a row will have a neighbour index defined. These entries are filled with the masking value -1. (This explicit edge matrix masking value is important for the layers to work.)

  • bond tensor, size: (max_atoms, max_degree, num_bond_features) This tensor defines the bond features.

    The first two dimensions of this tensor mirror the edge matrix: the feature vector at position (i, j) of the bond tensor describes the bond between atom i and the neighbour listed at position (i, j) of the edge matrix.

    Bonds that are unused are masked with 0 vectors.
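As an illustration of the layout described above, the following is a minimal NumPy sketch that builds the three tensors for a toy molecule from a hand-written bond list. It is not the repository's preprocessing function: the helper name build_tensors, the sizes MAX_ATOMS and MAX_DEGREE, and the two-entry one-hot atom and bond features are all assumptions made for the example.

```python
import numpy as np

MAX_ATOMS = 5    # assumed padding size for this toy example
MAX_DEGREE = 4   # assumed maximum number of neighbours per atom
ELEMENTS = ["C", "O"]              # hypothetical atom feature: one-hot element
BOND_TYPES = ["single", "double"]  # hypothetical bond feature: one-hot bond order


def build_tensors(symbols, bonds):
    """Return (atoms, edges, bond_feats) for one molecule.

    symbols: list of element symbols, one per atom
    bonds:   list of (i, j, bond_type) tuples with atom indices i and j
    """
    atoms = np.zeros((MAX_ATOMS, len(ELEMENTS)), dtype=np.float32)
    edges = np.full((MAX_ATOMS, MAX_DEGREE), -1, dtype=np.int64)  # -1 = no neighbour
    bond_feats = np.zeros((MAX_ATOMS, MAX_DEGREE, len(BOND_TYPES)), dtype=np.float32)

    # One row of atom features per atom; unused rows stay all-zero.
    for i, symbol in enumerate(symbols):
        atoms[i, ELEMENTS.index(symbol)] = 1.0

    degree = [0] * len(symbols)  # next free neighbour slot for each atom
    for i, j, bond_type in bonds:
        for a, b in ((i, j), (j, i)):  # store each bond in both directions
            slot = degree[a]
            edges[a, slot] = b
            bond_feats[a, slot, BOND_TYPES.index(bond_type)] = 1.0
            degree[a] += 1
    return atoms, edges, bond_feats


# Heavy atoms of acetaldehyde: C0 - C1 = O2
atoms, edges, bond_feats = build_tensors(
    ["C", "C", "O"],
    [(0, 1, "single"), (1, 2, "double")],
)
print(edges)
```

In this sketch, row 1 of the edge matrix lists atoms 0 and 2 followed by -1 entries, and the bond features at position (1, 1) describe the double bond to the oxygen.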

Batch representations

This code deals with molecules in batches. An extra dimension is added as the first axis of each of the three tensors. Their respective sizes become:

  • atom matrix, size: (num_molecules, max_atoms, num_atom_features)
  • edge matrix, size: (num_molecules, max_atoms, max_degree)
  • bond tensor, size: (num_molecules, max_atoms, max_degree, num_bond_features)

As molecules have different numbers of atoms, max_atoms needs to be defined for the entire dataset. Unused atom rows are masked by 0 vectors.
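The padding scheme and the -1 mask can be sketched together in NumPy as follows. This is one possible way to use the mask, not necessarily how the layers in this repository work: the helpers pad_rows and sum_neighbour_features are hypothetical names, and the feature values are made up. The idea is to prepend an all-zero row to every atom matrix and shift all indices by one, so that a masked entry of -1 looks up the zero row and contributes nothing to the sum.

```python
import numpy as np


def pad_rows(array, max_atoms, fill):
    """Pad the first axis of `array` to `max_atoms` rows using `fill`."""
    padded = np.full((max_atoms,) + array.shape[1:], fill, dtype=array.dtype)
    padded[: array.shape[0]] = array
    return padded


def sum_neighbour_features(atoms, edges):
    """Sum the feature vectors of each atom's neighbours.

    atoms: (num_molecules, max_atoms, num_atom_features)
    edges: (num_molecules, max_atoms, max_degree), masked with -1
    Returns an array with the same shape as `atoms`.
    """
    batch, _, num_features = atoms.shape
    # Prepend a zero row so that index -1 maps to row 0 after shifting by one.
    zero_row = np.zeros((batch, 1, num_features), dtype=atoms.dtype)
    padded = np.concatenate([zero_row, atoms], axis=1)
    shifted = edges + 1
    mol_index = np.arange(batch)[:, None, None]  # broadcasts over atoms and slots
    neighbours = padded[mol_index, shifted]      # (batch, atoms, degree, features)
    return neighbours.sum(axis=2)


# Molecule A: chain 0 - 1 - 2, molecule B: single bond 0 - 1 (made-up features)
mol_a_atoms = np.array([[1, 0], [0, 1], [1, 1]], dtype=np.float32)
mol_a_edges = np.array([[1, -1], [0, 2], [1, -1]])
mol_b_atoms = np.array([[2, 0], [0, 3]], dtype=np.float32)
mol_b_edges = np.array([[1, -1], [0, -1]])

max_atoms = 3  # largest molecule in this toy dataset
atoms = np.stack([pad_rows(m, max_atoms, 0.0) for m in (mol_a_atoms, mol_b_atoms)])
edges = np.stack([pad_rows(m, max_atoms, -1) for m in (mol_a_edges, mol_b_edges)])

summed = sum_neighbour_features(atoms, edges)
print(summed.shape)
```

The padded third atom of molecule B has only -1 entries in its edge row, so its neighbour sum stays a zero vector.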

Dependencies

  • RDKit: needed only to convert molecules into tensor representations. Once this step is done, the resulting tensors can be stored, and RDKit is no longer a dependency.
  • PyTorch: requires PyTorch >= 1.0
  • NumPy: requires NumPy >= 0.19
  • Pandas: optional, used in the examples
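Storing the converted tensors so that RDKit is not needed afterwards could look like the sketch below. The file name, the array names and the placeholder shapes are assumptions for illustration; the repository may store its data differently.

```python
import os
import tempfile

import numpy as np

# Placeholder tensors standing in for the output of the preprocessing step.
atoms = np.zeros((2, 3, 4), dtype=np.float32)     # (num_molecules, max_atoms, num_atom_features)
edges = np.full((2, 3, 2), -1, dtype=np.int64)    # (num_molecules, max_atoms, max_degree)
bonds = np.zeros((2, 3, 2, 5), dtype=np.float32)  # (..., max_degree, num_bond_features)

# Write all three tensors into a single compressed archive.
path = os.path.join(tempfile.mkdtemp(), "molecule_tensors.npz")
np.savez_compressed(path, atoms=atoms, edges=edges, bonds=bonds)

# Later, possibly on a machine without RDKit: load the tensors back.
with np.load(path) as data:
    loaded = {name: data[name] for name in ("atoms", "edges", "bonds")}

print({name: array.shape for name, array in loaded.items()})
```

The loaded arrays can then be handed to PyTorch, for example with torch.from_numpy.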

Acknowledgements

License: MIT License