An NLP-inspired chemical reaction fingerprint based on basic set arithmetic.
Read the associated open access article
Predicting the nature and outcome of reactions using computational methods is an important tool to accelerate chemical research. The recent application of deep learning-based learned fingerprints to reaction classification and reaction yield prediction has shown an impressive increase in performance compared to previous methods such as DFT- and structure-based fingerprints. However, learned fingerprints require large training data sets, are inherently biased, and are based on complex deep learning architectures. Here we present the differential reaction fingerprint DRFP. The DRFP algorithm takes a reaction SMILES as an input and creates a binary fingerprint based on the symmetric difference of two sets containing the circular molecular n-grams generated from the molecules listed left and right from the reaction arrow, respectively, without the need for distinguishing between reactants and reagents. We show that DRFP outperforms DFT-based fingerprints in reaction yield prediction and other structure-based fingerprints in reaction classification, and reaching the performance of state-of-the-art learned fingerprints in both tasks while being data-independent.
The best way to start exploring DRFP is on binder. A notebook that gets you started on creating and using DRFP:
A notbook that explains how you can use SHAP to analyse and interpret your machine learning models when using DRFP:
DRFP can be installed from pypi using pip install drfp
. However, it depends on RDKit which is best installed using conda.
Once DRFP is installed, there are two ways you can use it. You can use the cli app drfp
or the library provided by the package.
drfp my_rxn_smiles.txt my_rxn_fps.pkl -d 512
This will create a pickle dump containing an numpy ndarray containing DRFP fingerprints with a dimensionality of 512. To also export the mapping, use the flag --mapping
. This will create the additional file my_rxn_fps.map.pkl
. You can call drfp --help
to show all available flags and options.
Following is a basic exmple of how to use DRFP in a Python script.
from drfp import DrfpEncoder
rxn_smiles = [
"CO.O[C@@H]1CCNC1.[C-]#[N+]CC(=O)OC>>[C-]#[N+]CC(=O)N1CC[C@@H](O)C1",
"CCOC(=O)C(CC)c1cccnc1.Cl.O>>CCC(C(=O)O)c1cccnc1",
]
fps = DrfpEncoder.encode(rxn_smiles)
The variable fps
now points to a list containing the fingerprints for the two reaction SMILES as numpy arrays.
The library contains the class DrfpEncoder
with one public method encode
.
DrfpEncoder.encode() |
Description | Type | Default |
---|---|---|---|
X |
An iterable (e.g. a list) of reaction SMILES or a single reaction SMILES to be encoded | Iterable or str |
|
n_folded_length |
The folded length of the fingerprint (the parameter for the modulo hashing) | int |
2048 |
min_radius |
The minimum radius of a substructure (0 includes single atoms) | int |
0 |
radius |
The maximum radius of a substructure | int |
3 |
rings |
Whether to include full rings as substructures | bool |
True |
mapping |
Return a feature to substructure mapping in addition to the fingerprints. If true, the return signature of this method is Tuple[List[np.ndarray], Dict[int, Set[str]]] |
bool |
False |
atom_index_mapping |
Return the atom indices of mapped substructures for each reaction | bool |
False |
root_central_atom |
Whether to root the central atom of substructures when generating SMILES | bool |
True |
include_hydrogens |
Whether to explicitly include hydrogens in the molecular graph | bool |
False |
show_progress_bar |
Whether to show a progress bar when encoding reactions | bool |
False |
Want to reproduce the results in our paper? You can find all the data in the data
folder and encoding and training scripts in the scripts
folder.
@article{probst2022reaction,
title={Reaction Classification and Yield Prediction using the Differential Reaction Fingerprint DRFP},
author={Probst, Daniel and Schwaller, Philippe and Reymond, Jean-Louis},
journal={Digital Discovery},
year={2022},
publisher={Royal Society of Chemistry}
}