MOFDiff: Coarse-grained Diffusion for Metal-Organic Framework Design

MOFDiff is a diffusion model for generating coarse-grained (CG) MOF structures. This codebase also contains the code for deconstructing and reconstructing all-atom MOF structures, which is used to train MOFDiff and to assemble the CG structures it generates.

paper | data and pretrained models

If you find this code useful, please consider referencing our paper:

@inproceedings{
fu2024mofdiff,
title={{MOFD}iff: Coarse-grained Diffusion for Metal-Organic Framework Design},
author={Xiang Fu and Tian Xie and Andrew Scott Rosen and Tommi S. Jaakkola and Jake Allen Smith},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=0VBsoluxR2}
}

Installation
Process data
Training
Generating MOF structures
Assemble all-atom MOFs
Relax MOFs
GCMC simulations
Responsible AI FAQ
Contributing
Acknowledgement

Installation

We recommend using mamba rather than conda to install the dependencies, as it is faster. First install mamba following the instructions in the mamba repository. (Note: a requirements.txt mirror of env.yml is provided for compatibility with CI/CD; however, we do not recommend building the environment with pip.)

Install dependencies via mamba:

mamba env create -f env.yml

Then install mofdiff as a package:

pip install -e .

We use MOFid for preprocessing and analysis. To perform these steps, install MOFid following the instructions in the MOFid repository. The generative modeling and MOF simulation portions of this codebase do not depend on MOFid.

Configure the .env file to set correct paths to various directories, dependent on the desired functionality. An example .env file is provided in the repository.

For model training, please set the learning-related paths.

PROJECT_ROOT: the parent MOFDiff directory
DATASET_DIR: the directory containing the .lmdb file produced by processing the data
LOG_DIR: the directory to which logs will be written
HYDRA_JOBS: the directory to which Hydra output will be written
WANDB_DIR: the directory to which WandB output will be written

For MOF relaxation and structural property calculations, please additionally set the Zeo++ path.

ZEO_PATH: path to the Zeo++ "network" binary

For GCMC simulations, please additionally set the GCMC-related paths.

RASPA_PATH: the RASPA2 parent directory
RASPA_SIM_PATH: path to the RASPA2 "simulate" binary
EGULP_PATH: path to the eGULP "egulp" binary
EGULP_PARAMETER_PATH: the directory containing the eGULP "MEPO.param" file
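
As a minimal sketch of how these variable groups might be checked before launching a job, the snippet below reports which variables are unset or empty. The group names (training, relaxation, gcmc) and the helper missing_env_vars are illustrative and not part of MOFDiff; the sketch assumes the values from .env have already been loaded into the process environment.

```python
import os

# Variables as listed in this README; the group names are illustrative only.
ENV_GROUPS = {
    "training": ["PROJECT_ROOT", "DATASET_DIR", "LOG_DIR", "HYDRA_JOBS", "WANDB_DIR"],
    "relaxation": ["ZEO_PATH"],
    "gcmc": ["RASPA_PATH", "RASPA_SIM_PATH", "EGULP_PATH", "EGULP_PARAMETER_PATH"],
}


def missing_env_vars(groups, environ=None):
    """Return variables from the requested groups that are unset or empty."""
    environ = os.environ if environ is None else environ
    missing = []
    for group in groups:
        for name in ENV_GROUPS[group]:
            if not environ.get(name):
                missing.append(name)
    return missing


# Hypothetical environment: only the Zeo++ path is configured.
example_env = {"ZEO_PATH": "/opt/zeo/network"}
print(missing_env_vars(["relaxation"], example_env))  # prints []
print(missing_env_vars(["gcmc"], example_env))
```

Passing an explicit dictionary keeps the check easy to try without touching the real environment; omit the second argument to inspect os.environ instead.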

Process data

You can download the preprocessed BW-DB data from Zenodo (recommended). To use the preprocessed data, please extract bw_db.tar.gz into ${oc.env:DATASET_DIR}.

Alternatively, you can download the BW-DB raw data from Materials Cloud to ${raw_path} and preprocess with the following command. This step requires MOFid.

python mofdiff/preprocessing/extract_mofid.py --df_path ${raw_path}/all_MOFs_screening_data.csv --cif_path ${raw_path}/cifs --save_path ${raw_path}/mofid
python mofdiff/preprocessing/preprocess.py --df_path ${raw_path}/all_MOFs_screening_data.csv --mofid_path ${raw_path}/mofid --save_path ${raw_path}/graphs
python mofdiff/preprocessing/save_to_lmdb.py --graph_path ${raw_path}/graphs --save_path ${raw_path}/lmdbs

The preprocessing involves three steps:

Extract the MOFid for all structures (CPU).
Construct CG MOF data objects from MOFid deconstruction results (CPU or GPU).
Save the CG MOF objects to an LMDB database (relatively fast).

Preprocessing all of BW-DB may take several days, depending on the available CPU/GPU resources.
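
Because the three steps must run in order and each consumes the previous step's output directory, it can help to generate the commands from a single raw data path. The following is a sketch only: the script paths and flags are those shown above, while the helper preprocessing_commands and the example path /data/bwdb_raw are hypothetical.

```python
import shlex


def preprocessing_commands(raw_path):
    """Build the three preprocessing commands, in order, for one raw data directory."""
    csv = f"{raw_path}/all_MOFs_screening_data.csv"
    return [
        # Step 1: extract MOFid for all structures.
        ["python", "mofdiff/preprocessing/extract_mofid.py",
         "--df_path", csv,
         "--cif_path", f"{raw_path}/cifs",
         "--save_path", f"{raw_path}/mofid"],
        # Step 2: construct CG MOF data objects from the MOFid results.
        ["python", "mofdiff/preprocessing/preprocess.py",
         "--df_path", csv,
         "--mofid_path", f"{raw_path}/mofid",
         "--save_path", f"{raw_path}/graphs"],
        # Step 3: save the CG MOF objects to an LMDB database.
        ["python", "mofdiff/preprocessing/save_to_lmdb.py",
         "--graph_path", f"{raw_path}/graphs",
         "--save_path", f"{raw_path}/lmdbs"],
    ]


# Print the commands for inspection; to execute them, pass each list to
# subprocess.run(cmd, check=True) from the repository root.
for cmd in preprocessing_commands("/data/bwdb_raw"):
    print(shlex.join(cmd))
```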

Training

training the building block encoder

Before training the diffusion model, we need to train the building block encoder. The building block encoder is a graph neural network that encodes the building blocks of MOFs. The building block encoder is trained with the following command:

python mofdiff/scripts/train.py --config-name=bb

The default output directory is ${oc.env:HYDRA_JOBS}/bb/${expname}/. oc.env:HYDRA_JOBS is configured in .env. expname is configured in configs/bb.yaml. We use hydra for config management. All configs are stored in configs/. You can override the default output directory (and other config values) with command line arguments. For example:

python mofdiff/scripts/train.py --config-name=bb expname=bwdb_bb_dim_64 model.latent_dim=64

Logging is done with wandb by default. You need to log in to wandb with wandb login before training. The training logs will be saved to the wandb project mofdiff. You can also override the wandb project with command line arguments or disable wandb logging by removing the wandb entry in the config as demonstrated here.

training coarse-grained diffusion model for MOFs

Training the diffusion model requires bb_encoder_path, the output directory where the building block encoder is saved. By default, this path is ${oc.env:HYDRA_JOBS}/bb/${expname}/, as defined above. Train/validation splits are defined in splits, with examples provided for BW-DB. Once the building block encoder has been trained to convergence, train the CG diffusion model with the following command:

python mofdiff/scripts/train.py data.bb_encoder_path=${bb_encoder_path}
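
The two training stages are linked only through the encoder output directory. The sketch below shows one way to derive that directory and build both training commands; it mirrors the default ${oc.env:HYDRA_JOBS}/bb/${expname}/ layout described above with plain string handling rather than calling Hydra. The helper names and the /scratch/hydra path are hypothetical.

```python
import posixpath


def bb_output_dir(hydra_jobs, expname):
    """Mirror the default ${oc.env:HYDRA_JOBS}/bb/${expname}/ layout."""
    return posixpath.join(hydra_jobs, "bb", expname) + "/"


def train_command(config_name=None, overrides=None):
    """Build a train.py command with Hydra-style key=value overrides."""
    cmd = ["python", "mofdiff/scripts/train.py"]
    if config_name is not None:
        cmd.append(f"--config-name={config_name}")
    for key, value in (overrides or {}).items():
        cmd.append(f"{key}={value}")
    return cmd


# Stage 1: building block encoder with a custom experiment name.
expname = "bwdb_bb_dim_64"
stage1 = train_command("bb", {"expname": expname, "model.latent_dim": 64})

# Stage 2: diffusion model, pointed at the stage 1 output directory.
encoder_dir = bb_output_dir("/scratch/hydra", expname)  # hypothetical HYDRA_JOBS
stage2 = train_command(overrides={"data.bb_encoder_path": encoder_dir})

print(" ".join(stage1))
print(" ".join(stage2))
```

If expname is overridden for the encoder run, the same value has to be used when deriving bb_encoder_path, which is why both are computed from one variable here.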

For BW-DB, training the building block encoder takes roughly 3 days and training the diffusion model takes roughly 5 days on a single NVIDIA V100 GPU.

Generating CG MOF structures

Pretrained models can be found here. To use the pretrained models, please extract pretrained.tar.gz and bb_emb_space.tar.gz into ${oc.env:PROJECT_ROOT}/pretrained.

With a trained CG diffusion model ${diffusion_model_path}, generate random CG MOF structures with the following command, where ${bb_cache_path} is the path to the trained building block encoder bb_emb_space.pt, either sourced from the pretrained models or generated as described above.

python mofdiff/scripts/sample.py --model_path ${diffusion_model_path} --bb_cache_path ${bb_cache_path}

To optimize MOF structures for a property defined in BW-DB (e.g., CO2 adsorption working capacity), use the following command, where ${data_path} is the path to the processed data data.lmdb, either sourced from the pretrained models or generated as described above.

python mofdiff/scripts/optimize.py --model_path ${diffusion_model_path} --bb_cache_path ${bb_cache_path} --data_path ${data_path} --property "working_capacity_vacuum_swing [mmol/g]" --target_v 15.0

Available arguments for sample.py and optimize.py can be found in the respective files. The generated CG MOF structures will be saved in ${sample_path}=${diffusion_model_path}/${sample_tag} as samples.pt.
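
The property name passed to optimize.py contains a space and square brackets, so it must reach the script as a single argument. The sketch below builds both commands as argument lists, which avoids shell quoting mistakes, and shows where samples.pt is expected. The flags are those shown above; the helper names, the example paths, and the sample_tag value are hypothetical.

```python
import posixpath
import shlex


def sample_command(model_path, bb_cache_path):
    """Command for unconditional sampling of CG MOF structures."""
    return ["python", "mofdiff/scripts/sample.py",
            "--model_path", model_path,
            "--bb_cache_path", bb_cache_path]


def optimize_command(model_path, bb_cache_path, data_path, prop, target_v):
    """Command for property-driven optimization; prop stays one argument."""
    return ["python", "mofdiff/scripts/optimize.py",
            "--model_path", model_path,
            "--bb_cache_path", bb_cache_path,
            "--data_path", data_path,
            "--property", prop,
            "--target_v", str(target_v)]


def samples_file(model_path, sample_tag):
    """Expected location of samples.pt: ${diffusion_model_path}/${sample_tag}."""
    return posixpath.join(model_path, sample_tag, "samples.pt")


cmd = optimize_command(
    "pretrained/mofdiff_ckpt",             # hypothetical model directory
    "pretrained/bb_emb_space.pt",
    "data/data.lmdb",                      # hypothetical data path
    "working_capacity_vacuum_swing [mmol/g]",
    15.0,
)
# shlex.join adds the quoting needed to paste the command into a shell.
print(shlex.join(cmd))
```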

The CG structures generated with the diffusion model are not guaranteed to be realizable. We need to assemble the CG structures to recover the all-atom MOF structures. The following sections describe how to assemble the CG MOF structures; none of the subsequent steps require a GPU.

Assemble all-atom MOFs

Assemble all-atom MOF structures from the CG MOF structures with the following command:

python mofdiff/scripts/assemble.py --input ${sample_path}/samples.pt

This command will assemble the CG MOF structures in ${sample_path} and save the assembled MOFs in ${sample_path}/assembled.pt. The cif files of the assembled MOFs will be saved in ${sample_path}/cif. If the assembled MOFs came from property-driven optimization, the optimization arguments are saved to ${sample_path}/opt_args.json.

Relax MOFs and compute structural properties

The assembled structures may not be physically plausible, so they are relaxed using the UFF force field with LAMMPS. LAMMPS is already installed as part of the environment if you followed the installation instructions in this README. The relaxation script also computes structural properties (e.g., pore volume and surface area) with Zeo++ and the mofids of the generated MOFs with MOFid. These packages should be installed following the instructions in their respective repositories, and the corresponding paths should be added to .env as outlined above. Each step should take no more than a few minutes to complete on a single CPU. We use multiprocessing to parallelize the computation.

Relax MOFs and compute structural properties with the following command:

python mofdiff/scripts/uff_relax.py --input ${sample_path}

This command will relax the assembled MOFs in ${sample_path}/cif and save the relaxed MOFs in ${sample_path}/relaxed. The structural properties of the relaxed MOFs will be saved in ${sample_path}/relaxed/zeo_props_relax.json. The mofids of the relaxed MOFs will be saved in ${sample_path}/mofid.

GCMC simulation for gas adsorption

additional installation

To run GCMC simulations, first install RASPA2 (simulation software) and eGULP (charge calculation software). The paths to both should additionally be added to .env as outlined above.

RASPA2 can be installed with pip:

pip install "RASPA2==2.0.4"

You may need to install the following Linux dependencies first:

apt-get update 
apt-get install -yq libgsl0-dev pkg-config libxrender-dev

Install eGULP following the instructions in the repository. The following commands install eGULP in /usr/local/bin/egulp-master:

unzip egulp-master.zip -d /usr/local/bin
cd /usr/local/bin/egulp-master/src && make

Finally, RASPA2 requires a set of forcefield parameters with which to run the simulations. To use our default simulation settings, copy the UFF parameter set from ForceFields into the RASPA2 forcefield definition directory, typically located at ${oc.env:RASPA_PATH}/share/raspa/forcefield.

running simulations

Calculate charges for relaxed samples in ${sample_path} with the following command:

python mofdiff/scripts/calculate_charges.py --input ${sample_path}

This command will output cif files with charge information under ${sample_path}/mepo_qeq_charges.

Run GCMC simulations with the following command:

python mofdiff/scripts/gcmc_screen.py --input ${sample_path}/mepo_qeq_charges

The GCMC simulation results will be saved in ${sample_path}/gcmc/screening_results.json.
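
The post-generation steps (assemble, relax, charge calculation, GCMC) each read from and write to fixed locations under ${sample_path}. As a rough sketch, the snippet below pairs each command with one output it is documented to produce and lists the steps whose output is not yet present, which can be handy when resuming an interrupted run. The helpers downstream_steps and pending_steps are hypothetical and not part of MOFDiff.

```python
import os
from pathlib import PurePosixPath


def downstream_steps(sample_path):
    """Return (name, command, documented output) for each post-generation step."""
    root = PurePosixPath(sample_path)
    charges = root / "mepo_qeq_charges"
    return [
        ("assemble",
         ["python", "mofdiff/scripts/assemble.py", "--input", str(root / "samples.pt")],
         str(root / "assembled.pt")),
        ("relax",
         ["python", "mofdiff/scripts/uff_relax.py", "--input", str(root)],
         str(root / "relaxed" / "zeo_props_relax.json")),
        ("charges",
         ["python", "mofdiff/scripts/calculate_charges.py", "--input", str(root)],
         str(charges)),
        ("gcmc",
         ["python", "mofdiff/scripts/gcmc_screen.py", "--input", str(charges)],
         str(root / "gcmc" / "screening_results.json")),
    ]


def pending_steps(sample_path, exists=os.path.exists):
    """Names of steps whose documented output does not exist yet."""
    return [name for name, _cmd, output in downstream_steps(sample_path)
            if not exists(output)]


# Hypothetical sample directory; on a fresh directory all four steps are pending.
print(pending_steps("/models/run1/samples_tag"))
```

The existence check is deliberately shallow: a present output file does not guarantee the step finished successfully.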

We have found that RASPA2 may occasionally have trouble reading input files generated by Python. If you encounter errors of the general form "Creating molecules for more systems than the maximum allowed", please set the rewrite_raspa_input flag.

python mofdiff/scripts/gcmc_screen.py --input ${sample_path}/mepo_qeq_charges --rewrite_raspa_input
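
One way to apply this workaround only when needed is to look for the error text in the output of a first attempt and then rebuild the command with the flag. This is a sketch under assumptions: it presumes the message is visible in whatever output you capture from the run, and the helper names are hypothetical.

```python
# Error text quoted in this README for the RASPA2 input-reading problem.
RASPA_INPUT_ERROR = "Creating molecules for more systems than the maximum allowed"


def gcmc_command(charges_dir, rewrite_raspa_input=False):
    """Build the gcmc_screen.py command, optionally with the rewrite flag."""
    cmd = ["python", "mofdiff/scripts/gcmc_screen.py", "--input", charges_dir]
    if rewrite_raspa_input:
        cmd.append("--rewrite_raspa_input")
    return cmd


def should_retry_with_rewrite(captured_output):
    """True if the captured output contains the known RASPA2 error text."""
    return RASPA_INPUT_ERROR in captured_output


# Hypothetical captured output from a failed first attempt.
log = "ERROR: Creating molecules for more systems than the maximum allowed (1)"
if should_retry_with_rewrite(log):
    print(" ".join(gcmc_command("samples/mepo_qeq_charges", rewrite_raspa_input=True)))
```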

Responsible AI FAQ

What is MOFDiff?
- MOFDiff is a deep neural network that models metal organic framework (MOF) 3D structures.
What can MOFDiff do?
- MOFDiff allows you to train and sample from models that yield a coarse-grained representation of a MOF. It also includes functions for reassembly of an atomistic MOF structure from the coarse-grained representation and interfaces to other molecular simulation software for evaluation of structural and gas separation properties.
What is/are MOFDiff’s intended use(s)?
- MOFDiff is intended for research purposes only, for the machine learning for porous materials community.
How was MOFDiff evaluated? What metrics are used to measure performance?
- MOFDiff was evaluated on the validity and novelty of the MOF structures sampled from MOFDiff. Additionally, structures optimized for CO2 adsorption were evaluated based on their simulated CO2 adsorption performance.
What are the limitations of MOFDiff? How can users minimize the impact of MOFDiff’s limitations when using the system?
- The provided pretrained models are specific to the BW-DB dataset.
- While MOFDiff may in principle be trained on arbitrary datasets of MOF structures, it has been minimally tested in this capacity. We enable users to train additional models for research purposes. Please see the training instructions and associated publication above.
- MOFDiff has not been tested by real-world experiments to see if the MOF structures it samples are achievable.
- MOFDiff should be used for research purposes only.
What operational factors and settings allow for effective and responsible use of MOFDiff?
- MOFDiff should be used for research purposes only.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Acknowledgement

This codebase is based on several existing repositories:
