- GPNN is a package for ML prediction of charge density based on the methods described in Solving the electronic structure problem with machine learning by Ramprasad et. al.
- The NMC cathode dataset is used for demonstration. A pretrained model for this dataset is included and can be used for inference as is.
- The workflow is constructed as a (build -> train -> predict) pipeline.
- Build: Construct an HDF5 dataset file given a directory of CHGCAR (VASP) files.
- Train: Train an ML model on the preprocessed data (HDF5 file genereated from the BUILD step).
- Predict: Use the trained ML model to run inference on new structures (CIF -> Predicted CHGCAR)
- Charge density can be predicted orders of magnitude faster (1 to 3 minutes in total) than DFT and for larger systems.
- GPNN's core dependencies are:
- Hydra: Used for config management of hyperparameters during the build, train and predict stages (hydra-core==1.3.2).
- PyRho: Used for charge density management and interpolation (mp-pyrho==0.3.0).
- HDF5: Used for efficient data storage and organization (h5py==3.8.0).
- Dask: Used for distributed massive array manipulation (dask==2023.11.0).
- CuPy: Used in conjunction with Dask for GPU-accelerated array operations (cupy-cuda11x==13.0.0)
- Clone repo
- Create new conda env
- Run
pip install -e .
In order to simplify interdependent configurations, configs are organized into basic categories (data, model, train, inference) and composed by Hydra when scripts are run. Configs should be modified in place (not moved to other locations).
NOTE: Hydra includes a CLI that can be used to modify parameters on-the-fly instead of manually modifying the yaml configs. See "Run Experiments" for examples.
- Main Config:
- Navigate to
GPNN/configs/config.yaml
- This is the high level config that orchestrates the lower level configs (data, inference and training).
- If you will be training your own model, change "expname". Otherwise, leave it as is. This variable is used to select from pretrained models or name new models.
- Change "project_root" to the global path where you cloned the repo (path/to/GPNN).
- Navigate to
- Inference:
- Navigate to
GPNN/configs/inference/default.yaml
- Update the indicated variables
- Navigate to
- Training
- Navigate to
GPNN/configs/train/default.yaml
- Update the indicated variables
- Navigate to
- Data.
- If training on NMC, navigate to
GPNN/configs/data/NMC.yaml
and follow the instructions. - If defining a custom dataset, navigate to
GPNN/configs/data/template.yaml
and follow the instructions.
- If training on NMC, navigate to
Navigate to GPNN/scripts before running all scripts
-
Inference with Pretrained Model (included model is trained on NMC)
- If you completed step 2 under "Prepare Configs", simply run
python predict.py
- Otherwise, run
python predict.py inference.cif_path=<path/to/CIFs> inference.save_dir=<path/to/save/chgcars> inference.shape=<grid_shape> inference.device=<gpu|cpu>
- If you completed step 2 under "Prepare Configs", simply run
-
Train NMC from scratch
- Navigate to
GPNN/configs/config.yaml
- Change "expname" to a name of your choice. This will become the name of your model.
- Run
python train.py
- Your trained model will be stored in
GPNN/models/<expname>
- Navigate to
-
Train on custom dataset
- Navigate to
GPNN/configs/data/template.yaml
and follow the instructions. - You will be loading your CHGCAR files as described, and defining the relevant config parameters.
- Navigate to
GPNN/configs/config.yaml
- Change "expname" to a name of your choice, that best describes the dataset you want to train on. This will become the name of your model. Note, new experiments run under this name, will use the last model checkpoint under ../models/expname,
- Update line 10 of this file GPNN/configs/config.yaml to the name of your dataset.
- Run
python build.py
to build the dataset file. - Run
python train.py
to train a model on this data. - Run
python predict.py
to use your model for inference on CIF files.
- Navigate to
Charge density is represented as a set of scalar values defined on a regularly distributed discrete grid spanning the volume of a unit cell.
A neural network is trained to predict the charge density values based on a representation of the local atomic environment around each grid point. This representation (called the fingerprint) contains information on the location and types of atoms relative to each grid point.
The fingerprint is constructed as a combination of rotationally invariant scalar and vector components. The scalar component captures radial distance information while the vector component captures angular information.
The authors propose a predefined set of Gaussian functions (k) of varying widths (σk) centered about every grid-point (g) to determine these fingerprints. The scalar fingerprint (Sk) for a particular grid-point, g, and Gaussian, k, in an N-atom, single-elemental system is defined as
where rgi is the distance between the reference grid-point, g, and the atom, i, and fc(rgi) is a cutoff function, which decays to zero for atoms beyond the cutoff radius (default 6Å) from the grid-point. Ck is a normalizing constant (see paper for definition).
The vector component is defined as
NOTE: The tensor component described in the paper is ommitted due to the marginal performance improvement at significantly increased computational cost.