Dynamic PDB: A New Dataset and a SE(3) Model Extension by Integrating Dynamic Behaviors and Physical Properties in Protein Structures
Mohamed bin Zayed University of Artificial Intelligence
Dynamic PDB is a large-scale dataset that extends existing static 3D protein structure databases, such as the Protein Data Bank (PDB), with dynamic data and additional physical properties. It contains approximately 12.6k filtered proteins, each subjected to all-atom molecular dynamics (MD) simulations to capture conformational changes.
Compared with existing protein MD datasets, dynamic PDB provides three key advancements:
- Extended simulation durations: up to 1 microsecond per protein, facilitating a more comprehensive understanding of significant conformational changes.
- Finer-grained sampling intervals: 1 picosecond intervals, allowing for the capture of more detailed allosteric pathways.
- Enriched array of physical properties: captured during the MD process, including atomic velocities and forces, potential and kinetic energies, and the temperature of the simulation environment.
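As a quick sanity check on scale, a 1 microsecond simulation sampled every 1 picosecond yields up to one million frames per protein. The sketch below works through that arithmetic and estimates the raw size of one coordinate trajectory. The atom count and the 4-byte float size are illustrative assumptions, not values taken from the dataset.

```python
# Back-of-the-envelope arithmetic for one trajectory.
PS_PER_US = 1_000_000  # 1 microsecond = 10^6 picoseconds


def frame_count(duration_us: float, interval_ps: float) -> int:
    """Number of sampling intervals that fit in the simulated duration."""
    return round(duration_us * PS_PER_US / interval_ps)


def raw_size_gb(n_frames: int, n_atoms: int, bytes_per_float: int = 4) -> float:
    """Uncompressed size of an (n_frames, n_atoms, 3) float array in GB (10^9 bytes)."""
    return n_frames * n_atoms * 3 * bytes_per_float / 1e9


n_frames = frame_count(duration_us=1.0, interval_ps=1.0)
print(n_frames)                             # 1000000
# Hypothetical 1000-atom protein stored as 4-byte floats (both are assumptions).
print(raw_size_gb(n_frames, n_atoms=1000))  # 12.0
```

The same estimate applies to the velocity and force arrays, since each stores three components per atom per frame.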
The attributes contained in dynamic PDB are listed as follows:
File Name | Attribute | Data Type | Unit |
---|---|---|---|
{protein_id}_T.pkl | Trajectory coordinates | float array | Å |
{protein_id}_V.pkl | Atomic velocities | float array | Å/ps |
{protein_id}_F.pkl | Atomic forces | float array | kcal/(mol·Å) |
{protein_id}_npt_sim.dat | Potential energy | float | kJ/mol |
{protein_id}_npt_sim.dat | Kinetic energy | float | kJ/mol |
{protein_id}_npt_sim.dat | Total energy | float | kJ/mol |
{protein_id}_npt_sim.dat | Temperature | float | K |
{protein_id}_npt_sim.dat | Box volume | float | nm³ |
{protein_id}_npt_sim.dat | System density | float | g/mL |
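Note that the per-atom arrays and the scalar properties use different unit systems: forces are in kcal/mol per Å and velocities in Å/ps, while energies are in kJ/mol and the box volume in nm³. If you need to combine them, a conversion along the following lines may help. This is a minimal sketch; the function names are hypothetical and are not part of the repository.

```python
# Conversion factors between the two unit systems used in the tables above.
KJ_PER_KCAL = 4.184    # thermochemical calorie, exact by definition
NM_PER_ANGSTROM = 0.1  # 1 Å = 0.1 nm


def force_to_kj_per_mol_nm(force_kcal_per_mol_angstrom):
    """Convert a force from kcal/(mol·Å) to kJ/(mol·nm)."""
    return force_kcal_per_mol_angstrom * KJ_PER_KCAL / NM_PER_ANGSTROM


def velocity_to_nm_per_ps(velocity_angstrom_per_ps):
    """Convert a velocity from Å/ps to nm/ps."""
    return velocity_angstrom_per_ps * NM_PER_ANGSTROM


def length_to_nm(length_angstrom):
    """Convert a coordinate from Å to nm."""
    return length_angstrom * NM_PER_ANGSTROM


print(round(force_to_kj_per_mol_nm(1.0), 6))  # 41.84
print(round(velocity_to_nm_per_ps(10.0), 6))  # 1.0
```

The functions only multiply by constants, so they should also work element-wise on NumPy arrays if the pickled data is loaded as such.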
In addition, the following data are stored during the MD simulation:
File Name | Description |
---|---|
{protein_id}_minimized.pdb | PDB structure after minimization |
{protein_id}_nvt_equi.dat | Information in NVT equilibration |
{protein_id}_npt_equi.dat | Information in NPT equilibration |
{protein_id}_T.dcd | DCD format for trajectory coordinates |
{protein_id}_state_npt1000000.0.xml | Status file for MD prolongation |
The dynamic PDB dataset is available from our ModelScope repo.
Clone it into the ${DATA_ROOT}/dynamicPDB directory with the commands below:
git lfs install
git clone https://www.modelscope.cn/datasets/fudan-generative-vision/dynamicPDB.git dynamicPDB
Finally, the dataset should be organized as follows:
./dynamicPDB/
|-- 1ab1_A_npt1000000.0_ts0.001
| |-- 1ab1_A_npt_sim_data
| | |-- 1ab1_A_npt_sim_0.dat
| | `-- ...
| |-- 1ab1_A_dcd
| | |-- 1ab1_A_dcd_0.dcd
| | `-- ...
| |-- 1ab1_A_T
| | |-- 1ab1_A_T_0.pkl
| | `-- ...
| |-- 1ab1_A_F
| | |-- 1ab1_A_F_0.pkl
| | `-- ...
| |-- 1ab1_A_V
| | |-- 1ab1_A_V_0.pkl
| | `-- ...
| |-- 1ab1_A.pdb
| |-- 1ab1_A_minimized.pdb
| |-- 1ab1_A_nvt_equi.dat
| |-- 1ab1_A_npt_equi.dat
| |-- 1ab1_A_T.dcd
| |-- 1ab1_A_T.pkl
| |-- 1ab1_A_F.pkl
| |-- 1ab1_A_V.pkl
| `-- 1ab1_A_state_npt1000000.0.xml
|-- 1uoy_A_npt1000000.0_ts0.001
| |-- ...
| `-- ...
`-- ...
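The layout above can be walked programmatically. The sketch below parses a protein directory name and returns the chunk files of one kind (T, F, or V) in numeric order, which matters because a plain alphabetical sort would place chunk 10 before chunk 2. The helper names are hypothetical, and the two numeric fields after `npt` and `ts` are returned as plain floats because their units are not stated here.

```python
import re
from pathlib import Path
from typing import Dict, List

# Matches names such as 1ab1_A_npt1000000.0_ts0.001
DIR_PATTERN = re.compile(
    r"^(?P<pdb_id>[0-9A-Za-z]{4})_(?P<chain>[^_]+)"
    r"_npt(?P<npt>[0-9.]+)_ts(?P<ts>[0-9.]+)$"
)


def parse_protein_dir(name: str) -> Dict[str, object]:
    """Split a directory name into the protein id and its two numeric fields."""
    match = DIR_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected directory name: {name}")
    return {
        "protein_id": f"{match['pdb_id']}_{match['chain']}",
        "npt_value": float(match["npt"]),  # units not interpreted here
        "ts_value": float(match["ts"]),    # units not interpreted here
    }


def sorted_chunks(protein_dir: Path, kind: str) -> List[Path]:
    """Return chunk files like 1ab1_A_T/1ab1_A_T_0.pkl ordered by chunk index."""
    protein_id = parse_protein_dir(protein_dir.name)["protein_id"]
    chunk_dir = protein_dir / f"{protein_id}_{kind}"
    files = chunk_dir.glob(f"{protein_id}_{kind}_*.pkl")
    return sorted(files, key=lambda path: int(path.stem.rsplit("_", 1)[1]))


print(parse_protein_dir("1ab1_A_npt1000000.0_ts0.001"))
# {'protein_id': '1ab1_A', 'npt_value': 1000000.0, 'ts_value': 0.001}
```

Each returned path can then be opened with Python's `pickle` module. As with any pickle file, only load data from a source you trust.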
We extend the SE(3) diffusion model to incorporate sequence features and physical properties for the task of trajectory prediction.
Specifically, given an initial 3D structure of the protein, the task is to predict the 3D structure at the next time step.
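To make the task concrete, a trajectory can be turned into (current frame, next frame) pairs. The following is a minimal sketch of that pairing only. It does not reflect how the model itself conditions on sequence features or physical properties, and the `stride` parameter is a hypothetical addition for illustration.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def next_step_pairs(
    frames: Sequence[Frame], stride: int = 1
) -> List[Tuple[Frame, Frame]]:
    """Pair each frame with the frame `stride` steps later."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return [(frames[t], frames[t + stride]) for t in range(len(frames) - stride)]


# Frames are shown as labels; in practice each would be a coordinate array.
print(next_step_pairs(["x0", "x1", "x2"]))
# [('x0', 'x1'), ('x1', 'x2')]
```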
We compare the 3D structures predicted by our method with those predicted by SE(3)-Trans and with the ground truth.
SE(3)-Trans | Ours | Ground Truth |
We present the network architecture, where the predicted 3D structures are conditioned on the amino acid sequence and physical properties.
pip install -r requirements.txt
pip install .
./DATA/
|-- 16pk_A
| |-- 16pk_A.pdb
| |-- 16pk_A.npz
| |-- 16pk_A_new_w_pp.npz
| |-- 16pk_A_F_Ca.pkl
| `-- 16pk_A_V_ca.pkl
|-- 1b2s_F
| |-- ...
| `-- ...
`-- ...
For each protein xxxx_x:
- xxxx_x.pdb is the PDB file of the protein;
- xxxx_x.npz holds the node and edge features from OmegaFold, produced by ./data_preprocess/extract_embedding.py;
- xxxx_x_new_w_pp.npz is the trajectory of the protein, produced by running ./data_preprocess/post_process.py and then ./data_preprocess/prep_atlas_with_forces.py;
- xxxx_x_F_Ca.pkl holds the forces on the Cα atoms of the protein, produced by ./data_preprocess/atom_select.py;
- xxxx_x_V_ca.pkl holds the velocities of the Cα atoms of the protein, produced by ./data_preprocess/atom_select.py.
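For readers who want to see what Cα selection involves, here is a minimal sketch. It is not the repository's atom_select.py. It assumes that row i of a per-atom array corresponds to the i-th ATOM/HETATM record of the PDB file, which should be verified against the actual data. The atom name is read from columns 13 to 16 of each record, as defined by the PDB file format.

```python
from typing import Iterable, List, Sequence


def ca_atom_indices(pdb_lines: Iterable[str]) -> List[int]:
    """Zero-based positions of C-alpha atoms among the ATOM/HETATM records."""
    indices: List[int] = []
    atom_position = 0
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        # Only ATOM records count as C-alpha; a HETATM named CA may be calcium.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            indices.append(atom_position)
        atom_position += 1
    return indices


def select_atoms(frames: Sequence[Sequence], indices: Sequence[int]) -> List[List]:
    """Keep only the chosen atoms in every frame of a [frame][atom][xyz] array."""
    return [[frame[i] for i in indices] for frame in frames]


pdb_lines = [
    "ATOM      1  N   MET A   1",
    "ATOM      2  CA  MET A   1",
    "ATOM      3  C   MET A   1",
    "ATOM      4  N   GLY A   2",
    "ATOM      5  CA  GLY A   2",
]
indices = ca_atom_indices(pdb_lines)
print(indices)  # [1, 4]

# One frame with five atoms; the z component encodes the atom position.
forces = [[[0.0, 0.0, float(atom)] for atom in range(5)]]
print(select_atoms(forces, indices))  # [[[0.0, 0.0, 1.0], [0.0, 0.0, 4.0]]]
```

With NumPy arrays, the same selection can be written as `array[:, indices]`.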
Prepare a list of proteins for training in train_proteins.csv as below:
name | seqres | release_date | msa_id | atlas_npz | embed_path | seq_len | force_path | vel_path | pdb_path |
---|---|---|---|---|---|---|---|---|---|
16pk_A | EKKSIN... | 1998/11/25 | 16pk_A | ./DATA/16pk_A/16pk_A_new_w_pp.npz | ./DATA/16pk_A/16pk_A.npz | 415 | ./DATA/16pk_A/16pk_A_F_Ca.pkl | ./DATA/16pk_A/16pk_A_V_ca.pkl | ./DATA/16pk_A/16pk_A.pdb |
... |
Prepare test_proteins.csv in the same format.
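Since a wrong path in either CSV file only surfaces once training or evaluation starts, it can be worth checking the files first. The sketch below is a hypothetical helper, not part of the repository. It assumes the CSV is comma-separated with a header row containing the columns listed above, and that seq_len equals the length of seqres.

```python
import csv
from pathlib import Path
from typing import List

REQUIRED_COLUMNS = [
    "name", "seqres", "release_date", "msa_id", "atlas_npz",
    "embed_path", "seq_len", "force_path", "vel_path", "pdb_path",
]
PATH_COLUMNS = ["atlas_npz", "embed_path", "force_path", "vel_path", "pdb_path"]


def check_protein_csv(csv_path: str, root: str = ".") -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems: List[str] = []
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [column for column in REQUIRED_COLUMNS if column not in header]
        if missing:
            return [f"missing columns: {', '.join(missing)}"]
        # Row 1 is the header, so data rows start at line 2.
        for line_number, row in enumerate(reader, start=2):
            # Assumption: seq_len is the number of residues in seqres.
            if len(row["seqres"]) != int(row["seq_len"]):
                problems.append(f"line {line_number}: seq_len does not match seqres")
            for column in PATH_COLUMNS:
                if not (Path(root) / row[column]).is_file():
                    problems.append(
                        f"line {line_number}: {column} not found: {row[column]}"
                    )
    return problems
```

Calling `check_protein_csv("train_proteins.csv")` from the directory that contains ./DATA should return an empty list when every referenced file exists.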
sh run_train.sh
Key arguments in run_train.sh:
data.keep_first: frames with indices in [0, data.keep_first) of each trajectory are used for training
csv_path: the path for train_proteins.csv
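The interval for data.keep_first is half-open: the frame at index data.keep_first itself is excluded. The sketch below illustrates that convention on a toy trajectory. It is an illustration of the indexing only, not the repository's data loader, and the function name is hypothetical.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def keep_first_frames(trajectory: Sequence[Frame], keep_first: int) -> List[Frame]:
    """Return the frames whose indices fall in the half-open interval [0, keep_first)."""
    if keep_first < 0:
        raise ValueError("keep_first must be non-negative")
    return list(trajectory[:keep_first])


frames = ["f0", "f1", "f2", "f3", "f4"]
print(keep_first_frames(frames, 3))  # ['f0', 'f1', 'f2']
```

If keep_first exceeds the trajectory length, the slice simply returns every frame.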
sh run_eval.sh
Key arguments in run_eval.sh:
model_path: path of pretrained models
start_idx: the time index in the trajectory for evaluation
data.test_csv_path: path for test_proteins.csv
We would like to thank the contributors to OpenFold and OmegaFold.
If we have missed any open-source projects or related articles, we will add them to the acknowledgements promptly.