ACEsuit / mace

MACE - Fast and accurate machine learning interatomic potentials with higher order equivariant message passing.

Compiling MACE model for LAMMPS

anarber opened this issue

I have trained a MACE model with the recommended input parameters and would like to use it for molecular dynamics in LAMMPS. Following the documentation, I tried to compile the trained model for use in LAMMPS using the script provided in the develop branch (create_lammps_model.py). When running this script on a cluster, I get the following error:

/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/cuda/__init__.py:82: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:112.)
  return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
  File "~/mace/scripts/create_lammps_model.py", line 7, in <module>
    model = torch.load(model_path)
  File "/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/serialization.py", line 712, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/serialization.py", line 1046, in _load
    result = unpickler.load()
  File "/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/serialization.py", line 1016, in persistent_load
    load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/serialization.py", line 1001, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/serialization.py", line 176, in default_restore_location
    result = fn(storage, location)
  File "/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/serialization.py", line 152, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/serialization.py", line 136, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

I was able to train on GPUs, but it isn't recognizing CUDA for this script. Would you please let me know if you've encountered this error before and any suggestions on how to fix it? If you need any more information, please let me know.
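
For anyone hitting the same RuntimeError: the message itself suggests passing map_location to torch.load. Below is a minimal sketch of that workaround. The helper name load_model_on_cpu is hypothetical and is not part of mace. It only sidesteps the error; it does not explain why CUDA is not visible to the interpreter.

```python
def load_model_on_cpu(model_path):
    """Load a pickled PyTorch model, remapping any CUDA storages to the CPU.

    Hypothetical helper for illustration; not part of the mace code base.
    """
    # Imported lazily so the helper can be defined even where torch is missing.
    import torch

    # map_location places every deserialized storage on the CPU, which avoids
    # "Attempting to deserialize object on a CUDA device" on CPU-only nodes.
    # Note: PyTorch 2.6 and later default to weights_only=True, so loading a
    # fully pickled model there may additionally need weights_only=False.
    return torch.load(model_path, map_location=torch.device("cpu"))
```

It could be called as load_model_on_cpu("MACE.model"), where the file name is a placeholder for your own trained model.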

Thanks for the interest. The first thing to try is to use the exact same machine and environment (modules, conda if applicable, etc) as you did when training. Is this the case?

Thanks for the quick response. Yes, that is the case. Same hpc/environment for training and model evaluation worked fine.

That surprises me, but I believe you. This part of the error suggests otherwise:

but torch.cuda.is_available() is False

What happens if you try

import torch
torch.cuda.is_available()

in that environment?
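
A slightly fuller version of the same check, as a sketch: it also reports which torch installation the interpreter actually picked up, which can help when several installations are visible at once. The function name cuda_report is an assumption made for illustration.

```python
def cuda_report():
    """Return a small dict describing the torch build visible to this interpreter."""
    try:
        import torch
    except ImportError:
        # torch is not importable from this environment at all.
        return {"torch_found": False}
    return {
        "torch_found": True,
        "version": torch.__version__,          # version of the imported build
        "location": torch.__file__,            # where that build lives on disk
        "built_for_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Running it inside the job script, rather than on a login node, shows what the compile step itself will see.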

There seems to be a conflict in the python versions. I'll check it out and get back to you by next week. Thanks for your help!

I fixed the python dependency issue, but I am still getting an error. I noticed the create_lammps_model.py script is no longer on the github, but I uploaded the script I had from a previous version. This is being run on the YOUNG cluster at UCL, and on ARC at Oxford. Both are giving the same error. Any suggestion would be much appreciated. Thanks!

Here is how I installed mace:

module purge
module load beta-modules
module load gcc-libs/10.2.0
module load python/3.9.6-gnu-10.2.0
python3 -m venv ~/py-venv/mace
source ~/py-venv/mace/bin/activate
pip3 install numpy scipy matplotlib ase opt_einsum prettytable pandas e3nn
git clone https://github.com/ACEsuit/mace.git
pip3 install ./mace

Then this is my submission script:
#!/bin/bash -l
#$ -cwd
#$ -P Free
#$ -A MCC_pool
#$ -l h_rt=00:10:00
#$ -l gpu=1
#$ -l mem=10G
#$ -N MACE

module purge
module load beta-modules
module load gcc-libs/10.2.0
module load python/3.9.6-gnu-10.2.0
module load cuda/11.3.1/gnu-10.2.0
module load cudnn/8.2.1.32/cuda-11.3
module load pytorch/1.11.0/gpu
module load compilers/gnu/4.9.2

source ~/py-venv/mace/bin/activate

~/py-venv/mace/bin/python3 ~/mace/scripts/create_lammps_model.py ~/MACE.model

The error message:

/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/nn/modules/container.py:487: UserWarning: Setting attributes on ParameterList is not supported.
  warnings.warn("Setting attributes on ParameterList is not supported.")
Traceback (most recent call last):
  File "/mace/scripts/create_lammps_model.py", line 10, in <module>
    lammps_model_compiled = jit.compile(lammps_model)
  File "/py-venv/mace/lib/python3.9/site-packages/e3nn/util/jit.py", line 99, in compile
    compile(
  File "/py-venv/mace/lib/python3.9/site-packages/e3nn/util/jit.py", line 99, in compile
    compile(
  File "/py-venv/mace/lib/python3.9/site-packages/e3nn/util/jit.py", line 99, in compile
    compile(
  [Previous line repeated 3 more times]
  File "/py-venv/mace/lib/python3.9/site-packages/e3nn/util/jit.py", line 111, in compile
    mod = torch.jit.script(mod, **script_options)
  File "/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/jit/_script.py", line 1265, in script
    return torch.jit._recursive.create_script_module(
  File "/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/jit/_recursive.py", line 454, in create_script_module
    return create_script_module_impl(nn_module, concrete_type, stubs_fn)
  File "/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/jit/_recursive.py", line 520, in create_script_module_impl
    create_methods_and_properties_from_stubs(concrete_type, method_stubs, property_stubs)
  File "/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/jit/_recursive.py", line 371, in create_methods_and_properties_from_stubs
    concrete_type._create_methods_and_properties(property_defs, property_rcbs, method_defs, method_rcbs, method_defaults)
RuntimeError:
Only constant Sequential, ModueList, or ModuleDict can be used as an iterable:
  File "/py-venv/mace/lib/python3.9/site-packages/mace/modules/symmetric_contraction.py", line 220
        )
        for i, (weight, contract_weights, contract_features) in enumerate(
            zip(self.weights, self.contractions_weighting, self.contractions_features)
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        ):
            c_tensor = contract_weights(

Hi! Can you please make sure you are using the develop branch of mace for LAMMPS evaluation?

[mace]$ git branch
* develop
  main

Does that look right?

Yes - that does look right, although this part (copied from above)

git clone https://github.com/ACEsuit/mace.git
pip3 install ./mace

suggests you installed main.

I noticed the create_lammps_model.py script is no longer on the github

It's been moved to /mace/cli, but the instructions haven't been updated - thanks for noticing.

Hi, I redid the installation using the same method as previously written with the exception of installing the develop branch with the command: git clone https://github.com/ACEsuit/mace.git --branch develop. Now git branch only returns develop; however, I am still getting the same error.

Thanks for your patience. I'm not totally sure.

@ilyes319, torchscript seems to be complaining about this part:

        for i, (weight, contract_weights, contract_features) in enumerate(
            zip(self.weights, self.contractions_weighting, self.contractions_features)
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        ):
            c_tensor = contract_weights(

Can you think of anything (here or elsewhere) that might have changed in recent merges to cause this?

Mmmm this file has not changed in ages. It could be the pytorch version. What is your pytorch version @anarber ?

$ pip3 show torch
Name: torch
Version: 2.1.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: ~/py-venv/mace/lib/python3.9/site-packages
Requires: sympy, nvidia-nvtx-cu12, fsspec, nvidia-cusolver-cu12, nvidia-cufft-cu12, triton, filelock, nvidia-cudnn-cu12, typing-extensions, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cusparse-cu12, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, jinja2, networkx, nvidia-cuda-runtime-cu12
Required-by: torch-ema, opt-einsum-fx, mace, e3nn

I have never tried with 2.1.0. Can you try installing 2.0?

Sure, I'll give it a try. Thanks for your help!

I'm getting the same error with torch 2.0.0.

Can you run the test in test/test_models.py and see if you get the same error? Was your model trained with 2.1? If so, can you try training it with 2.0?

Hi, so running test_models.py does not give any error when run with torch 2.0. Yes, the model was initially trained on 2.1, and I have just trained with 2.0 also without any errors.

And the 2.0 model gives an error when compiling?

Yes, it looks like the same error to me.

/lustre/shared/ucl/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/nn/modules/container.py:487: UserWarning: Setting attributes on ParameterList is not supported.
  warnings.warn("Setting attributes on ParameterList is not supported.")
Traceback (most recent call last):
  File "/home/mmm1217/mace/scripts/create_lammps_model.py", line 10, in <module>
    lammps_model_compiled = jit.compile(lammps_model)
  File "/home/mmm1217/py-venv/mace/lib/python3.9/site-packages/e3nn/util/jit.py", line 99, in compile
    compile(
  File "/home/mmm1217/py-venv/mace/lib/python3.9/site-packages/e3nn/util/jit.py", line 99, in compile
    compile(
  File "/home/mmm1217/py-venv/mace/lib/python3.9/site-packages/e3nn/util/jit.py", line 99, in compile
    compile(
  [Previous line repeated 3 more times]
  File "/home/mmm1217/py-venv/mace/lib/python3.9/site-packages/e3nn/util/jit.py", line 111, in compile
    mod = torch.jit.script(mod, **script_options)
  File "/lustre/shared/ucl/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/jit/_script.py", line 1265, in script
    return torch.jit._recursive.create_script_module(
  File "/lustre/shared/ucl/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/jit/_recursive.py", line 454, in create_script_module
    return create_script_module_impl(nn_module, concrete_type, stubs_fn)
  File "/lustre/shared/ucl/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/jit/_recursive.py", line 520, in create_script_module_impl
    create_methods_and_properties_from_stubs(concrete_type, method_stubs, property_stubs)
  File "/lustre/shared/ucl/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/jit/_recursive.py", line 371, in create_methods_and_properties_from_stubs
    concrete_type._create_methods_and_properties(property_defs, property_rcbs, method_defs, method_rcbs, method_defaults)
RuntimeError:
Only constant Sequential, ModueList, or ModuleDict can be used as an iterable:
  File "/home/mmm1217/py-venv/mace/lib/python3.9/site-packages/mace/modules/symmetric_contraction.py", line 220
        )
        for i, (weight, contract_weights, contract_features) in enumerate(
            zip(self.weights, self.contractions_weighting, self.contractions_features)
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        ):
            c_tensor = contract_weights(

Can you give me a model so I can try to reproduce it?

MACE_CsPbI3.model.zip
Hi, I have attached my model. Thanks!

It works for me on my setup. Here is the compiled model. Could you try restarting from a fresh environment, making sure it is a torch 2.0 model?
MACE_CsPbI3_jit.zip

Just to confirm, does this installation method look correct? I will try one more time from scratch and let you know.

module purge
module load beta-modules
module load gcc-libs/10.2.0
module load python/3.9.6-gnu-10.2.0
python3 -m venv ~/py-venv/mace
source ~/py-venv/mace/bin/activate
pip3 install numpy scipy matplotlib ase opt_einsum prettytable pandas e3nn
pip3 install torch==2.0.0 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
git clone https://github.com/ACEsuit/mace.git --branch develop
pip3 install ./mace

That looks good to me

I managed to compile the LAMMPS model. I will copy my code for installing mace and running the script. There is still a pytorch warning that does not prevent the code from running, but I will include it below just so you are aware. Thanks for your help with this. I am happy for you to close the issue now.

Install mace:

module purge
module load beta-modules
module load gcc-libs/10.2.0
module load python/3.9.6-gnu-10.2.0
module load cuda/11.3.1/gnu-10.2.0
module load cudnn/8.2.1.32/cuda-11.3
module load compilers/gnu/4.9.2

python3 -m venv ~/py-venv/mace
source ~/py-venv/mace/bin/activate

pip3 install numpy scipy matplotlib ase opt_einsum prettytable pandas e3nn
pip3 install torch==2.0.0 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

git clone https://github.com/ACEsuit/mace.git --branch develop
pip3 install ./mace

Run create_lammps_model.py:

module purge
module load beta-modules
module load gcc-libs/10.2.0
module load python/3.9.6-gnu-10.2.0 
module load cuda/11.3.1/gnu-10.2.0
module load cudnn/8.2.1.32/cuda-11.3
module load compilers/gnu/4.9.2

source ~/py-venv/mace/bin/activate

~/py-venv/mace/bin/python3 ~/mace/scripts/create_lammps_model.py ~/MACE_CsPbI3.model

Remaining pytorch warning:

/~/py-venv/mace/lib/python3.9/site-packages/torch/jit/_check.py:172: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn("The TorchScript type system doesn't support "

Hi @anarber, thanks - this kind of detailed reporting is really very helpful. Am I interpreting correctly that the main difference in your final attempt is to load the system CUDA and cuDNN before creating the virtual environment? And is this on Young or your Oxford cluster?

This is all on YOUNG. I did change how I set up the environment, but after more checks, it looks like the primary issue was with the submission script: it loaded a module version of PyTorch (pytorch/1.11.0/gpu), which was incompatible. After removing that module load from the script, create_lammps_model.py works both in the current mace environment and in the previous version I had initially used. Another member of my group has used the current method to run mace successfully on YOUNG as well.
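
For future readers, a quick way to spot this kind of mismatch: the failing tracebacks earlier in the thread show torch files under a shared .../pytorch/1.11.0/... module path, while pip show lists torch inside ~/py-venv/mace. The sketch below checks where a package would be imported from and whether that location sits inside the active environment. The function name locate_module is hypothetical, and the check relies only on the standard library.

```python
import importlib.util
import sys
from pathlib import Path


def locate_module(name):
    """Report where `name` would be imported from and whether it is under sys.prefix."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        # Not importable (or a namespace package without a single origin file).
        return {"name": name, "found": False, "origin": None, "inside_env": False}
    origin = Path(spec.origin).resolve()
    # Inside an activated venv, sys.prefix points at the venv directory.
    env_root = Path(sys.prefix).resolve()
    inside_env = env_root in origin.parents
    return {"name": name, "found": True, "origin": str(origin), "inside_env": inside_env}


if __name__ == "__main__":
    print(locate_module("torch"))
```

If inside_env comes back False for torch while the virtual environment is active, a system module (for example one added through PYTHONPATH by module load) is presumably shadowing the pip-installed copy.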

Awesome that it works, I am closing now.