Compiling MACE model for LAMMPS

Question

anarber opened this issue 8 months ago

I have trained a MACE model with the recommended input parameters and would like to use it for molecular dynamics in LAMMPS. Following the documentation, I tried to compile the trained model for use in LAMMPS using the script provided in the develop branch (create_lammps_model.py). When running this script on a cluster, I get the following error:

/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/cuda/__init__.py:82: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:112.)
  return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
  File "~/mace/scripts/create_lammps_model.py", line 7, in <module>
    model = torch.load(model_path)
  File "/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/serialization.py", line 712, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/serialization.py", line 1046, in _load
    result = unpickler.load()
  File "/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/serialization.py", line 1016, in persistent_load
    load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/serialization.py", line 1001, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/serialization.py", line 176, in default_restore_location
    result = fn(storage, location)
  File "/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/serialization.py", line 152, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/serialization.py", line 136, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

I was able to train on GPUs, but it isn't recognizing CUDA for this script. Would you please let me know if you've encountered this error before and any suggestions on how to fix it? If you need any more information, please let me know.
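The RuntimeError above names its own workaround: pass map_location to torch.load. A minimal sketch of that, assuming the model file is a fully pickled torch model; pick_map_location and load_model are hypothetical helper names, not part of MACE or of create_lammps_model.py:

```python
from typing import Any


def pick_map_location(cuda_available: bool) -> str:
    """Return the device string to pass as torch.load's map_location."""
    return "cuda" if cuda_available else "cpu"


def load_model(model_path: str) -> Any:
    """Load a pickled model, remapping CUDA tensors to the CPU if needed.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so pick_map_location can be used without torch installed.
    import torch

    device = pick_map_location(torch.cuda.is_available())
    # map_location is the argument named in the error message above.
    return torch.load(model_path, map_location=device)
```

This only sidesteps the deserialization error; it does not explain why torch.cuda.is_available() returns False on a GPU node. On recent PyTorch releases, where torch.load defaults to weights_only=True, a fully pickled model may also need weights_only=False.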

wcwitt · Answer 1 · Fri Oct 27 2023 00:12:05 GMT+0800 (China Standard Time)

Thanks for the interest. The first thing to try is to use the exact same machine and environment (modules, conda if applicable, etc) as you did when training. Is this the case?

anarber · Answer 2 · Fri Oct 27 2023 00:13:53 GMT+0800 (China Standard Time)

Thanks for the quick response. Yes, that is the case. The same HPC cluster and environment worked fine for both training and model evaluation.

wcwitt · Answer 3 · Fri Oct 27 2023 00:29:36 GMT+0800 (China Standard Time)

That surprises me, but I believe you. This part suggests otherwise:

but torch.cuda.is_available() is False

What happens if you try

import torch
torch.cuda.is_available()

in that environment?

anarber · Answer 4 · Fri Oct 27 2023 00:44:11 GMT+0800 (China Standard Time)

There seems to be a conflict in the python versions. I'll check it out and get back to you by next week. Thanks for your help!

anarber · Answer 5 · Sat Nov 04 2023 00:18:01 GMT+0800 (China Standard Time)

I fixed the python dependency issue, but I am still getting an error. I noticed the create_lammps_model.py script is no longer on the github, but I uploaded the script I had from a previous version. This is being run on the YOUNG cluster at UCL, and on ARC at Oxford. Both are giving the same error. Any suggestion would be much appreciated. Thanks!

Here is how I installed mace:

module purge
module load beta-modules
module load gcc-libs/10.2.0
module load python/3.9.6-gnu-10.2.0
python3 -m venv ~/py-venv/mace
source ~/py-venv/mace/bin/activate
pip3 install numpy scipy matplotlib ase opt_einsum prettytable pandas e3nn
git clone https://github.com/ACEsuit/mace.git
pip3 install ./mace

Then this is my submission script:
#!/bin/bash -l
#$ -cwd
#$ -P Free
#$ -A MCC_pool
#$ -l h_rt=00:10:00
#$ -l gpu=1
#$ -l mem=10G
#$ -N MACE

module purge
module load beta-modules
module load gcc-libs/10.2.0
module load python/3.9.6-gnu-10.2.0
module load cuda/11.3.1/gnu-10.2.0
module load cudnn/8.2.1.32/cuda-11.3
module load pytorch/1.11.0/gpu
module load compilers/gnu/4.9.2

source ~/py-venv/mace/bin/activate

~/py-venv/mace/bin/python3 ~/mace/scripts/create_lammps_model.py ~/MACE.model

The error message:

/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/nn/modules/container.py:487: UserWarning: Setting attributes on ParameterList is not supported.
  warnings.warn("Setting attributes on ParameterList is not supported.")
Traceback (most recent call last):
  File "/mace/scripts/create_lammps_model.py", line 10, in <module>
    lammps_model_compiled = jit.compile(lammps_model)
  File "/py-venv/mace/lib/python3.9/site-packages/e3nn/util/jit.py", line 99, in compile
    compile(
  File "/py-venv/mace/lib/python3.9/site-packages/e3nn/util/jit.py", line 99, in compile
    compile(
  File "/py-venv/mace/lib/python3.9/site-packages/e3nn/util/jit.py", line 99, in compile
    compile(
  [Previous line repeated 3 more times]
  File "/py-venv/mace/lib/python3.9/site-packages/e3nn/util/jit.py", line 111, in compile
    mod = torch.jit.script(mod, **script_options)
  File "/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/jit/_script.py", line 1265, in script
    return torch.jit._recursive.create_script_module(
  File "/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/jit/_recursive.py", line 454, in create_script_module
    return create_script_module_impl(nn_module, concrete_type, stubs_fn)
  File "/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/jit/_recursive.py", line 520, in create_script_module_impl
    create_methods_and_properties_from_stubs(concrete_type, method_stubs, property_stubs)
  File "/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/jit/_recursive.py", line 371, in create_methods_and_properties_from_stubs
    concrete_type._create_methods_and_properties(property_defs, property_rcbs, method_defs, method_rcbs, method_defaults)
RuntimeError:
Only constant Sequential, ModueList, or ModuleDict can be used as an iterable:
  File "/py-venv/mace/lib/python3.9/site-packages/mace/modules/symmetric_contraction.py", line 220
        )
        for i, (weight, contract_weights, contract_features) in enumerate(
            zip(self.weights, self.contractions_weighting, self.contractions_features)
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        ):
            c_tensor = contract_weights(


davkovacs · Answer 6 · Sat Nov 04 2023 00:19:19 GMT+0800 (China Standard Time)

Hi! Can you please make sure you are using the develop branch of mace for LAMMPS evaluation?

anarber · Answer 7 · Sat Nov 04 2023 00:27:13 GMT+0800 (China Standard Time)

[mace]$ git branch
* develop
  main

Does that look right?

wcwitt · Answer 8 · Sat Nov 04 2023 00:30:12 GMT+0800 (China Standard Time)

Yes - that does look right, although this part (copied from above)

git clone https://github.com/ACEsuit/mace.git
pip3 install ./mace

suggests you installed main.

wcwitt · Answer 9 · Sat Nov 04 2023 00:32:08 GMT+0800 (China Standard Time)

I noticed the create_lammps_model.py script is no longer on the github

It's been moved to /mace/cli, but the instructions haven't been updated - thanks for noticing.

anarber · Answer 10 · Mon Nov 06 2023 17:41:51 GMT+0800 (China Standard Time)

Hi, I redid the installation using the same method as previously written with the exception of installing the develop branch with the command: git clone https://github.com/ACEsuit/mace.git --branch develop. Now git branch only returns develop; however, I am still getting the same error.

wcwitt · Answer 11 · Mon Nov 06 2023 18:23:04 GMT+0800 (China Standard Time)

Thanks for your patience. I'm not totally sure.

@ilyes319, torchscript seems to be complaining about this part:

for i, (weight, contract_weights, contract_features) in enumerate(
zip(self.weights, self.contractions_weighting, self.contractions_features)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
):
c_tensor = contract_weights(

Can you think of anything (here or elsewhere) that might have changed in recent merges to cause this?

Ilyes Batatia · Answer 12 · Mon Nov 06 2023 18:34:22 GMT+0800 (China Standard Time)

Mmmm this file has not changed in ages. It could be the pytorch version. What is your pytorch version @anarber ?

anarber · Answer 13 · Mon Nov 06 2023 18:47:26 GMT+0800 (China Standard Time)

$ pip3 show torch
Name: torch
Version: 2.1.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: ~/py-venv/mace/lib/python3.9/site-packages
Requires: sympy, nvidia-nvtx-cu12, fsspec, nvidia-cusolver-cu12, nvidia-cufft-cu12, triton, filelock, nvidia-cudnn-cu12, typing-extensions, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cusparse-cu12, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, jinja2, networkx, nvidia-cuda-runtime-cu12
Required-by: torch-ema, opt-einsum-fx, mace, e3nn
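
Note that pip3 show reports the torch distribution installed in the virtual environment, which is not necessarily the copy the interpreter imports: directories on PYTHONPATH are searched before the environment's site-packages. The sketch below, using only the standard library, reports where a package would be imported from without importing it. describe_package is a hypothetical helper, and whether a given cluster module sets PYTHONPATH is an assumption to check locally:

```python
import importlib.util
from importlib import metadata
from typing import Dict, Optional


def describe_package(name: str) -> Dict[str, Optional[str]]:
    """Report where `name` would be imported from and its installed version.

    Neither lookup imports the package, so this is safe to run even when
    importing it would fail. Hypothetical helper for illustration only.
    """
    # find_spec resolves a top-level name against sys.path without importing it.
    spec = importlib.util.find_spec(name)
    origin = spec.origin if spec is not None else None
    try:
        # Version of the first matching distribution found on sys.path.
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        version = None
    return {"origin": origin, "version": version}


if __name__ == "__main__":
    # Package names taken from the pip3 show output above.
    for pkg in ("torch", "e3nn", "mace"):
        print(pkg, describe_package(pkg))
```

Running this inside the job script, after all module load lines and after activating the virtual environment, shows what the job itself will import. An origin outside ~/py-venv/mace would suggest that another installation is shadowing the one pip3 show describes.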

Ilyes Batatia · Answer 14 · Mon Nov 06 2023 18:49:05 GMT+0800 (China Standard Time)

I have never tried with 2.1.0. Can you try installing 2.0?

anarber · Answer 15 · Mon Nov 06 2023 18:50:11 GMT+0800 (China Standard Time)

Sure, I'll give it a try. Thanks for your help!

anarber · Answer 16 · Mon Nov 06 2023 19:39:42 GMT+0800 (China Standard Time)

I'm getting the same error with torch 2.0.0.

Ilyes Batatia · Answer 17 · Mon Nov 06 2023 19:58:13 GMT+0800 (China Standard Time)

Can you run the test in test/test_models.py and see if you get the same error? Was your model trained with 2.1? If so, can you try training it with 2.0?

anarber · Answer 18 · Mon Nov 06 2023 21:26:22 GMT+0800 (China Standard Time)

Hi, so running test_models.py does not give any error when run with torch 2.0. Yes, the model was initially trained on 2.1, and I have just trained with 2.0 also without any errors.

Ilyes Batatia · Answer 19 · Mon Nov 06 2023 21:48:11 GMT+0800 (China Standard Time)

And the 2.0 model gives an error when compiling?

anarber · Answer 20 · Mon Nov 06 2023 21:52:17 GMT+0800 (China Standard Time)

Yes, it looks like the same error to me.

/lustre/shared/ucl/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/nn/modules/container.py:487: UserWarning: Setting attributes on ParameterList is not supported.
  warnings.warn("Setting attributes on ParameterList is not supported.")
Traceback (most recent call last):
  File "/home/mmm1217/mace/scripts/create_lammps_model.py", line 10, in <module>
    lammps_model_compiled = jit.compile(lammps_model)
  File "/home/mmm1217/py-venv/mace/lib/python3.9/site-packages/e3nn/util/jit.py", line 99, in compile
    compile(
  File "/home/mmm1217/py-venv/mace/lib/python3.9/site-packages/e3nn/util/jit.py", line 99, in compile
    compile(
  File "/home/mmm1217/py-venv/mace/lib/python3.9/site-packages/e3nn/util/jit.py", line 99, in compile
    compile(
  [Previous line repeated 3 more times]
  File "/home/mmm1217/py-venv/mace/lib/python3.9/site-packages/e3nn/util/jit.py", line 111, in compile
    mod = torch.jit.script(mod, **script_options)
  File "/lustre/shared/ucl/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/jit/_script.py", line 1265, in script
    return torch.jit._recursive.create_script_module(
  File "/lustre/shared/ucl/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/jit/_recursive.py", line 454, in create_script_module
    return create_script_module_impl(nn_module, concrete_type, stubs_fn)
  File "/lustre/shared/ucl/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/jit/_recursive.py", line 520, in create_script_module_impl
    create_methods_and_properties_from_stubs(concrete_type, method_stubs, property_stubs)
  File "/lustre/shared/ucl/apps/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/jit/_recursive.py", line 371, in create_methods_and_properties_from_stubs
    concrete_type._create_methods_and_properties(property_defs, property_rcbs, method_defs, method_rcbs, method_defaults)
RuntimeError:
Only constant Sequential, ModueList, or ModuleDict can be used as an iterable:
  File "/home/mmm1217/py-venv/mace/lib/python3.9/site-packages/mace/modules/symmetric_contraction.py", line 220
        )
        for i, (weight, contract_weights, contract_features) in enumerate(
            zip(self.weights, self.contractions_weighting, self.contractions_features)
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        ):
            c_tensor = contract_weights(

Ilyes Batatia · Answer 21 · Mon Nov 06 2023 21:55:37 GMT+0800 (China Standard Time)

Can you give me a model so I can try to reproduce it?

anarber · Answer 22 · Mon Nov 06 2023 21:58:17 GMT+0800 (China Standard Time)

MACE_CsPbI3.model.zip
Hi, I have attached my model. Thanks!

Ilyes Batatia · Answer 23 · Mon Nov 06 2023 22:11:08 GMT+0800 (China Standard Time)

It works for me on my setup. Here is the compiled model. Could you try starting again from a fresh environment, making sure it is a torch 2.0 model?
MACE_CsPbI3_jit.zip

anarber · Answer 24 · Mon Nov 06 2023 22:15:56 GMT+0800 (China Standard Time)

Just to confirm, does this installation method look correct? I will try one more time from scratch and let you know.

module purge
module load beta-modules
module load gcc-libs/10.2.0
module load python/3.9.6-gnu-10.2.0
python3 -m venv ~/py-venv/mace
source ~/py-venv/mace/bin/activate
pip3 install numpy scipy matplotlib ase opt_einsum prettytable pandas e3nn
pip3 install torch==2.0.0 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
git clone https://github.com/ACEsuit/mace.git --branch develop
pip3 install ./mace

Ilyes Batatia · Answer 25 · Mon Nov 06 2023 22:49:33 GMT+0800 (China Standard Time)

That looks good to me

anarber · Answer 26 · Tue Nov 07 2023 19:48:14 GMT+0800 (China Standard Time)

I managed to compile the LAMMPS model. I will copy my code for installing mace and running the script. There is still a pytorch warning that does not prevent the code from running, but I will include it below just so you are aware. Thanks for your help with this. I am happy for you to close the issue now.

Install mace:

module purge
module load beta-modules
module load gcc-libs/10.2.0
module load python/3.9.6-gnu-10.2.0
module load cuda/11.3.1/gnu-10.2.0
module load cudnn/8.2.1.32/cuda-11.3
module load compilers/gnu/4.9.2

python3 -m venv ~/py-venv/mace
source ~/py-venv/mace/bin/activate

pip3 install numpy scipy matplotlib ase opt_einsum prettytable pandas e3nn
pip3 install torch==2.0.0 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

git clone https://github.com/ACEsuit/mace.git --branch develop
pip3 install ./mace

Run create_lammps_model.py:

module purge
module load beta-modules
module load gcc-libs/10.2.0
module load python/3.9.6-gnu-10.2.0 
module load cuda/11.3.1/gnu-10.2.0
module load cudnn/8.2.1.32/cuda-11.3
module load compilers/gnu/4.9.2

source ~/py-venv/mace/bin/activate

~/py-venv/mace/bin/python3 ~/mace/scripts/create_lammps_model.py ~/MACE_CsPbI3.model

Remaining pytorch warning:

/~/py-venv/mace/lib/python3.9/site-packages/torch/jit/_check.py:172: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn("The TorchScript type system doesn't support "

wcwitt · Answer 27 · Wed Nov 08 2023 04:07:04 GMT+0800 (China Standard Time)

Hi @anarber, thanks - this kind of detailed reporting is really very helpful. Am I interpreting correctly that the main difference in your final attempt is that you loaded the system CUDA and cuDNN modules before creating the virtual environment? And is this on Young or your Oxford cluster?

anarber · Answer 28 · Wed Nov 08 2023 19:08:55 GMT+0800 (China Standard Time)

This is all on YOUNG. I did change how I set up the environment, but after more checks, it looks like the primary issue was with the submission script. It loaded a module version of pytorch (module load pytorch/1.11.0/gpu), which was incompatible with the torch installed in the virtual environment. After removing that module load from the submission script, the create_lammps_model.py script works both in the current mace environment and in the previous version I had initially used. Another member of my group has used the current method to run mace successfully on YOUNG as well.
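
The mismatch described here is visible in the earlier tracebacks: the torch frames come from the cluster's pytorch/1.11.0 tree, while the e3nn and mace frames come from the virtual environment. A rough sketch of how such a mix could be spotted automatically, assuming the traceback is available as text; the function names and the regular expression are illustrative, not part of MACE or PyTorch:

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Matches frames such as: File "<root>/site-packages/<package>/...".
_FRAME = re.compile(r'File "(?P<root>[^"]*/site-packages)/(?P<pkg>[^/"]+)/')

# Two frames copied from the traceback posted earlier in this thread.
SAMPLE = '''
File "/py-venv/mace/lib/python3.9/site-packages/e3nn/util/jit.py", line 111, in compile
File "/pytorch/1.11.0/python3.9.6/cuda/lib/python3.9/site-packages/torch/jit/_script.py", line 1265, in script
'''


def site_packages_by_package(traceback_text: str) -> Dict[str, Set[str]]:
    """Map each package seen in the traceback to its site-packages root(s)."""
    roots: Dict[str, Set[str]] = defaultdict(set)
    for match in _FRAME.finditer(traceback_text):
        roots[match.group("pkg")].add(match.group("root"))
    return dict(roots)


def mixed_environments(traceback_text: str) -> bool:
    """Return True if frames come from more than one site-packages directory."""
    all_roots: Set[str] = set()
    for package_roots in site_packages_by_package(traceback_text).values():
        all_roots |= package_roots
    return len(all_roots) > 1


if __name__ == "__main__":
    for pkg, package_roots in sorted(site_packages_by_package(SAMPLE).items()):
        print(pkg, sorted(package_roots))
    print("mixed environments:", mixed_environments(SAMPLE))
```

More than one site-packages root is only a hint, not proof of a conflict, but in this thread it pointed at the cause.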

Ilyes Batatia · Answer 29 · Wed Nov 08 2023 19:20:07 GMT+0800 (China Standard Time)

Awesome that it works, I am closing now.