OpenMM Metal Plugin
This plugin adds the Metal platform that accelerates OpenMM on Metal 3 GPUs. It supports Apple, AMD, and Intel GPUs running macOS Ventura or higher. The current implementation uses Apple's OpenCL compatiblity layer (cl2Metal
) to translate OpenCL kernels to AIR. It focuses on patches for OpenMM's code base that improve performance on macOS. The plugin makes the patches easily accessible to most users, who would otherwise wait for them to be upstreamed. It also adds optimizations for Intel Macs that cannot be upstreamed into the main code base, for various reasons.
The Metal plugin may transition to using the Metal API directly. This would provide greater control over the CPU-side command encoding process, solving a long-standing latency bottleneck for 1000-atom systems. However, it removes mixed and double precision on Intel Macs. FP64 emulation is proven to work on Apple silicon, but is not a priority for molecular dynamics. A backward-compatible OpenCL compilation path may or may not be maintained.
Vendors:
- Apple Mac GPUs are supported in current and future releases.
- Apple iOS GPUs are not working yet, but running on iOS is an end goal.
- AMD and Intel GPUs are in long-term service mode. The v1.0.0 tag on GitHub works, but newer versions are failing tests on Intel Macs.
Performance
Expect a 2-4x speedup for large simulations.
Usage
In Finder, create an empty folder and right-click it. Select New Terminal at Folder
from the menu, which launches a Terminal window. Copy and enter the following commands one-by-one.
conda install -c conda-forge openmm
git clone https://github.com/openmm/openmm
cd openmm && git fetch && git checkout 8.1_branch
cd ../
git clone https://github.com/philipturner/openmm-metal
cd openmm-metal
# requires password
bash build.sh --install --quick-tests
Next, you will benchmark OpenCL against Metal. In your originally empty folder, enter openmm/examples
. Arrange the files by name and locate benchmark.py
. Then, open the file with TextEdit or Xcode. Modify the top of the script as shown below:
from __future__ import print_function
import openmm.app as app
import openmm as mm
import openmm.unit as unit
from datetime import datetime
import argparse
import os
# What you need to add:
mm.Platform.loadPluginsFromDirectory(mm.Platform.getDefaultPluginsDirectory())
# Rest of the code:
def cpuinfo():
"""Return CPU info"""
Press Cmd + S
to save. Back at the Terminal window, type the following commands:
cd ../
cd openmm
cd examples
python3 benchmark.py --test apoa1rf --seconds 15 --platform OpenCL
python3 benchmark.py --test apoa1rf --seconds 15 --platform HIP
OpenMM's current energy minimizer hard-codes checks for the CUDA
, OpenCL
, and HIP
platforms. The Metal backend is currently labeled HIP
everywhere to bypass this limitation. The plugin name will change to Metal
after the next OpenMM version release, which fixes the issue. To prevent source-breaking changes, check for both the HIP
and Metal
backends in your client code.
Sequential Throughput
Due to its dependency on OpenMM 8.0.0, the Metal plugin can't implement an optimization inside the main code base that speeds up small systems. Therefore, there is an environment variable that can force-disable the usage of nearest neighbor lists inside Metal kernels. Turning off the neighbor list can substantially improve simulation speed for systems with under 3000 atoms.
export OPENMM_METAL_USE_NEIGHBOR_LIST=0 # accepted, force-disables neighor list
export OPENMM_METAL_USE_NEIGHBOR_LIST=1 # accepted, forces usage of neighbor list
export OPENMM_METAL_USE_NEIGHBOR_LIST=2 # runtime crash
unset OPENMM_METAL_USE_NEIGHBOR_LIST # accepted, automatically chooses whether to use the list
Scaling behavior (Water Box, Amber Forcefield, PME):
Atoms | Force No List | Force Use List | Auto |
---|---|---|---|
6 | 1380 ns/day | 1120 ns/day | 1380 ns/day |
21 | 1260 ns/day | 1120 ns/day | 1260 ns/day |
96 | 1210 ns/day | 1040 ns/day | 1210 ns/day |
306 | 1010 ns/day | 860 ns/day | 1010 ns/day |
774 | 739 ns/day | 637 ns/day | 739 ns/day |
2661 | 490 ns/day | 438 ns/day | 490 ns/day |
4158 | 347 ns/day | 340 ns/day | 340 ns/day |
The above table is not a great example of the speedup possible by eliminating the sequential throughput bottleneck. Typically, this will provide an order of magnitude speedup. As in, not 18% speedup, but 18x speedup. Why that is not happening needs to be investigated further.
Profiling
The Metal plugin can use the OpenCL queue profiling API to extract performance data about kernels. This API prevents batching of commands into command buffers, and reports the GPU start and end time of each buffer. The results are written to stdout in the JSON format used by https://ui.perfetto.dev.
export OPENMM_METAL_PROFILE_KERNELS=0 # accepted, does not profile
export OPENMM_METAL_PROFILE_KERNELS=1 # accepted, prints data to the console
export OPENMM_METAL_PROFILE_KERNELS=2 # runtime crash
unset OPENMM_METAL_PROFILE_KERNELS # accepted, does not profile
Reducing Energy
By default, energy summation is serialized among a single threadgroup. The reduceEnergy
kernel consumes a significant proportion of execution time for small systems. You can make reduction occur across more than one threadgroup with the following variable. Note that this typically speeds up small systems and slows down large systems.
export OPENMM_METAL_REDUCE_ENERGY_THREADGROUPS=1 # accepted, 1 threadgroup used
export OPENMM_METAL_REDUCE_ENERGY_THREADGROUPS=3 # accepted, 3 threadgroups used
export OPENMM_METAL_REDUCE_ENERGY_THREADGROUPS=-1 # runtime crash
unset OPENMM_METAL_REDUCE_ENERGY_THREADGROUPS # accepted, 1 threadgroup used
Scaling behavior (Water Box, Amber Forcefield, No Cutoff):
- top of table shows number of threadgroups
- cells show time to finish the
reduceEnergy
kernel
Atoms | OpenMM OpenCL | Metal TG=1 | Metal TG=2 | Metal TG=4 | Metal TG=8 |
---|---|---|---|---|---|
96 | ~65 µs | 9 µs | 7 µs | 6 µs | 6 µs |
306 | ~65 µs | 9 µs | 7 µs | - | - |
774 | ~65 µs | 9 µs | 7 µs | - | - |
2661 | ~65 µs | 9 µs | 7 µs | - | - |
4158 | ~65 µs | 9 µs | 7 µs | 6 µs | 6 µs |
You can increase the precision of energy measurements by setting METAL_REDUCE_ENERGY_THREADGROUPS
to a very large number. The GPU splits the energy into a large number of partial sums, which the CPU casts from FP32 -> FP64 and accumulates in mixed precision. An improvement in energy conservation has not yet been observed, because the tested system was too small to occupy many threadgroups. It is theorized to take effect with very large systems.
Scaling
At the several million atom range, OpenMM starts to experience
export OPENMM_METAL_USE_LARGE_BLOCKS=0 # accepted, force-disables large blocks
export OPENMM_METAL_USE_LARGE_BLOCKS=1 # accepted, forces usage of large blocks
export OPENMM_METAL_USE_LARGE_BLOCKS=2 # runtime crash
unset OPENMM_METAL_USE_LARGE_BLOCKS # accepted, uses large blocks
Scaling behavior (Water Box, Amber Forcefield, PME):
- s/100K/500 means "seconds per 100,000 atoms per 500 steps"
Water Box Sizes | Atoms | s/100K/500 (32-blocks) | s/100K/500 (1024-blocks) |
---|---|---|---|
5.0 nm | 12255 | 4.29 | 4.21 |
10.0 nm | 98880 | 2.02 | 2.02 |
15.0 nm | 335025 | 1.95 | 1.90 |
20.0 nm | 792450 | 2.03 | 1.92 |
25.0 nm | 1550160 | 2.35 | 2.09 |
30.0 nm | 2682600 | 2.61 | 2.32 |
Testing
Quick tests:
Apple | AMD | Intel | |
---|---|---|---|
Last tested version | 1.1.0-dev | 1.0.0 | 1.0.0 |
Test | Apple FP32 | AMD FP32 | Intel FP32 |
---|---|---|---|
CMAPTorsion | ✅ | ✅ | ✅ |
CMAPMotionRemover | ✅ | ✅ | ✅ |
CMMotionRemover | ✅ | ✅ | ✅ |
Checkpoints | ✅ | ✅ | ✅ |
CompoundIntegrator | ✅ | ✅ | ✅ |
CustomAngleForce | ✅ | ✅ | ✅ |
CustomBondForce | ✅ | ✅ | ✅ |
CustomCVForce | ✅ | ✅ | ✅ |
CustomCentroidBondForce | ✅ | ✅ | ✅ |
CustomCompoundBondForce | ✅ | ✅ | ✅ |
CustomExternalForce | ✅ | ✅ | ✅ |
CustomGBForce | ✅ | ✅ | ✅ |
CustomHbondForce | ✅ | ✅ | ✅ |
CustomTorsionForce | ✅ | ✅ | ✅ |
DeviceQuery | ✅ | ✅ | ✅ |
FFT | ✅ | ✅ | ❌ |
GBSAOBCForce | ✅ | ✅ | ✅ |
GayBerneForce | ✅ | ✅ | ❌ |
HarmonicAngleForce | ✅ | ✅ | ✅ |
HarmonicBondForce | ✅ | ✅ | ✅ |
MultipleForces | ✅ | ✅ | ✅ |
PeriodicTorsionForce | ✅ | ✅ | ✅ |
RBTorsionForce | ✅ | ✅ | ✅ |
RMSDForce | ✅ | ✅ | ✅ |
Random | ✅ | ✅ | ✅ |
Settle | ✅ | ✅ | ✅ |
Sort | ✅ | ✅ | ✅ |
VariableVerlet | ✅ | ✅ | ✅ |
AmoebaExtrapolatedPolarization | ✅ | ✅ | ❌ |
AmoebaGeneralizedKirkwoodForce | ✅ | ✅ | ❌ |
AmoebaMultipoleForce | ✅ | ✅ | ❌ |
AmoebaTorsionTorsionForce | ✅ | ✅ | ✅ |
HippoNonbondedForce | ✅ | ✅ | ❌ |
WcaDispersionForce | ✅ | ✅ | ✅ |
DrudeForce | ✅ | ✅ | ✅ |
Long tests:
Apple | AMD | Intel | |
---|---|---|---|
Last tested version | 1.1.0-dev | 1.0.0 | 1.0.0 |
Test | Apple FP32 | AMD FP32 | Intel FP32 |
---|---|---|---|
AndersenThermostat | ✅ | ✅ | ✅ |
CustomManyParticleForce | ✅ | ✅ | ✅ |
CustomNonbondedForce | ✅ | ✅ | ✅ |
DispersionPME | ✅ | ✅ | ❌ |
Ewald | ✅ | ✅ | ❌ |
LangevinIntegrator | ✅ | ✅ | ✅ |
LangevinMiddleIntegrator | ✅ | ✅ | ✅ |
LocalEnergyMinimizer | ✅ | ❌ | ✅ |
MonteCarloFlexibleBarostat | ✅ | ✅ | ✅ |
NonbondedForce | ✅ | ✅ | ❌ |
VerletIntegrator | ✅ | ✅ | ✅ |
VirtualSites | ✅ | ✅ | ✅ |
DrudeNoseHoover | ✅ | ✅ | ✅ |
Very long tests:
Apple | AMD | Intel | |
---|---|---|---|
Last tested version | 1.1.0-dev | n/a | n/a |
Test | Apple FP32 | AMD FP32 | Intel FP32 |
---|---|---|---|
BrownianIntegrator | ✅ | - | - |
CustomIntegrator | ✅ | - | - |
MonteCarloAnisotropicBarostat | ✅ | - | - |
MonteCarloBarostat | ✅ | - | - |
NoseHooverIntegrator | ✅ | - | - |
VariableLangevinIntegrator | ✅ | - | - |
DrudeLangevinIntegrator | ✅ | - | - |
DrudeSCFIntegrator | ✅ | - | - |
Roadmap
Releases:
- v1.0.0
- OpenMM 8.0.0
- Initial release, bringing proper support for Mac GPUs
- v1.1.0
- OpenMM 8.1.0
- AMD and Intel go into long-term service
- Remove "HIP" platform name workaround, introduce "Metal"
- Remove double and mixed precision
- v1.2.0
- OpenMM 8.2.0
- openmm/openmm#4346
- openmm/openmm#4348
License
The Metal Platform uses OpenMM API under the terms of the MIT License. A copy of this license may be found in the accompanying file MIT.txt.
The Metal Platform is based on the OpenCL Platform of OpenMM under the terms of the GNU Lesser General Public License. A copy of this license may be found in the accompanying file LGPL.txt. It in turn incorporates the terms of the GNU General Public License, which may be found in the accompanying file GPL.txt.
The Metal Platform uses VkFFT by Dmitrii Tolmachev under the terms of the MIT License. A copy of this license may be found in the accompanying file MIT-VkFFT.txt.