SSAGESLabs / PySAGES

Python Suite for Advanced General Ensemble Simulations

Compatibility Issue on `midway3-0372`

yihengwuKP opened this issue

Summary

Running the OpenMM ABF example script alanine-dipeptide.py on midway3-0372 fails with errors, whereas on other nodes under depablo-gpu (e.g. midway3-0301, midway3-0303, midway3-0299) it runs successfully. Perhaps someone with admin access (@ndtrung81) can debug this further.

Procedure to reproduce

ssh midway3-0372    # I have a job running on it; I haven't tested in interactive mode due to node availability
module unload pysages
module unload python
module unload gcc
module load pysages
cd ~/source_code/myPysages/examples/openmm/abf/
python alanine-dipeptide_openmm.py

and it throws:

2024-07-03 17:26:08.092095: W external/xla/xla/stream_executor/gpu/asm_compiler.cc:231] Falling back to the CUDA driver for PTX compilation; ptxas does not support CC 9.0
2024-07-03 17:26:08.092155: W external/xla/xla/stream_executor/gpu/asm_compiler.cc:234] Used ptxas at ptxas
2024-07-03 17:26:08.121780: F external/xla/xla/service/gpu/gpu_executable.cc:322] Check failed: !info.content.empty() 
[1]    1637925 abort (core dumped)  python alanine-dipeptide_openmm.py
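
The first warning already points at the cause: the ptxas found on the path does not support compute capability (CC) 9.0. As a rough sketch, assuming XLA keeps the wording shown in the log above, the rejected capability can be pulled out of a captured log like this (the function name and regular expression are illustrative, not part of JAX or PySAGES):

```python
import re

# Pattern derived only from the warning text in this issue; other XLA versions
# may word the message differently.
_PTXAS_CC_RE = re.compile(r"ptxas does not support CC (\d+)\.(\d+)")


def unsupported_compute_capabilities(log_text):
    """Return the sorted, de-duplicated (major, minor) capabilities that ptxas rejected."""
    return sorted({(int(major), int(minor)) for major, minor in _PTXAS_CC_RE.findall(log_text)})


# The two warning lines from the log above.
log = (
    "2024-07-03 17:26:08.092095: W external/xla/xla/stream_executor/gpu/asm_compiler.cc:231] "
    "Falling back to the CUDA driver for PTX compilation; ptxas does not support CC 9.0\n"
    "2024-07-03 17:26:08.092155: W external/xla/xla/stream_executor/gpu/asm_compiler.cc:234] "
    "Used ptxas at ptxas\n"
)

print(unsupported_compute_capabilities(log))
```

An empty result means the log contains no such warning, in which case the failure likely has a different cause.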

For debugging purposes, here is the output of `python -c "import jax; jax.print_environment_info()"`:

jax:    0.4.13
jaxlib: 0.4.13
numpy:  1.24.4
python: 3.8.17 | packaged by conda-forge | (default, Jun 16 2023, 07:06:00)  [GCC 11.4.0]
jax.devices (4 total, 4 local): [gpu(id=0) gpu(id=1) gpu(id=2) gpu(id=3)]
process_count: 1

$ nvidia-smi
Wed Jul  3 17:27:39 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 PCIe               On  | 00000000:1A:00.0 Off |                    0 |
| N/A   44C    P0              79W / 350W |    922MiB / 81559MiB |     12%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 PCIe               On  | 00000000:1C:00.0 Off |                    0 |
| N/A   48C    P0              84W / 350W |    922MiB / 81559MiB |     17%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 PCIe               On  | 00000000:B4:00.0 Off |                    0 |
| N/A   79C    P0             346W / 350W |   6874MiB / 81559MiB |     99%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 PCIe               On  | 00000000:B6:00.0 Off |                    0 |
| N/A   41C    P0              77W / 350W |    922MiB / 81559MiB |     14%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   1608287      C   python                                      454MiB |
|    0   N/A  N/A   1638225      C   python                                      454MiB |
|    1   N/A  N/A   1608286      C   python                                      454MiB |
|    1   N/A  N/A   1638225      C   python                                      454MiB |
|    2   N/A  N/A    928404      C   python3                                    5628MiB |
|    2   N/A  N/A   1638225      C   python                                      454MiB |
|    3   N/A  N/A   1608291      C   python                                      454MiB |
|    3   N/A  N/A   1638225      C   python                                      454MiB |
+---------------------------------------------------------------------------------------+

OK, midway3-0372 has H100 cards, as shown in the printout. They need CUDA >= 12.0, whereas the default CUDA version is 11.5 when we load the pysages module. That matches the warning that ptxas does not support CC 9.0. Thank you @yzjin for helping out!
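
For anyone hitting this on another node, here is a minimal sketch of a pre-flight check that compares the CUDA release reported by `ptxas --version` against the minimum a GPU needs. The helper names are hypothetical, and the only threshold in the table is the one stated in this thread (CC 9.0 needs CUDA >= 12.0); treat it as an assumption and adjust it for your site.

```python
import re
import shutil
import subprocess

# Assumption taken from this thread: H100 (CC 9.0) needs CUDA >= 12.0.
# Capabilities not listed here are not checked.
MIN_CUDA_FOR_CC = {(9, 0): (12, 0)}


def parse_ptxas_release(version_output):
    """Extract (major, minor) from `ptxas --version` text, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", version_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def toolkit_supports(cc, cuda_release, table=MIN_CUDA_FOR_CC):
    """True if cuda_release meets the minimum listed for cc (unlisted cc passes)."""
    required = table.get(cc)
    return required is None or cuda_release >= required


def ptxas_release_on_path():
    """Return the CUDA release of the ptxas on PATH, or None if it is not found."""
    exe = shutil.which("ptxas")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    return parse_ptxas_release(result.stdout)


if __name__ == "__main__":
    release = ptxas_release_on_path()
    if release is None:
        print("ptxas not found on PATH (or its version could not be parsed)")
    elif not toolkit_supports((9, 0), release):
        print(f"CUDA {release[0]}.{release[1]} is too old for CC 9.0; load a newer CUDA module")
    else:
        print(f"CUDA {release[0]}.{release[1]} passes the CC 9.0 check")
```

Running this after `module load pysages` on the affected node would be expected to flag the 11.5 toolkit, while a node with older GPUs is not checked by this table.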