ImportError: /opt/conda/lib/python3.8/site-packages/transformer_engine_extensions.cpython-38-x86_64-linux-gnu.so: undefined symbol

Question

ImportError: /opt/conda/lib/python3.8/site-packages/transformer_engine_extensions.cpython-38-x86_64-linux-gnu.so: undefined symbol

Luoyu-Wang opened this issue a year ago · comments

Hi, after installing carefree-creator in docker, it reports an error when running, how to solve it?

WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 510.108.03 which has support for CUDA 11.6. This container
was built with CUDA 11.8 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for PyTorch. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

Traceback (most recent call last):
File "/opt/conda/bin/cfcreator", line 5, in
from cfcreator.cli import main
File "/opt/conda/lib/python3.8/site-packages/cfcreator/init.py", line 2, in
from .common import *
File "/opt/conda/lib/python3.8/site-packages/cfcreator/common.py", line 22, in
from cflearn.zoo import DLZoo
File "/opt/conda/lib/python3.8/site-packages/cflearn/init.py", line 3, in
from .schema import *
File "/opt/conda/lib/python3.8/site-packages/cflearn/schema.py", line 27, in
from accelerate import Accelerator
File "/opt/conda/lib/python3.8/site-packages/accelerate/init.py", line 3, in
from .accelerator import Accelerator
File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 34, in
from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
File "/opt/conda/lib/python3.8/site-packages/accelerate/checkpointing.py", line 24, in
from .utils import (
File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/init.py", line 112, in
from .launch import (
File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/launch.py", line 27, in
from ..utils.other import merge_dicts
File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/other.py", line 24, in
from .transformer_engine import convert_model
File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/transformer_engine.py", line 21, in
import transformer_engine.pytorch as te
File "/opt/conda/lib/python3.8/site-packages/transformer_engine/init.py", line 7, in
from . import pytorch
File "/opt/conda/lib/python3.8/site-packages/transformer_engine/pytorch/init.py", line 6, in
from .module import LayerNormLinear
File "/opt/conda/lib/python3.8/site-packages/transformer_engine/pytorch/module.py", line 16, in
import transformer_engine_extensions as tex
ImportError: /opt/conda/lib/python3.8/site-packages/transformer_engine_extensions.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c106SymInt8toSymIntENS_13intrusive_ptrINS_14SymIntNodeImplENS_6detail34intrusive_target_default_null_typeIS2_EEEE

carefree0910 · Answer 1 · Fri Jun 09 2023 13:03:57 GMT+0800 (China Standard Time)

Hello! According to your error message, it seems that my docker is a bit out of date considering the CUDA:

WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 510.108.03 which has support for CUDA 11.6. This container
was built with CUDA 11.8 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

I'll try to update the Dockerfile and see if it can solve your problem!

carefree0910 · Answer 2 · Fri Jun 09 2023 13:13:24 GMT+0800 (China Standard Time)

Hi, I updated the Dockerfile just now, could you use the latest Dockerfile and try again? Thanks!

Luoyu-Wang · Answer 3 · Fri Jun 09 2023 15:31:08 GMT+0800 (China Standard Time)

I run the "docker build -t $TAG_NAME ."again, but report errors, do I have to delete de docker file and rebuild?

carefree0910 · Answer 4 · Fri Jun 09 2023 16:17:34 GMT+0800 (China Standard Time)

I run the "docker build -t $TAG_NAME ."again, but report errors, do I have to delete de docker file and rebuild?

what are the errors? And yes, you may need to delete the original Dockerfile and download the latest Dockerfile, and rebuild again!

Luoyu-Wang · Answer 5 · Fri Jun 09 2023 19:08:56 GMT+0800 (China Standard Time)

I rebuild again，but the errorS seem to be still there.

$ docker run --gpus all --rm -p 8123:8123 cfcreator:latest

=============
== PyTorch ==

NVIDIA Release 22.09 (build 44877844)
PyTorch Version 1.13.0a0+d0d6b1f

Copyright (c) 2014-2022 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 510.108.03 which has support for CUDA 11.6. This container
was built with CUDA 11.8 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for PyTorch. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

Traceback (most recent call last):
File "/opt/conda/bin/cfcreator", line 5, in
from cfcreator.cli import main
File "/opt/conda/lib/python3.8/site-packages/cfcreator/init.py", line 2, in
from .common import *
File "/opt/conda/lib/python3.8/site-packages/cfcreator/common.py", line 22, in
from cflearn.zoo import DLZoo
File "/opt/conda/lib/python3.8/site-packages/cflearn/init.py", line 3, in
from .schema import *
File "/opt/conda/lib/python3.8/site-packages/cflearn/schema.py", line 27, in
from accelerate import Accelerator
File "/opt/conda/lib/python3.8/site-packages/accelerate/init.py", line 3, in
from .accelerator import Accelerator
File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 34, in
from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
File "/opt/conda/lib/python3.8/site-packages/accelerate/checkpointing.py", line 24, in
from .utils import (
File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/init.py", line 112, in
from .launch import (
File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/launch.py", line 27, in
from ..utils.other import merge_dicts
File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/other.py", line 24, in
from .transformer_engine import convert_model
File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/transformer_engine.py", line 21, in
import transformer_engine.pytorch as te
File "/opt/conda/lib/python3.8/site-packages/transformer_engine/init.py", line 7, in
from . import pytorch
File "/opt/conda/lib/python3.8/site-packages/transformer_engine/pytorch/init.py", line 6, in
from .module import LayerNormLinear
File "/opt/conda/lib/python3.8/site-packages/transformer_engine/pytorch/module.py", line 16, in
import transformer_engine_extensions as tex
ImportError: /opt/conda/lib/python3.8/site-packages/transformer_engine_extensions.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c106SymInt8toSymIntENS_13intrusive_ptrINS_14SymIntNodeImplENS_6detail34intrusive_target_default_null_typeIS2_EEEE

carefree0910 · Answer 6 · Fri Jun 09 2023 21:57:15 GMT+0800 (China Standard Time)

ok, i caught another potential mistake:

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for PyTorch. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

which means you restrict the memory of your docker to be 64MB, which may cause error. You can try the command it suggests you, or manually link your server's /dev/shm to the docker's /dev/shm, and see if it helps!

Luoyu-Wang · Answer 7 · Fri Jun 09 2023 22:24:02 GMT+0800 (China Standard Time)

Seems like another error?

$docker run --gpus 0 --rm -p 8123:8123 $TAG_NAME:latest --ipc=host --ulimit memlock=-1 --ulimit stack=67108864

=============
== PyTorch ==

NVIDIA Release 22.09 (build 44877844)
PyTorch Version 1.13.0a0+d0d6b1f

Copyright (c) 2014-2022 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 510.108.03 which has support for CUDA 11.6. This container
was built with CUDA 11.8 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for PyTorch. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

/opt/nvidia/nvidia_entrypoint.sh: line 49: exec: --: invalid option
exec: usage: exec [-cl] [-a name] [command [arguments ...]] [redirection ...]

carefree0910 · Answer 8 · Fri Jun 09 2023 22:54:11 GMT+0800 (China Standard Time)

Maybe you need to put $TAG_NAME:latest at the end? because now it still complains 64MB ram, and it says your command is invalid (i'm not sure, i'm not an expert in docker commands either 🤣)

Luoyu-Wang · Answer 9 · Fri Jun 09 2023 23:26:51 GMT+0800 (China Standard Time)

Yes! You are right. But the ImportError seems also exist.

$ docker run --gpus 0 --rm -p 8123:8123 --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 $TAG_NAME:latest

=============
== PyTorch ==

NVIDIA Release 22.09 (build 44877844)
PyTorch Version 1.13.0a0+d0d6b1f

Copyright (c) 2014-2022 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 510.108.03 which has support for CUDA 11.6. This container
was built with CUDA 11.8 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

Traceback (most recent call last):
File "/opt/conda/bin/cfcreator", line 5, in
from cfcreator.cli import main
File "/opt/conda/lib/python3.8/site-packages/cfcreator/init.py", line 2, in
from .common import *
File "/opt/conda/lib/python3.8/site-packages/cfcreator/common.py", line 22, in
from cflearn.zoo import DLZoo
File "/opt/conda/lib/python3.8/site-packages/cflearn/init.py", line 3, in
from .schema import *
File "/opt/conda/lib/python3.8/site-packages/cflearn/schema.py", line 27, in
from accelerate import Accelerator
File "/opt/conda/lib/python3.8/site-packages/accelerate/init.py", line 3, in
from .accelerator import Accelerator
File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 34, in
from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
File "/opt/conda/lib/python3.8/site-packages/accelerate/checkpointing.py", line 24, in
from .utils import (
File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/init.py", line 112, in
from .launch import (
File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/launch.py", line 27, in
from ..utils.other import merge_dicts
File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/other.py", line 24, in
from .transformer_engine import convert_model
File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/transformer_engine.py", line 21, in
import transformer_engine.pytorch as te
File "/opt/conda/lib/python3.8/site-packages/transformer_engine/init.py", line 7, in
from . import pytorch
File "/opt/conda/lib/python3.8/site-packages/transformer_engine/pytorch/init.py", line 6, in
from .module import LayerNormLinear
File "/opt/conda/lib/python3.8/site-packages/transformer_engine/pytorch/module.py", line 16, in
import transformer_engine_extensions as tex
ImportError: /opt/conda/lib/python3.8/site-packages/transformer_engine_extensions.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c106SymInt8toSymIntENS_13intrusive_ptrINS_14SymIntNodeImplENS_6detail34intrusive_target_default_null_typeIS2_EEEE

carefree0910 · Answer 10 · Sun Jun 11 2023 10:39:57 GMT+0800 (China Standard Time)

Hmmm, here's my another guess: maybe the CUDA driver on your physics server (i.e., 510.108.03) or something like that is too low for the latest PyTorch. To verify it: can you run other pytorch2.0 projects on your server?

ImportError: /opt/conda/lib/python3.8/site-packages/transformer_engine_extensions.cpython-38-x86_64-linux-gnu.so: undefined symbol

============= == PyTorch ==

============= == PyTorch ==

============= == PyTorch ==

=============
== PyTorch ==

=============
== PyTorch ==

=============
== PyTorch ==