carefree0910 / carefree-creator

AI magics meet Infinite draw board.

Home Page:https://creator.nolibox.com/guest

ImportError: /opt/conda/lib/python3.8/site-packages/transformer_engine_extensions.cpython-38-x86_64-linux-gnu.so: undefined symbol

Luoyu-Wang opened this issue

Hi, after installing carefree-creator in Docker, it reports an error when running. How can I solve it?

WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 510.108.03 which has support for CUDA 11.6. This container
was built with CUDA 11.8 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for PyTorch. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

Traceback (most recent call last):
  File "/opt/conda/bin/cfcreator", line 5, in <module>
    from cfcreator.cli import main
  File "/opt/conda/lib/python3.8/site-packages/cfcreator/__init__.py", line 2, in <module>
    from .common import *
  File "/opt/conda/lib/python3.8/site-packages/cfcreator/common.py", line 22, in <module>
    from cflearn.zoo import DLZoo
  File "/opt/conda/lib/python3.8/site-packages/cflearn/__init__.py", line 3, in <module>
    from .schema import *
  File "/opt/conda/lib/python3.8/site-packages/cflearn/schema.py", line 27, in <module>
    from accelerate import Accelerator
  File "/opt/conda/lib/python3.8/site-packages/accelerate/__init__.py", line 3, in <module>
    from .accelerator import Accelerator
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 34, in <module>
    from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
  File "/opt/conda/lib/python3.8/site-packages/accelerate/checkpointing.py", line 24, in <module>
    from .utils import (
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/__init__.py", line 112, in <module>
    from .launch import (
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/launch.py", line 27, in <module>
    from ..utils.other import merge_dicts
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/other.py", line 24, in <module>
    from .transformer_engine import convert_model
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/transformer_engine.py", line 21, in <module>
    import transformer_engine.pytorch as te
  File "/opt/conda/lib/python3.8/site-packages/transformer_engine/__init__.py", line 7, in <module>
    from . import pytorch
  File "/opt/conda/lib/python3.8/site-packages/transformer_engine/pytorch/__init__.py", line 6, in <module>
    from .module import LayerNormLinear
  File "/opt/conda/lib/python3.8/site-packages/transformer_engine/pytorch/module.py", line 16, in <module>
    import transformer_engine_extensions as tex
ImportError: /opt/conda/lib/python3.8/site-packages/transformer_engine_extensions.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c106SymInt8toSymIntENS_13intrusive_ptrINS_14SymIntNodeImplENS_6detail34intrusive_target_default_null_typeIS2_EEEE
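The mangled name at the end of the ImportError encodes which library the missing function belongs to, which can help narrow down the cause. The sketch below is a simplified scanner, not a full demangler: it only pulls the length-prefixed identifiers out of the symbol. The helper name `mangled_identifiers` is made up for this illustration.

```python
import re

# The undefined symbol reported in the ImportError above.
SYMBOL = (
    "_ZN3c106SymInt8toSymIntENS_13intrusive_ptrINS_14SymIntNodeImplENS_6detail"
    "34intrusive_target_default_null_typeIS2_EEEE"
)

# Itanium C++ ABI substitution tokens such as "S_" or "S2_". They point back
# to earlier components of the name and carry no new identifier.
_SUBSTITUTION = re.compile(r"S[0-9A-Z]*_")


def mangled_identifiers(symbol):
    """Return the length-prefixed identifiers found in a mangled C++ name.

    Hypothetical helper for illustration; it ignores most of the mangling
    grammar and only handles what this particular symbol needs.
    """
    names = []
    i = 0
    while i < len(symbol):
        sub = _SUBSTITUTION.match(symbol, i)
        if sub:
            # Skip back-references so their digits are not read as lengths.
            i = sub.end()
        elif symbol[i].isdigit():
            # Read the decimal length, then take that many characters.
            j = i
            while j < len(symbol) and symbol[j].isdigit():
                j += 1
            length = int(symbol[i:j])
            names.append(symbol[j:j + length])
            i = j + length
        else:
            i += 1
    return names


print(mangled_identifiers(SYMBOL))
```

The first identifier is `c10`, the namespace of PyTorch's core C++ library, followed by `SymInt` and `toSymInt`. An undefined symbol there usually suggests that `transformer_engine_extensions` was compiled against a different PyTorch build than the one installed in the container, although the thread does not confirm this.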

Hello! According to your error message, it seems that my Dockerfile is a bit out of date with respect to CUDA:

WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 510.108.03 which has support for CUDA 11.6. This container
was built with CUDA 11.8 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

I'll try to update the Dockerfile and see if it can solve your problem!
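To make the warning quoted above concrete: the driver (510.108.03) supports CUDA 11.6, while the container was built with CUDA 11.8. A minimal sketch of that comparison follows, assuming the simplified rule that a newer minor version within the same major version falls back to Minor Version Compatibility. The function names and the returned labels are made up for this illustration; NVIDIA's compatibility documentation is the authority on the real rules.

```python
def parse_version(text):
    """Turn a 'major.minor' CUDA version string into a tuple of ints."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def cuda_compat_mode(driver_cuda, container_cuda):
    """Classify how a container's CUDA toolkit relates to the host driver.

    Simplified, hypothetical rule set:
    - container version <= driver version: nothing special is needed
    - same major, newer minor: Minor Version Compatibility
    - newer major: Forward Compatibility would be required
    """
    driver = parse_version(driver_cuda)
    container = parse_version(container_cuda)
    if container <= driver:
        return "native"
    if container[0] == driver[0]:
        return "minor-version compatibility"
    return "forward compatibility required"


# Values taken from the warning in this thread.
print(cuda_compat_mode("11.6", "11.8"))
```

With the values from this thread the sketch reports minor-version compatibility, which matches the warning. That mode is a warning rather than an error, so it may not be what triggers the ImportError.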

Hi, I updated the Dockerfile just now, could you use the latest Dockerfile and try again? Thanks!

I ran "docker build -t $TAG_NAME ." again, but it reported errors. Do I have to delete the Dockerfile and rebuild?

What are the errors? And yes, you may need to delete the original Dockerfile, download the latest one, and rebuild!

I rebuilt it, but the errors still seem to be there.

$ docker run --gpus all --rm -p 8123:8123 cfcreator:latest

=============
== PyTorch ==

NVIDIA Release 22.09 (build 44877844)
PyTorch Version 1.13.0a0+d0d6b1f

Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2022 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 510.108.03 which has support for CUDA 11.6. This container
was built with CUDA 11.8 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for PyTorch. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

Traceback (most recent call last):
  File "/opt/conda/bin/cfcreator", line 5, in <module>
    from cfcreator.cli import main
  File "/opt/conda/lib/python3.8/site-packages/cfcreator/__init__.py", line 2, in <module>
    from .common import *
  File "/opt/conda/lib/python3.8/site-packages/cfcreator/common.py", line 22, in <module>
    from cflearn.zoo import DLZoo
  File "/opt/conda/lib/python3.8/site-packages/cflearn/__init__.py", line 3, in <module>
    from .schema import *
  File "/opt/conda/lib/python3.8/site-packages/cflearn/schema.py", line 27, in <module>
    from accelerate import Accelerator
  File "/opt/conda/lib/python3.8/site-packages/accelerate/__init__.py", line 3, in <module>
    from .accelerator import Accelerator
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 34, in <module>
    from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
  File "/opt/conda/lib/python3.8/site-packages/accelerate/checkpointing.py", line 24, in <module>
    from .utils import (
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/__init__.py", line 112, in <module>
    from .launch import (
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/launch.py", line 27, in <module>
    from ..utils.other import merge_dicts
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/other.py", line 24, in <module>
    from .transformer_engine import convert_model
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/transformer_engine.py", line 21, in <module>
    import transformer_engine.pytorch as te
  File "/opt/conda/lib/python3.8/site-packages/transformer_engine/__init__.py", line 7, in <module>
    from . import pytorch
  File "/opt/conda/lib/python3.8/site-packages/transformer_engine/pytorch/__init__.py", line 6, in <module>
    from .module import LayerNormLinear
  File "/opt/conda/lib/python3.8/site-packages/transformer_engine/pytorch/module.py", line 16, in <module>
    import transformer_engine_extensions as tex
ImportError: /opt/conda/lib/python3.8/site-packages/transformer_engine_extensions.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c106SymInt8toSymIntENS_13intrusive_ptrINS_14SymIntNodeImplENS_6detail34intrusive_target_default_null_typeIS2_EEEE

OK, I caught another potential problem:

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for PyTorch. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

which means the shared memory (/dev/shm) of your container is limited to 64MB, which may cause errors. You can try the command it suggests, or manually mount your server's /dev/shm to the container's /dev/shm, and see if it helps!
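One way to confirm whether the 64MB default is actually in effect is to measure /dev/shm from inside the container. A minimal sketch using only the Python standard library follows; the function names and the 64MB threshold check are assumptions made for this illustration, not part of carefree-creator or of NVIDIA's tooling.

```python
import os
import shutil

# Docker's default /dev/shm size, as reported by the NOTE in the log above.
DOCKER_DEFAULT_SHM_MB = 64


def shm_size_mb(path="/dev/shm"):
    """Return the total size of the filesystem mounted at `path`, in MiB."""
    return shutil.disk_usage(path).total / (1024 * 1024)


def looks_like_docker_default(size_mb):
    """True if the size is no larger than Docker's 64MB default."""
    return size_mb <= DOCKER_DEFAULT_SHM_MB


if os.path.isdir("/dev/shm"):
    size = shm_size_mb()
    print(f"/dev/shm is {size:.0f} MiB")
    if looks_like_docker_default(size):
        print("Still at the Docker default; consider --ipc=host or a larger shared memory size.")
```

Running this inside the container before and after adding `--ipc=host` should show whether the flag took effect.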

Seems like another error?

$ docker run --gpus 0 --rm -p 8123:8123 $TAG_NAME:latest --ipc=host --ulimit memlock=-1 --ulimit stack=67108864

=============
== PyTorch ==

NVIDIA Release 22.09 (build 44877844)
PyTorch Version 1.13.0a0+d0d6b1f

WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 510.108.03 which has support for CUDA 11.6. This container
was built with CUDA 11.8 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for PyTorch. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

/opt/nvidia/nvidia_entrypoint.sh: line 49: exec: --: invalid option
exec: usage: exec [-cl] [-a name] [command [arguments ...]] [redirection ...]

Maybe you need to put $TAG_NAME:latest at the end? Right now it still complains about the 64MB shared memory limit, and it says your command is invalid (I'm not sure, I'm not an expert in Docker commands either 🤣)
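The reason the order matters: `docker run` treats everything after the image name as the command to run inside the container, so flags placed after the image never reach Docker. That is why the entrypoint's `exec` received `--ipc=host` and rejected it. A small sketch that assembles the arguments in the right order follows; the helper name `docker_run_argv` is made up for this illustration, and the image tag is the one used earlier in this thread.

```python
import shlex


def docker_run_argv(image, options, command=()):
    """Build a `docker run` argument list with options before the image.

    Hypothetical helper: anything in `command` is passed to the container,
    not to Docker itself.
    """
    return ["docker", "run", *options, image, *command]


# Flags taken from the commands posted in this thread.
options = [
    "--gpus", "0",
    "--rm",
    "-p", "8123:8123",
    "--ipc=host",
    "--ulimit", "memlock=-1",
    "--ulimit", "stack=67108864",
]

argv = docker_run_argv("cfcreator:latest", options)
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv)` without going through a shell, which avoids quoting mistakes.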

Yes, you are right! But the ImportError still seems to exist.

$ docker run --gpus 0 --rm -p 8123:8123 --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 $TAG_NAME:latest

=============
== PyTorch ==

NVIDIA Release 22.09 (build 44877844)
PyTorch Version 1.13.0a0+d0d6b1f

WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 510.108.03 which has support for CUDA 11.6. This container
was built with CUDA 11.8 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

Traceback (most recent call last):
  File "/opt/conda/bin/cfcreator", line 5, in <module>
    from cfcreator.cli import main
  File "/opt/conda/lib/python3.8/site-packages/cfcreator/__init__.py", line 2, in <module>
    from .common import *
  File "/opt/conda/lib/python3.8/site-packages/cfcreator/common.py", line 22, in <module>
    from cflearn.zoo import DLZoo
  File "/opt/conda/lib/python3.8/site-packages/cflearn/__init__.py", line 3, in <module>
    from .schema import *
  File "/opt/conda/lib/python3.8/site-packages/cflearn/schema.py", line 27, in <module>
    from accelerate import Accelerator
  File "/opt/conda/lib/python3.8/site-packages/accelerate/__init__.py", line 3, in <module>
    from .accelerator import Accelerator
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 34, in <module>
    from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
  File "/opt/conda/lib/python3.8/site-packages/accelerate/checkpointing.py", line 24, in <module>
    from .utils import (
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/__init__.py", line 112, in <module>
    from .launch import (
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/launch.py", line 27, in <module>
    from ..utils.other import merge_dicts
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/other.py", line 24, in <module>
    from .transformer_engine import convert_model
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/transformer_engine.py", line 21, in <module>
    import transformer_engine.pytorch as te
  File "/opt/conda/lib/python3.8/site-packages/transformer_engine/__init__.py", line 7, in <module>
    from . import pytorch
  File "/opt/conda/lib/python3.8/site-packages/transformer_engine/pytorch/__init__.py", line 6, in <module>
    from .module import LayerNormLinear
  File "/opt/conda/lib/python3.8/site-packages/transformer_engine/pytorch/module.py", line 16, in <module>
    import transformer_engine_extensions as tex
ImportError: /opt/conda/lib/python3.8/site-packages/transformer_engine_extensions.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c106SymInt8toSymIntENS_13intrusive_ptrINS_14SymIntNodeImplENS_6detail34intrusive_target_default_null_typeIS2_EEEE

Hmm, here's another guess: maybe the CUDA driver on your physical server (i.e., 510.108.03) or something like that is too old for the latest PyTorch. To verify, can you run other PyTorch 2.0 projects on your server?
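A quick way to run that check is to ask PyTorch itself what it was built for and whether it can see the GPU. The sketch below assumes it is run in whichever environment is being tested (inside the container or directly on the server); the function name `torch_report` is made up for this illustration, and it falls back gracefully when PyTorch is not installed.

```python
import importlib.util


def torch_report():
    """Collect basic PyTorch/CUDA facts for a bug report.

    Hypothetical helper; returns {"installed": False} when torch is absent.
    """
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch

    return {
        "installed": True,
        "torch": torch.__version__,
        # CUDA version this PyTorch build was compiled against (None on CPU builds).
        "built_for_cuda": torch.version.cuda,
        # Whether the driver and GPU are usable from this process.
        "cuda_available": torch.cuda.is_available(),
    }


print(torch_report())
```

If `cuda_available` is `True` under a PyTorch 2.0 install on the same driver, the driver is probably not the culprit, and the mismatch between `transformer_engine_extensions` and the installed PyTorch build becomes the more likely suspect.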