_ZN18transformer_engine6getenvIiEET_RKSsRKS1_ on the latest main branch
leiwen83 opened this issue · comments
Hi,
When I try the example located in docs/examples/te_llama with the latest code, I get the error below:
/usr/local/lib/python3.10/dist-packages/transformer_engine_extensions.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN18transformer_engine6getenvIiEET_RKSsRKS1_
Same here.
Getting the same as well on release_v1.5. release_v1.4 works fine tho.
@pggPL Could you take a look at this issue?
The issue happens because we recently included common/util/system.h in the PyTorch C++ source files (https://github.com/NVIDIA/TransformerEngine/pull/713/files) to use the transformer_engine::getenv function. As a result, the PyTorch transformer_engine_extensions module now looks up symbols in the libtransformer_engine.so library.
However, if the user builds TransformerEngine in a local environment (conda or PyPI), the pre-built PyTorch from either source has CXX11_ABI set to False, whereas the common library is built with the new CXX11_ABI. As a result, importing transformer_engine_extensions fails with an undefined symbol error. The NGC PyTorch container uses CXX11_ABI=1, so there is no problem there.
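The mismatch is visible in the mangled name itself. In the Itanium C++ ABI, `Ss` abbreviates the old `std::string`, while the new libstdc++ ABI places `std::string` in the `std::__cxx11` namespace, so the missing symbol is a request for the old-ABI variant of `getenv<int>`. Below is a rough sketch of that check; the function name `string_abi_of` is made up for illustration, and the substring test is a heuristic, not a real demangler.

```python
def string_abi_of(mangled_symbol):
    """Guess which libstdc++ string ABI a mangled symbol was compiled against.

    Heuristic only: this looks for well-known substrings and does not parse
    the mangled name, so unusual identifiers could fool it.
    """
    # New ABI: std::string is std::__cxx11::basic_string<...>.
    if "__cxx11" in mangled_symbol:
        return "new"
    # Old ABI: std::string is abbreviated as "Ss" in the Itanium mangling.
    if "Ss" in mangled_symbol:
        return "old"
    # No std::string in the signature (or nothing we can recognize).
    return "unknown"


# The symbol from the error message: the extension asks for the old-ABI variant.
missing = "_ZN18transformer_engine6getenvIiEET_RKSsRKS1_"

# Expected shape of the same function built with the new ABI (written by hand
# for illustration, not taken from an actual libtransformer_engine.so).
new_abi_variant = (
    "_ZN18transformer_engine6getenvIiEET_"
    "RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKS1_"
)

print(string_abi_of(missing))
print(string_abi_of(new_abi_variant))
```

Since the two names differ, the dynamic linker treats them as unrelated symbols, which is why the import fails even though the library does define a `getenv<int>`.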
The following could be a possible fix: compile system.cpp as part of the extension so that it does not have to link directly against libtransformer_engine.so. Please let me know if this makes sense, and I can open a PR for it.
Lines 443 to 449 in bfe21c3
possible fix:
sources = [
    src_dir / "common.cu",
    src_dir / "ts_fp8_op.cpp",
    # We need to compile system.cpp because the extension uses transformer_engine::getenv.
    # This is a workaround to avoid linking directly with libtransformer_engine.so.
    root_path / "transformer_engine" / "common" / "util" / "system.cpp",
] + _all_files_in_dir(extensions_dir)
Another possible solution is to define the transformer_engine::getenv function within the header file.
Ah, ABI issues make things challenging. If ABI incompatibility breaks transformer_engine::getenv, there's no reason to believe we can rely on any other C++ interfaces between libtransformer_engine.so and the PyTorch extension. This makes me think the right solution is to make the following refactors:
- Only use C or header-only C++ functions from the common library in the PyTorch extensions.
- If a C++ function in the common library can't be made header-only, its source code should be duplicated within the PyTorch extensions.
Ugly.
Similar logic would also apply to the Paddle extension, whose build system is similar to PyTorch's.
@timmoon10 Thanks for the comments and proposed solution, I agree with it.
Another possible solution: could we build libtransformer_engine.so with the same CXX11_ABI as PyTorch (torch._C._GLIBCXX_USE_CXX11_ABI) to ensure ABI compatibility?
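A minimal sketch of how that could look in a build script, assuming the common library's compiler flags can be extended from Python. The helper names `detect_torch_cxx11_abi`, `cxx11_abi_define` and `common_library_cxx_flags` are hypothetical and not part of TransformerEngine; `_GLIBCXX_USE_CXX11_ABI` is the libstdc++ macro that selects the string ABI.

```python
def detect_torch_cxx11_abi():
    """Return PyTorch's CXX11 ABI setting, or None if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return bool(torch._C._GLIBCXX_USE_CXX11_ABI)


def cxx11_abi_define(use_cxx11_abi):
    """Build the compiler define that selects the libstdc++ string ABI."""
    return f"-D_GLIBCXX_USE_CXX11_ABI={int(bool(use_cxx11_abi))}"


def common_library_cxx_flags(default_abi=True):
    """Hypothetical helper: extra C++ flags for building libtransformer_engine.so.

    Follows PyTorch's ABI when PyTorch is importable; otherwise falls back to
    default_abi (assumed here to be the new ABI).
    """
    abi = detect_torch_cxx11_abi()
    if abi is None:
        abi = default_abi
    return [cxx11_abi_define(abi)]


print(common_library_cxx_flags())
```

With a pre-built PyTorch from conda or PyPI as described above, this would be expected to yield the `=0` define, so the common library and the extension would agree on the `std::string` layout.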