NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.

Home Page: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html


_ZN18transformer_engine6getenvIiEET_RKSsRKS1_ on the latest main branch

leiwen83 opened this issue · comments

Hi,

When I try the example located in docs/examples/te_llama with the latest code, I hit the error below:

/usr/local/lib/python3.10/dist-packages/transformer_engine_extensions.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN18transformer_engine6getenvIiEET_RKSsRKS1_

Same here.

Getting the same as well on release_v1.5. release_v1.4 works fine, though.

@pggPL Could you take a look at this issue?

The issue happens because we recently included common/util/system.h in the PyTorch extension's C++ source files at https://github.com/NVIDIA/TransformerEngine/pull/713/files in order to use the transformer_engine::getenv function. As a result, the PyTorch transformer_engine_extensions module now has to resolve that symbol from the libtransformer_engine.so library at import time.

However, when a user builds TransformerEngine in a local environment (conda or PyPI), the pre-built PyTorch package from either source is compiled with the CXX11_ABI flag set to False, whereas the common library is built with the new CXX11 ABI. The two sides therefore mangle the symbol differently, and importing transformer_engine_extensions fails with an undefined symbol. The NGC PyTorch container builds PyTorch with CXX11_ABI=1, so it does not hit this problem.
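For context, the ABI mismatch is visible in the mangled name itself: under the old libstdc++ ABI, `const std::string&` abbreviates to `RKSs`, while the C++11 ABI spells the string type out inside the `__cxx11` inline namespace, producing a different symbol. A rough Python illustration (a simplified heuristic for this one case, not a real demangler; `uses_old_abi_string` is a hypothetical helper):

```python
# Simplified heuristic for which libstdc++ string ABI a mangled symbol uses.
# In the Itanium C++ ABI, the pre-C++11 std::string abbreviates to "Ss",
# while the C++11 ABI string is nested in the "__cxx11" inline namespace.
def uses_old_abi_string(mangled: str) -> bool:
    return "Ss" in mangled and "__cxx11" not in mangled

# The undefined symbol from this issue: getenv<int> taking a const std::string&.
sym = "_ZN18transformer_engine6getenvIiEET_RKSsRKS1_"
print(uses_old_abi_string(sym))  # True: the extension referenced the old-ABI name
```

Because libtransformer_engine.so exports only the new-ABI spelling of the same function, the dynamic linker finds no match for the old-ABI name the extension asks for.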

A possible fix: compile system.cpp into the extension itself, so it no longer needs to resolve getenv from libtransformer_engine.so. Please let me know if this makes sense, and I can open a PR for it.

TransformerEngine/setup.py

Lines 443 to 449 in bfe21c3

    src_dir = root_path / "transformer_engine" / "pytorch" / "csrc"
    extensions_dir = src_dir / "extensions"
    sources = [
        src_dir / "common.cu",
        src_dir / "ts_fp8_op.cpp",
    ] + \
        _all_files_in_dir(extensions_dir)

possible fix:

    sources = [
        src_dir / "common.cu",
        src_dir / "ts_fp8_op.cpp",
        # We need to compile system.cpp because the extension uses transformer_engine::getenv.
        # This is a workaround to avoid linking directly with libtransformer_engine.so.
        root_path / "transformer_engine" / "common" / "util" / "system.cpp",
    ] + \
    _all_files_in_dir(extensions_dir)

Another possible solution is to define the transformer_engine::getenv function entirely in the header file, so no symbol lookup across the library boundary is needed.

Ah, ABI issues make things challenging. If ABI incompatibility breaks transformer_engine::getenv, there's no reason to believe we can rely on any other C++ interfaces between libtransformer_engine.so and the PyTorch extension. This makes me think the right solution is to make the following refactors:

  • Only use C or header-only C++ functions from the common library in the PyTorch extensions.
  • If a C++ function in the common library can't be made header-only, its source code should be duplicated within the PyTorch extensions.

Ugly.

Similar logic would also apply to the Paddle extension, which has a build system similar to the PyTorch one.

@timmoon10 Thanks for the comments and the proposed solution; I agree with it.
As another option, could we build libtransformer_engine.so with the same CXX11 ABI as PyTorch (reported by torch._C._GLIBCXX_USE_CXX11_ABI) to ensure ABI compatibility?