fbgemm_gpu build failing on A100

Question

liligwu opened this issue a year ago

The following error can be reproduced in the nvcr.io/nvidia/pytorch:22.12-py3 Docker image using a PyTorch nightly build on CUDA 11.7.
FBGEMM commit: bf6c1ef

root@ixt-rack-61:/workspace/FBGEMM/fbgemm_gpu# python setup.py build
['setup.py', 'build']
args: Namespace(cpu_only=False, nvml_lib_path=None, package_name='fbgemm_gpu')
unknown: ['build']
CUDA CUB directory environment variable not set. Using default CUB location.
name: fbgemm_gpu
-- fbgemm_gpu building version: 0.3.1
[103/104] Linking CXX shared module fbgemm_gpu_py.so
FAILED: fbgemm_gpu_py.so
: && /usr/bin/c++ -fPIC -D_GLIBCXX_USE_CXX11_ABI=0 -O3 -DNDEBUG -shared -o fbgemm_gpu_py.so CMakeFiles/fbgemm_gpu_py.dir/codegen/embedding_bounds_check.cu.o CMakeFiles/fbgemm_gpu_py.dir/src/cumem_utils.cu.o CMakeFiles/fbgemm_gpu_py.dir/src/histogram_binning_calibration_ops.cu.o CMakeFiles/fbgemm_gpu_py.dir/src/jagged_tensor_ops.cu.o CMakeFiles/fbgemm_gpu_py.dir/src/layout_transform_ops.cu.o CMakeFiles/fbgemm_gpu_py.dir/src/permute_pooled_embedding_ops.cu.o CMakeFiles/fbgemm_gpu_py.dir/src/permute_pooled_embedding_ops_split.cu.o CMakeFiles/fbgemm_gpu_py.dir/src/quantize_ops.cu.o CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops.cu.o CMakeFiles/fbgemm_gpu_py.dir/src/split_embeddings_cache_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/src/split_embeddings_utils.cu.o CMakeFiles/fbgemm_gpu_py.dir/src/metric_ops.cu.o CMakeFiles/fbgemm_gpu_py.dir/src/embedding_inplace_update.cu.o CMakeFiles/fbgemm_gpu_py.dir/codegen/embedding_forward_split_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/codegen/embedding_forward_quantized_host_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/codegen/embedding_backward_dense_host_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/codegen/embedding_bounds_check_host_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/permute_pooled_embedding_ops_split_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/cpu_utils.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/jagged_tensor_ops_autograd.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/jagged_tensor_ops_meta.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/jagged_tensor_ops_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/input_combine_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/layout_transform_ops_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/quantize_ops_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/embedding_inplace_update_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/codegen/embedding_forward_quantized_host.cpp.o CMakeFiles/fbgemm_gpu_py.dir/codegen/embedding_backward_dense_host.cpp.o 
CMakeFiles/fbgemm_gpu_py.dir/codegen/embedding_bounds_check_host.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/cumem_utils_host.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/layout_transform_ops_gpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/permute_pooled_embedding_ops_gpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/permute_pooled_embedding_ops_split_gpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/quantize_ops_gpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops_gpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/split_embeddings_utils.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/split_table_batched_embeddings.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/metric_ops_host.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/embedding_inplace_update_gpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/merge_pooled_embeddings_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/merge_pooled_embeddings_gpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/src/topology_utils.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_forward_dense_weighted_codegen_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_forward_dense_unweighted_codegen_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_forward_quantized_split_unweighted_codegen_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_forward_quantized_split_weighted_codegen_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_forward_split_weighted_codegen_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_forward_split_unweighted_codegen_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_indice_weights_codegen_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_dense_indice_weights_codegen_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_dense_split_unweighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_dense_split_weighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_adagrad_split_weighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_adagrad_split_unweighted_cuda.cu.o 
CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_adam_split_weighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_adam_split_unweighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_approx_rowwise_adagrad_split_weighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_approx_rowwise_adagrad_split_unweighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_approx_rowwise_adagrad_with_weight_decay_split_weighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_approx_rowwise_adagrad_with_weight_decay_split_unweighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_approx_rowwise_adagrad_with_counter_split_weighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_approx_rowwise_adagrad_with_counter_split_unweighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_approx_sgd_split_weighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_approx_sgd_split_unweighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_lamb_split_weighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_lamb_split_unweighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_lars_sgd_split_weighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_lars_sgd_split_unweighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_partial_rowwise_adam_split_weighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_partial_rowwise_adam_split_unweighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_partial_rowwise_lamb_split_weighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_partial_rowwise_lamb_split_unweighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_rowwise_adagrad_split_weighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_rowwise_adagrad_split_unweighted_cuda.cu.o 
CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_rowwise_adagrad_with_weight_decay_split_weighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_rowwise_adagrad_with_weight_decay_split_unweighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_rowwise_adagrad_with_counter_split_weighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_rowwise_adagrad_with_counter_split_unweighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_rowwise_weighted_adagrad_split_weighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_rowwise_weighted_adagrad_split_unweighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_sgd_split_weighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_sgd_split_unweighted_cuda.cu.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_adagrad.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_adam.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_approx_rowwise_adagrad.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_approx_rowwise_adagrad_with_weight_decay.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_approx_rowwise_adagrad_with_counter.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_approx_sgd.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_lamb.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_lars_sgd.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_partial_rowwise_adam.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_partial_rowwise_lamb.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_rowwise_adagrad.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_rowwise_adagrad_with_weight_decay.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_rowwise_adagrad_with_counter.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_rowwise_weighted_adagrad.cpp.o 
CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_sgd.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_forward_quantized_unweighted_codegen_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_forward_quantized_weighted_codegen_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_dense_split_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_adagrad_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_adagrad_split_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_adam_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_adam_split_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_approx_rowwise_adagrad_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_approx_rowwise_adagrad_split_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_approx_rowwise_adagrad_with_weight_decay_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_approx_rowwise_adagrad_with_weight_decay_split_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_approx_rowwise_adagrad_with_counter_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_approx_rowwise_adagrad_with_counter_split_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_approx_sgd_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_approx_sgd_split_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_lamb_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_lamb_split_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_lars_sgd_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_lars_sgd_split_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_partial_rowwise_adam_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_partial_rowwise_adam_split_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_partial_rowwise_lamb_cpu.cpp.o 
CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_partial_rowwise_lamb_split_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_rowwise_adagrad_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_rowwise_adagrad_split_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_rowwise_adagrad_with_weight_decay_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_rowwise_adagrad_with_weight_decay_split_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_rowwise_adagrad_with_counter_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_rowwise_adagrad_with_counter_split_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_rowwise_weighted_adagrad_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_rowwise_weighted_adagrad_split_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_split_sgd_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/gen_embedding_backward_sgd_split_cpu.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/arm/a64assembler.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/arm/a64builder.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/arm/a64compiler.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/arm/a64emithelper.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/arm/a64formatter.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/arm/a64func.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/arm/a64instapi.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/arm/a64instdb.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/arm/a64operand.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/arm/a64rapass.cpp.o 
CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/arm/armformatter.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/archtraits.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/assembler.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/builder.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/codeholder.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/codewriter.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/compiler.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/constpool.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/cpuinfo.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/emithelper.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/emitter.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/emitterutils.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/environment.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/errorhandler.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/formatter.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/func.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/funcargscontext.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/globals.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/inst.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/jitallocator.cpp.o 
CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/jitruntime.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/logger.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/operand.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/osutils.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/ralocal.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/rapass.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/rastack.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/string.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/support.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/target.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/type.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/virtmem.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/zone.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/zonehash.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/zonelist.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/zonestack.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/zonetree.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/core/zonevector.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/x86/x86assembler.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/x86/x86builder.cpp.o 
CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/x86/x86compiler.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/x86/x86emithelper.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/x86/x86formatter.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/x86/x86func.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/x86/x86instapi.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/x86/x86instdb.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/x86/x86operand.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/third_party/asmjit/src/asmjit/x86/x86rapass.cpp.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/src/EmbeddingSpMDM.cc.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/src/EmbeddingSpMDMNBit.cc.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/src/QuantUtils.cc.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/src/RefImplementations.cc.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/src/RowWiseSparseAdagradFused.cc.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/src/SparseAdagrad.cc.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/src/Utils.cc.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/src/EmbeddingSpMDMAvx2.cc.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/src/QuantUtilsAvx2.cc.o CMakeFiles/fbgemm_gpu_py.dir/workspace/FBGEMM/src/EmbeddingSpMDMAvx512.cc.o -L/usr/local/cuda-11.8/targets/x86_64-linux/lib -Wl,-rpath,/usr/local/cuda-11.8/lib64:/usr/local/lib/python3.8/dist-packages/torch/lib: /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch.so /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so /usr/local/cuda-11.8/lib64/stubs/libcuda.so /usr/local/cuda-11.8/lib64/libnvrtc.so /usr/local/cuda-11.8/lib64/libnvToolsExt.so /usr/local/cuda-11.8/lib64/libcudart.so /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so 
/usr/local/cuda-11.8/lib64/stubs/libnvidia-ml.so -Wl,--no-as-needed,"/usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so" -Wl,--as-needed -lpthread -Wl,--no-as-needed,"/usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so" -Wl,--as-needed /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so /usr/local/cuda-11.8/lib64/libcufft.so /usr/local/cuda-11.8/lib64/libcurand.so /usr/local/cuda-11.8/lib64/libcublas.so /usr/lib/x86_64-linux-gnu/libcudnn.so -Wl,--no-as-needed,"/usr/local/lib/python3.8/dist-packages/torch/lib/libtorch.so" -Wl,--as-needed /usr/local/cuda-11.8/lib64/libnvToolsExt.so /usr/local/cuda-11.8/lib64/libcudart.so -lcudadevrt -lcudart_static -lrt -lpthread -ldl && :
/usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/crti.o: in function `_init':
(.init+0xb): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against undefined symbol `__gmon_start__'
CMakeFiles/fbgemm_gpu_py.dir/codegen/embedding_bounds_check.cu.o: in function `__cudaUnregisterBinaryUtil()':
tmpxft_0004a5f4_00000000-6_embedding_bounds_check.compute_90.cudafe1.cpp:(.text+0x7): relocation truncated to fit: R_X86_64_PC32 against `.bss'
CMakeFiles/fbgemm_gpu_py.dir/codegen/embedding_bounds_check.cu.o: in function `bounds_check_indices_cuda(at::Tensor&, at::Tensor&, at::Tensor&, long, at::Tensor&, c10::optional<at::Tensor>)':
tmpxft_0004a5f4_00000000-6_embedding_bounds_check.compute_90.cudafe1.cpp:(.text+0x2b8): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `vtable for c10::cuda::impl::CUDAGuardImpl' defined in .data.rel.ro._ZTVN3c104cuda4impl13CUDAGuardImplE[_ZTVN3c104cuda4impl13CUDAGuardImplE] section in CMakeFiles/fbgemm_gpu_py.dir/codegen/embedding_bounds_check.cu.o
tmpxft_0004a5f4_00000000-6_embedding_bounds_check.compute_90.cudafe1.cpp:(.text+0xa92): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `c10::TensorImpl::size_custom(long) const' defined in .text._ZNK3c1010TensorImpl11size_customEl[_ZNK3c1010TensorImpl11size_customEl] section in CMakeFiles/fbgemm_gpu_py.dir/codegen/embedding_bounds_check.cu.o
tmpxft_0004a5f4_00000000-6_embedding_bounds_check.compute_90.cudafe1.cpp:(.text+0xae2): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `c10::TensorImpl::size_custom(long) const' defined in .text._ZNK3c1010TensorImpl11size_customEl[_ZNK3c1010TensorImpl11size_customEl] section in CMakeFiles/fbgemm_gpu_py.dir/codegen/embedding_bounds_check.cu.o
tmpxft_0004a5f4_00000000-6_embedding_bounds_check.compute_90.cudafe1.cpp:(.text+0xf8a): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `void bounds_check_indices_kernel<long>(at::GenericPackedTensorAccessor<long, 1ul, at::RestrictPtrTraits, int>, at::GenericPackedTensorAccessor<long, 1ul, at::RestrictPtrTraits, int>, at::GenericPackedTensorAccessor<long, 1ul, at::RestrictPtrTraits, int>, long, at::GenericPackedTensorAccessor<long, 1ul, at::RestrictPtrTraits, int>, fbgemm_gpu::FixedDivisor)' defined in .text._Z27bounds_check_indices_kernelIlEvN2at27GenericPackedTensorAccessorIlLm1ENS0_17RestrictPtrTraitsEiEENS1_IT_Lm1ES2_iEES5_lS3_N10fbgemm_gpu12FixedDivisorE[_Z27bounds_check_indices_kernelIlEvN2at27GenericPackedTensorAccessorIlLm1ENS0_17RestrictPtrTraitsEiEENS1_IT_Lm1ES2_iEES5_lS3_N10fbgemm_gpu12FixedDivisorE] section in CMakeFiles/fbgemm_gpu_py.dir/codegen/embedding_bounds_check.cu.o
tmpxft_0004a5f4_00000000-6_embedding_bounds_check.compute_90.cudafe1.cpp:(.text+0x131a): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `void bounds_check_indices_kernel<int>(at::GenericPackedTensorAccessor<long, 1ul, at::RestrictPtrTraits, int>, at::GenericPackedTensorAccessor<int, 1ul, at::RestrictPtrTraits, int>, at::GenericPackedTensorAccessor<int, 1ul, at::RestrictPtrTraits, int>, long, at::GenericPackedTensorAccessor<long, 1ul, at::RestrictPtrTraits, int>, fbgemm_gpu::FixedDivisor)' defined in .text._Z27bounds_check_indices_kernelIiEvN2at27GenericPackedTensorAccessorIlLm1ENS0_17RestrictPtrTraitsEiEENS1_IT_Lm1ES2_iEES5_lS3_N10fbgemm_gpu12FixedDivisorE[_Z27bounds_check_indices_kernelIiEvN2at27GenericPackedTensorAccessorIlLm1ENS0_17RestrictPtrTraitsEiEENS1_IT_Lm1ES2_iEES5_lS3_N10fbgemm_gpu12FixedDivisorE] section in CMakeFiles/fbgemm_gpu_py.dir/codegen/embedding_bounds_check.cu.o
tmpxft_0004a5f4_00000000-6_embedding_bounds_check.compute_90.cudafe1.cpp:(.text+0x1452): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `c10::TensorImpl::size_custom(long) const' defined in .text._ZNK3c1010TensorImpl11size_customEl[_ZNK3c1010TensorImpl11size_customEl] section in CMakeFiles/fbgemm_gpu_py.dir/codegen/embedding_bounds_check.cu.o
tmpxft_0004a5f4_00000000-6_embedding_bounds_check.compute_90.cudafe1.cpp:(.text+0x14bb): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `vsnprintf@@GLIBC_2.2.5' defined in .text section in /lib/x86_64-linux-gnu/libc.so.6
tmpxft_0004a5f4_00000000-6_embedding_bounds_check.compute_90.cudafe1.cpp:(.text+0x15aa): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `c10::TensorImpl::size_custom(long) const' defined in .text._ZNK3c1010TensorImpl11size_customEl[_ZNK3c1010TensorImpl11size_customEl] section in CMakeFiles/fbgemm_gpu_py.dir/codegen/embedding_bounds_check.cu.o
tmpxft_0004a5f4_00000000-6_embedding_bounds_check.compute_90.cudafe1.cpp:(.text+0x15fa): additional relocation overflows omitted from the output
fbgemm_gpu_py.so: PC-relative offset overflow in PLT entry for `_ZNK3c1010TensorImpl4sizeEl'
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/skbuild/setuptools_wrap.py", line 640, in setup
    cmkr.make(make_args, install_target=cmake_install_target, env=env)
  File "/usr/local/lib/python3.8/dist-packages/skbuild/cmaker.py", line 684, in make
    self.make_impl(clargs=clargs, config=config, source_dir=source_dir, install_target=install_target, env=env)
  File "/usr/local/lib/python3.8/dist-packages/skbuild/cmaker.py", line 715, in make_impl
    raise SKBuildError(

An error occurred while building with CMake.
Command:
cmake --build . --target install --config Release --
Install target:
install
Source directory:
/workspace/FBGEMM/fbgemm_gpu
Working directory:
/workspace/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.8/cmake-build
Please check the install target is valid and see CMake's output for more information.

Shintaro Iwasaki · Answer 1 · Thu Jan 19 2023 10:52:34 GMT+0800 (China Standard Time)

@liligwu Thank you for reporting an issue!

I am still debugging the issue (sorry, compiling FBGEMM on a local laptop takes a long time), but setting "-DTORCH_CUDA_ARCH_LIST=7.0;8.0" is probably needed for compilation; my guess is that the relocation errors happen because the binary becomes too large when it supports every possible CUDA architecture. At least, the following worked on my laptop with a CUDA GPU (though compilation inside Docker should not be affected by the underlying GPU).

It would really help us identify the issue if you could check the following on your A100 machine.

# After installing nvidia-docker2 etc on an NVIDIA GPU machine.
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.12-py3
...
# On a Docker container
git clone https://github.com/pytorch/FBGEMM.git
cd FBGEMM/fbgemm_gpu
# Here the commit ID is bf6c1ef (the same as yours)

git submodule update --init --recursive
pip install -r requirements.txt
# Here, please set "-DTORCH_CUDA_ARCH_LIST=7.0;8.0" for Volta and Ampere
python3 setup.py build "-DTORCH_CUDA_ARCH_LIST=7.0;8.0"
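
For readers adapting the command above to other GPUs, here is a minimal sketch of how the value passed to TORCH_CUDA_ARCH_LIST could be assembled from compute capabilities. The helper name arch_list is hypothetical and not part of FBGEMM or PyTorch; the commented-out lines assume PyTorch is installed and a GPU is visible.

```python
def arch_list(capabilities):
    """Format (major, minor) compute capabilities as a TORCH_CUDA_ARCH_LIST value.

    Hypothetical helper for illustration only; duplicates are dropped and the
    entries are sorted so the result is stable.
    """
    unique = sorted(set(capabilities))
    return ";".join(f"{major}.{minor}" for major, minor in unique)


# Volta (7.0) and Ampere A100 (8.0), matching the flag used in the command above.
value = arch_list([(7, 0), (8, 0)])
print(f'python3 setup.py build "-DTORCH_CUDA_ARCH_LIST={value}"')

# Assumption: with PyTorch installed and a GPU visible, the capabilities
# could be queried instead of hard-coded:
#   import torch
#   caps = [torch.cuda.get_device_capability(i)
#           for i in range(torch.cuda.device_count())]
#   value = arch_list(caps)
```

Listing only the architectures that will actually be used keeps the number of embedded GPU binaries, and therefore the size of fbgemm_gpu_py.so, down.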

Li Li · Answer 2 · Thu Jan 19 2023 12:16:22 GMT+0800 (China Standard Time)

Hi Shintaro, the architecture flag helps. Thank you for your reply and help.

Shintaro Iwasaki · Answer 3 · Fri Jan 20 2023 05:08:16 GMT+0800 (China Standard Time)

@liligwu Thanks for your update. We confirmed that, without "-DTORCH_CUDA_ARCH_LIST=7.0;8.0", the binary becomes too large and linking fails with the relocation errors above. We will either update README.md to clarify that this option is now needed or change CMakeLists.txt to set fewer architectures by default.
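
As background on why binary size matters here: the relocation types in the log, R_X86_64_PC32 and R_X86_64_REX_GOTPCRELX, store a signed 32-bit displacement, so the linker reports "relocation truncated to fit" once a reference and its target end up more than about 2 GiB apart. Below is a minimal sketch of that range check; the function name fits_pc32 and the addresses are hypothetical and only illustrate the arithmetic, not the linker's actual implementation.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def fits_pc32(symbol_addr, place_addr, addend=0):
    """Return True if a PC-relative 32-bit relocation can encode the reference.

    The stored value is symbol + addend - place (S + A - P) and must fit in a
    signed 32-bit field. Hypothetical helper for illustration only.
    """
    displacement = symbol_addr + addend - place_addr
    return INT32_MIN <= displacement <= INT32_MAX


# A reference to a target 1 MiB away fits comfortably.
print(fits_pc32(symbol_addr=0x0010_0000, place_addr=0x0))

# A target 3 GiB away cannot be encoded; this is the situation the linker
# reports as "relocation truncated to fit".
print(fits_pc32(symbol_addr=3 * 2**30, place_addr=0x0))
```

Building for fewer CUDA architectures shrinks the sections being linked, which keeps these displacements within range.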