how to install on headless AWS server?
jryebread opened this issue
Hi, I am on an AWS server and have successfully run pip install -r requirements.txt, but when I run the train command on a dataset I get an error from nvdiffrast. I think this is because it is a headless server without OpenGL installed.
According to this issue: 3DTopia/LGM#38, I can run nvdiffrast with --force_cuda_rast, but I'm not sure where to add this in the code. Can you help me?
Thank you in advance.
NVIDIA-SMI 535.161.08, Driver Version: 535.161.08, CUDA Version: 12.2
FULL ERROR BELOW
Optimizing output/UNION10EMOEXP_306_eval_600k
Output folder: output/UNION10EMOEXP_306_eval_600k [22/05 22:21:52]
/opt/conda/envs/a/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Traceback (most recent call last):
File "/opt/conda/envs/a/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2107, in _run_ninja_build
subprocess.run(
File "/opt/conda/envs/a/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ec2-user/GaussianAvatars/train.py", line 350, in <module>
training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
File "/home/ec2-user/GaussianAvatars/train.py", line 40, in training
mesh_renderer = NVDiffRenderer()
File "/home/ec2-user/GaussianAvatars/mesh_renderer/__init__.py", line 29, in __init__
self.glctx = dr.RasterizeGLContext() if use_opengl else dr.RasterizeCudaContext()
File "/opt/conda/envs/a/lib/python3.10/site-packages/nvdiffrast/torch/ops.py", line 221, in __init__
self.cpp_wrapper = _get_plugin(gl=True).RasterizeGLStateWrapper(output_db, mode == 'automatic', cuda_device_idx)
File "/opt/conda/envs/a/lib/python3.10/site-packages/nvdiffrast/torch/ops.py", line 118, in _get_plugin
torch.utils.cpp_extension.load(name=plugin_name, sources=source_paths, extra_cflags=opts, extra_cuda_cflags=opts+['-lineinfo'], extra_ldflags=ldflags, with_cuda=True, verbose=False)
File "/opt/conda/envs/a/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1309, in load
return _jit_compile(
File "/opt/conda/envs/a/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1719, in _jit_compile
_write_ninja_file_and_build_library(
File "/opt/conda/envs/a/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1832, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/opt/conda/envs/a/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2123, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'nvdiffrast_plugin_gl': [1/4] /opt/conda/envs/a/bin/x86_64-conda-linux-gnu-c++ -MMD -MF torch_rasterize_gl.o.d -DTORCH_EXTENSION_NAME=nvdiffrast_plugin_gl -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda-12.1/include -isystem /opt/conda/envs/a/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -DNVDR_TORCH -c /opt/conda/envs/a/lib/python3.10/site-packages/nvdiffrast/torch/torch_rasterize_gl.cpp -o torch_rasterize_gl.o
FAILED: torch_rasterize_gl.o
/opt/conda/envs/a/bin/x86_64-conda-linux-gnu-c++ -MMD -MF torch_rasterize_gl.o.d -DTORCH_EXTENSION_NAME=nvdiffrast_plugin_gl -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda-12.1/include -isystem /opt/conda/envs/a/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -DNVDR_TORCH -c /opt/conda/envs/a/lib/python3.10/site-packages/nvdiffrast/torch/torch_rasterize_gl.cpp -o torch_rasterize_gl.o
In file included from /opt/conda/envs/a/lib/python3.10/site-packages/nvdiffrast/torch/../common/rasterize_gl.h:16,
from /opt/conda/envs/a/lib/python3.10/site-packages/nvdiffrast/torch/torch_rasterize_gl.cpp:12:
/opt/conda/envs/a/lib/python3.10/site-packages/nvdiffrast/torch/../common/glutil.h:36:10: fatal error: EGL/egl.h: No such file or directory
36 | #include <EGL/egl.h>
| ^~~~~~~~~~~
compilation terminated.
[2/4] /opt/conda/envs/a/bin/x86_64-conda-linux-gnu-c++ -MMD -MF glutil.o.d -DTORCH_EXTENSION_NAME=nvdiffrast_plugin_gl -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda-12.1/include -isystem /opt/conda/envs/a/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -DNVDR_TORCH -c /opt/conda/envs/a/lib/python3.10/site-packages/nvdiffrast/common/glutil.cpp -o glutil.o
FAILED: glutil.o
/opt/conda/envs/a/bin/x86_64-conda-linux-gnu-c++ -MMD -MF glutil.o.d -DTORCH_EXTENSION_NAME=nvdiffrast_plugin_gl -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda-12.1/include -isystem /opt/conda/envs/a/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -DNVDR_TORCH -c /opt/conda/envs/a/lib/python3.10/site-packages/nvdiffrast/common/glutil.cpp -o glutil.o
In file included from /opt/conda/envs/a/lib/python3.10/site-packages/nvdiffrast/common/glutil.cpp:14:
/opt/conda/envs/a/lib/python3.10/site-packages/nvdiffrast/common/glutil.h:36:10: fatal error: EGL/egl.h: No such file or directory
36 | #include <EGL/egl.h>
| ^~~~~~~~~~~
compilation terminated.
[3/4] /opt/conda/envs/a/bin/x86_64-conda-linux-gnu-c++ -MMD -MF rasterize_gl.o.d -DTORCH_EXTENSION_NAME=nvdiffrast_plugin_gl -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda-12.1/include -isystem /opt/conda/envs/a/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -DNVDR_TORCH -c /opt/conda/envs/a/lib/python3.10/site-packages/nvdiffrast/common/rasterize_gl.cpp -o rasterize_gl.o
FAILED: rasterize_gl.o
/opt/conda/envs/a/bin/x86_64-conda-linux-gnu-c++ -MMD -MF rasterize_gl.o.d -DTORCH_EXTENSION_NAME=nvdiffrast_plugin_gl -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/envs/a/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda-12.1/include -isystem /opt/conda/envs/a/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -DNVDR_TORCH -c /opt/conda/envs/a/lib/python3.10/site-packages/nvdiffrast/common/rasterize_gl.cpp -o rasterize_gl.o
In file included from /opt/conda/envs/a/lib/python3.10/site-packages/nvdiffrast/common/rasterize_gl.h:16,
from /opt/conda/envs/a/lib/python3.10/site-packages/nvdiffrast/common/rasterize_gl.cpp:9:
/opt/conda/envs/a/lib/python3.10/site-packages/nvdiffrast/common/glutil.h:36:10: fatal error: EGL/egl.h: No such file or directory
36 | #include <EGL/egl.h>
| ^~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
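On where such a switch would go: the traceback above shows `mesh_renderer/__init__.py` choosing `dr.RasterizeGLContext()` when `use_opengl` is true and `dr.RasterizeCudaContext()` otherwise, so picking the CUDA rasterizer should avoid building the `nvdiffrast_plugin_gl` extension that fails here. Below is a minimal sketch of wiring a command-line flag to that choice. The flag name `--force_cuda_rast` is borrowed from the LGM issue and is not a confirmed GaussianAvatars option. `build_parser` and `make_raster_context` are hypothetical helpers. Whether `NVDiffRenderer` accepts `use_opengl` as a constructor argument is an assumption based only on the variable name in the traceback.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with a hypothetical switch for headless machines."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--force_cuda_rast",
        action="store_true",
        help="use nvdiffrast's CUDA rasterizer instead of the OpenGL one",
    )
    return parser


def make_raster_context(force_cuda_rast: bool):
    """Create an nvdiffrast context, mirroring the expression in the traceback."""
    # Imported lazily so the parser can be used without nvdiffrast installed.
    import nvdiffrast.torch as dr

    use_opengl = not force_cuda_rast
    return dr.RasterizeGLContext() if use_opengl else dr.RasterizeCudaContext()
```

In `train.py` this might amount to something like `NVDiffRenderer(use_opengl=not args.force_cuda_rast)`, if the constructor does take that argument.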
Given the error EGL/egl.h: No such file or directory, you may try sudo apt-get install libegl1-mesa-dev.
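To check whether the header named in the error is actually present, before or after installing a package, a small sketch like the following may help. The function `find_egl_header` and its default directory list are illustrative assumptions. The conda compiler shown in the log may search a different set of include directories than the system compiler, so finding the file here does not guarantee the build will see it.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_egl_header(
    include_dirs: Iterable[str] = ("/usr/include", "/usr/local/include"),
) -> Optional[Path]:
    """Return the first EGL/egl.h found under the given directories, else None."""
    for directory in include_dirs:
        candidate = Path(directory) / "EGL" / "egl.h"
        if candidate.is_file():
            return candidate
    return None
```

Calling `find_egl_header()` with no arguments checks the two default locations. A `None` result is consistent with the "No such file or directory" failure in the log.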
I am on CentOS (Amazon Linux), and it looks like the equivalent package is here: https://centos.pkgs.org/7/centos-x86_64/mesa-libGL-devel-18.3.4-10.el7.x86_64.rpm.html but installing it did not fix the EGL error. I guess I will need to try an Ubuntu instance instead.
When you tested training the model, was it on a headless server or a normal Linux PC? I'm worried I won't be able to get it to run at all on a headless server.
I often run this repo on a headless remote server with Ubuntu. It is also possible to run the GUI with X11 forwarding.
Usually, I first make sure glxgears runs on the machine, to confirm the OpenGL components are ready, and only then start setting up the repo.
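The pre-check described above could be scripted roughly as follows. This is only a sketch: `opengl_precheck` is a hypothetical helper. It only reports whether `glxgears` is on `PATH` and whether `DISPLAY` is set (as it would be under X11 forwarding). It does not run `glxgears` or prove that rendering works.

```python
import os
import shutil
from typing import Dict, Mapping, Optional


def opengl_precheck(environ: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Report two quick signals that an OpenGL/X11 setup may be usable."""
    env = os.environ if environ is None else environ
    return {
        # glxgears is an X11 demo program; it needs to be installed first.
        "glxgears_on_path": shutil.which("glxgears") is not None,
        # An X display must be reachable for glxgears to open a window.
        "display_set": bool(env.get("DISPLAY")),
    }
```

If both values come back true, running `glxgears` by hand is the actual test the comment above refers to.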
Hmm, what happened to this repo? Why has it been deleted?