Large indoor environments
ArtiKitten opened this issue
Hi, I'm currently working on reconstructing large indoor environments. From what I understood, the batch size is what's important in the reconstruction of this kind of scene.
I'm testing the setup with the meeting room scene from Tanks and Temples, with a downsample rate of 30 (~370 images).
My first try is on a Quadro8000, so 48GB of memory, and with the recommended config (dict_size=22, dim=8, batch_size=16), it won't even start training. The only way to keep training from failing is to run it with a batch_size of 4, which gives 1.25 it/s. Reaching 500,000 iterations at that rate would take the better part of a week. It then crashed at iteration 10,000, the first checkpoint.
Epoch: 107, total time: 112.668755.
Epoch: 108, total time: 111.452045.
Evaluating with 4 samples.
Training epoch 109: 68%|███████████████████████████▍ | 63/92 [01:52<00:23, 1.23it/s, iter=1e+4]
[2023-11-16 13:49:58,752] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 1100) of binary: /home/devops/miniconda3/envs/neuralangelo-rl/bin/python
Traceback (most recent call last):
File "/home/devops/miniconda3/envs/neuralangelo-rl/bin/torchrun", line 10, in <module>
sys.exit(main())
File "/home/devops/miniconda3/envs/neuralangelo-rl/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/devops/miniconda3/envs/neuralangelo-rl/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/devops/miniconda3/envs/neuralangelo-rl/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/devops/miniconda3/envs/neuralangelo-rl/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/devops/miniconda3/envs/neuralangelo-rl/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
train.py FAILED
-----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-11-16_13:49:58
host : chercheurs28.cdrin.com
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 1100)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 1100
=====================================================
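A note on reading the report above: the negative exit code follows the usual convention that the process was terminated by the signal with that number. The sketch below only decodes the number; the helper name `describe_exitcode` is made up for illustration. SIGKILL cannot be raised by a Python exception. On Linux it often comes from the kernel out-of-memory killer when host RAM (not GPU memory) runs out, but the log alone does not confirm that cause.

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Translate a process exit code into a readable description.

    Negative values mean "terminated by signal number -exitcode".
    """
    if exitcode < 0:
        return f"killed by {signal.Signals(-exitcode).name}"
    return f"exited with status {exitcode}"


print(describe_exitcode(-9))  # killed by SIGKILL
```

If the OOM killer is suspected, the kernel log on the training host is the place to check.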
And for the configuration file,
checkpoint:
  save_epoch: 9999999999
  save_iter: 10000
  save_latest_iter: 9999999999
  save_period: 9999999999
  strict_resume: true
cudnn:
  benchmark: true
  deterministic: false
data:
  name: dummy
  num_images: null
  num_workers: 4
  preload: true
  readjust:
    center:
    - 0.0
    - 0.0
    - 0.0
    scale: 1.0
  root: datasets/meeting_room_ds30
  train:
    batch_size: 4
    image_size:
    - 2174
    - 3931
    subset: null
  type: projects.neuralangelo.data
  use_multi_epoch_loader: true
  val:
    batch_size: 4
    image_size:
    - 300
    - 542
    max_viz_samples: 16
    subset: 4
image_save_iter: 9999999999
inference_args: {}
local_rank: 0
logdir: logs/indoor/meeting_room
logging_iter: 9999999999999
max_epoch: 9999999999
max_iter: 1000000
metrics_epoch: null
metrics_iter: null
model:
  appear_embed:
    dim: 8
    enabled: false
  background:
    enabled: false
    encoding:
      levels: 10
      type: fourier
    encoding_view:
      levels: 3
      type: spherical
    mlp:
      activ: relu
      activ_density: softplus
      activ_density_params: {}
      activ_params: {}
      hidden_dim: 256
      hidden_dim_rgb: 128
      num_layers: 8
      num_layers_rgb: 2
      skip:
      - 4
      skip_rgb: []
    view_dep: true
    white: false
  object:
    rgb:
      encoding_view:
        levels: 3
        type: spherical
      mlp:
        activ: relu_
        activ_params: {}
        hidden_dim: 256
        num_layers: 4
        skip: []
        weight_norm: true
      mode: idr
    s_var:
      anneal_end: 0.1
      init_val: 3.0
    sdf:
      encoding:
        coarse2fine:
          enabled: true
          init_active_level: 8
          step: 5000
        hashgrid:
          dict_size: 22
          dim: 8
          max_logres: 11
          min_logres: 5
          range:
          - -2
          - 2
        levels: 16
        type: hashgrid
      gradient:
        mode: numerical
        taps: 4
      mlp:
        activ: softplus
        activ_params:
          beta: 100
        geometric_init: true
        hidden_dim: 256
        inside_out: true
        num_layers: 1
        out_bias: 0.5
        skip: []
        weight_norm: true
  render:
    num_sample_hierarchy: 4
    num_samples:
      background: 0
      coarse: 64
      fine: 16
    rand_rays: 512
    stratified: true
  type: projects.neuralangelo.model
nvtx_profile: false
optim:
  fused_opt: false
  params:
    lr: 0.001
    weight_decay: 0.01
  sched:
    gamma: 10.0
    iteration_mode: true
    step_size: 9999999999
    two_steps:
    - 300000
    - 400000
    type: two_steps_with_warmup
    warm_up_end: 5000
  type: AdamW
pretrained_weight: null
source_filename: projects/neuralangelo/configs/custom/meeting_room.yaml
speed_benchmark: false
test_data:
  name: dummy
  num_workers: 0
  test:
    batch_size: 1
    is_lmdb: false
    roots: null
  type: imaginaire.datasets.images
timeout_period: 9999999
trainer:
  amp_config:
    backoff_factor: 0.5
    enabled: false
    growth_factor: 2.0
    growth_interval: 2000
    init_scale: 65536.0
  ddp_config:
    find_unused_parameters: false
    static_graph: true
  depth_vis_scale: 0.5
  ema_config:
    beta: 0.9999
    enabled: false
    load_ema_checkpoint: false
    start_iteration: 0
  grad_accum_iter: 1
  image_to_tensorboard: false
  init:
    gain: null
    type: none
  loss_weight:
    curvature: 0.0005
    eikonal: 0.1
    render: 1.0
  type: projects.neuralangelo.trainer
validation_iter: 5000
wandb_image_iter: 10000
wandb_scalar_iter: 100
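One thing worth checking with this config is host memory: it sets preload: true with a training image_size of 2174 x 3931 and roughly 370 images. The sketch below is a rough estimate only. The function name `preload_gib` is made up, and it assumes that every image is held in RAM as a dense 3-channel array; how the data loader actually stores images, and whether worker processes share them, is not verified here.

```python
def preload_gib(num_images: int, height: int, width: int,
                channels: int = 3, bytes_per_value: int = 4) -> float:
    """Approximate RAM (GiB) needed to hold all images as dense arrays.

    bytes_per_value=1 models uint8 storage, 4 models float32 storage.
    Both are assumptions about the loader, not facts from its code.
    """
    total_bytes = num_images * height * width * channels * bytes_per_value
    return total_bytes / 2**30


for name, nbytes in (("uint8", 1), ("float32", 4)):
    gib = preload_gib(370, 2174, 3931, bytes_per_value=nbytes)
    print(f"{name}: {gib:.1f} GiB")
```

Under these assumptions the images alone come to roughly 9 GiB as uint8 or roughly 35 GiB as float32, before counting the model, validation buffers, or checkpoint writing.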
That makes me wonder how we can achieve the same results you show in the paper for large indoor environments. Do you have an example config file that achieves this, and which GPUs does it require?
Maybe I'm missing something and I don't understand what I'm doing? Maybe I just need $150k worth of GPUs?
Thanks for your help!
EDIT:
I had the chance to test on 2x3090 GPUs and I still can't train with a batch_size of 16 or even 8.
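To make the batch-size comparison concrete, here is a small sketch of how many rays feed one optimizer step. It assumes that batch_size counts images per GPU and that each image contributes rand_rays randomly sampled rays (512 in the config above). The function name and the num_gpus and grad_accum_iter arguments are illustrative, and whether gradient accumulation reproduces the quality of a true larger batch is not verified here.

```python
def rays_per_step(batch_size: int, rand_rays: int,
                  num_gpus: int = 1, grad_accum_iter: int = 1) -> int:
    """Rays contributing to one optimizer step, under the stated assumptions."""
    return batch_size * rand_rays * num_gpus * grad_accum_iter


for bs in (4, 8, 16):
    print(f"batch_size={bs}: {rays_per_step(bs, rand_rays=512)} rays")
# batch_size=4: 2048 rays
# batch_size=8: 4096 rays
# batch_size=16: 8192 rays
```

Seen this way, batch_size, rand_rays, the number of GPUs, and trainer.grad_accum_iter all scale the same product, so they are the knobs to trade against each other when memory is the limit.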
"From what I understood, the batch size is what's important in the reconstruction of this kind of scene."
Maybe that's not exactly the point; the recommendations made relate only to the hashgrid hyperparameters dict_size and dim.
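For a sense of scale, the hashgrid size can be bounded with a short calculation. This sketch assumes dict_size is the base-2 logarithm of the number of table entries per level and that every level uses a full table. Coarse levels typically need fewer entries, so the result is an upper bound. The function name and the bytes-per-parameter choices are illustrative.

```python
def hashgrid_params_upper_bound(levels: int, dict_size: int, dim: int) -> int:
    """Upper bound on hashgrid feature count: levels * 2**dict_size * dim."""
    return levels * (2 ** dict_size) * dim


params = hashgrid_params_upper_bound(levels=16, dict_size=22, dim=8)
print(params)  # 536870912

# Memory for the feature table alone, for two assumed storage precisions.
for name, nbytes in (("float16", 2), ("float32", 4)):
    print(f"{name}: {params * nbytes / 2**30:.1f} GiB")
```

Gradients and AdamW optimizer state add further multiples of this figure, which is why dict_size and dim matter so much for GPU memory.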
Your configuration could be the one shown in projects/neuralangelo/configs/tnt.yaml:
_parent_: projects/neuralangelo/configs/base.yaml
model:
    object:
        sdf:
            mlp:
                inside_out: False  # True for Meetingroom.
            encoding:
                coarse2fine:
                    init_active_level: 8
    appear_embed:
        enabled: True
        dim: 8
data:
    type: projects.neuralangelo.data
    root: datasets/tanks_and_temples/Barn
    num_images: 410  # The number of training images.
    train:
        image_size: [835,1500]
        batch_size: 1
        subset:
    val:
        image_size: [300,540]
        batch_size: 1
        subset: 1
        max_viz_samples: 16
In your case, setting inside_out = True may also be helpful.
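The _parent_ key means the file only lists differences from a base config. The sketch below illustrates that idea with a generic recursive dictionary merge. It is not the project's actual config loader; `merge_config` and the trimmed-down dictionaries are hypothetical.

```python
import copy


def merge_config(base: dict, override: dict) -> dict:
    """Return a copy of `base` with `override` applied recursively.

    Nested dicts are merged key by key; any other value replaces the base value.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Trimmed-down stand-ins for a base config and a scene-specific override.
base = {
    "model": {"object": {"sdf": {"mlp": {"inside_out": False, "num_layers": 1}}}},
    "data": {"train": {"batch_size": 1}},
}
override = {"model": {"object": {"sdf": {"mlp": {"inside_out": True}}}}}

cfg = merge_config(base, override)
print(cfg["model"]["object"]["sdf"]["mlp"])  # {'inside_out': True, 'num_layers': 1}
```

Only inside_out changes; sibling keys such as num_layers and untouched branches such as data keep their base values.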
Take a look at this document for experimental details: Supplementary
The high batch size is what they mentioned using in the supplementary material for the project: 16 for T&T.
I trained the meeting room scene on a 2x3090 setup over the whole weekend, for a total of 70h (250,000 iterations).
dict_size=22
dim=8
batch_size=2
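As a quick check on those numbers, the sketch below derives the average throughput from the reported run and projects the time to 500,000 iterations. It assumes the rate stays constant, and the function names are illustrative.

```python
def its_per_sec(iterations: int, hours: float) -> float:
    """Average iterations per second over a run of the given length."""
    return iterations / (hours * 3600)


def days_needed(iterations: int, rate: float) -> float:
    """Days required to run `iterations` at `rate` iterations per second."""
    return iterations / rate / 86_400


rate = its_per_sec(250_000, 70)
print(f"{rate:.2f} it/s")                                  # 0.99 it/s
print(f"{days_needed(500_000, rate):.1f} days for 500k")   # 5.8 days for 500k
```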
Since I wasn't working during the weekend, I didn't notice that the model stopped improving at around 100k iterations.
As we can see, no difference after 100k.
Actually, I found what you mentioned in the "A. Additional Hyper-parameter" section:
"For the DTU benchmark, we follow prior work [14–16] and use a batch size of 1. For the Tanks and Temples dataset, we use a batch size of 16. We use the marching cubes algorithm [5] to convert predicted SDF to triangular meshes. The marching cubes resolution is set to 512 for the DTU benchmark following prior work [1, 14–16] and 2048 for the Tanks and Temples dataset."
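The two marching cubes resolutions differ more than they may appear to, because the number of SDF samples grows with the cube of the resolution. The sketch below assumes a dense grid of float32 values held in memory at once. An actual extraction script may evaluate the grid block by block, so treat the figures as a scale comparison rather than a requirement. The function name is illustrative.

```python
def sdf_grid_gib(resolution: int, bytes_per_value: int = 4) -> float:
    """Memory (GiB) of a dense resolution**3 grid of scalar SDF values."""
    return resolution ** 3 * bytes_per_value / 2**30


print(sdf_grid_gib(512))    # 0.5
print(sdf_grid_gib(2048))   # 32.0
print(2048 ** 3 // 512 ** 3)  # 64
```

Going from 512 to 2048 therefore means 64 times as many SDF evaluations.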
There are differences after 100k iterations, but perhaps not significant ones.
If I were in your position, I would keep the configuration you've already used and also incorporate the adjustments I mentioned earlier regarding the signed distance function (SDF).
Please keep me updated on your results.
Best regards, Lucas.