takuseno / d3rlpy

An offline deep reinforcement learning library

Home Page: https://takuseno.github.io/d3rlpy


[BUG] Enabling "use_batch_norm" in VectorEncoderFactory(..., use_batch_norm=True, ...) leads to error

wenxuhaskell opened this issue

Describe the bug
When enabling "use_batch_norm" in VectorEncoderFactory(..., use_batch_norm=True, ...), an error occurs while building the model.

I found this error when trying to customize VectorEncoderFactory(). I modified distributed_offline_training.py to reproduce the same error, which should make it easier for you to reproduce and investigate.

Note that the error itself has nothing to do with distributed training.

To Reproduce

In distributed_offline_training.py, make the changes below:

    print(f"device: {device}")

    my_encoder_factory = d3rlpy.models.encoders.VectorEncoderFactory(hidden_units=[128,64,32], use_batch_norm=True)
    # setup algorithm
    cql = d3rlpy.algos.CQLConfig(
        actor_learning_rate=1e-3,
        critic_learning_rate=1e-3,
        alpha_learning_rate=1e-3,
        actor_encoder_factory=my_encoder_factory,
        critic_encoder_factory=my_encoder_factory
    ).create(device=device)

    # prepare dataset

Then run the command below (using 1 process only):

root@:/home/code/xxx# torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=localhost:29400 distributed_offline_training.py

Terminal output
root@:/home/code/xxx# torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=localhost:29400 distributed_offline_training.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Start running on rank=0.
device: cuda:0
2024-01-18 07:01.56 [info ] Signatures have been automatically determined. action_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]) distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1) observation_signature=Signature(dtype=[dtype('float32')], shape=[(3,)]) reward_signature=Signature(dtype=[dtype('float32')], shape=[(1,)])
2024-01-18 07:01.56 [info ] Action-space has been automatically determined. action_space=<ActionSpace.CONTINUOUS: 1> distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 07:01.56 [info ] Action size has been automatically determined. action_size=1 distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 07:01.56 [info ] dataset info dataset_info=DatasetInfo(observation_signature=Signature(dtype=[dtype('float32')], shape=[(3,)]), action_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]), reward_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]), action_space=<ActionSpace.CONTINUOUS: 1>, action_size=1) distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 07:01.56 [info ] Directory is created at d3rlpy_logs/CQL_20240118070156 distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 07:01.56 [debug ] Building models... distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
Traceback (most recent call last):
File "/home/code/xxx/distributed_offline_training.py", line 65, in
main()
File "/home/code/xxx/distributed_offline_training.py", line 51, in main
cql.fit(
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/base.py", line 400, in fit
results = list(
^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/base.py", line 491, in fitter
self.create_impl(observation_shape, action_size)
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/base.py", line 311, in create_impl
self.inner_create_impl(observation_shape, action_size)
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/cql.py", line 137, in inner_create_impl
policy = create_normal_policy(
^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/models/builders.py", line 170, in create_normal_policy
hidden_size = compute_output_size([observation_shape], encoder)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/models/torch/encoders.py", line 288, in compute_output_size
y = encoder(*inputs)
^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/models/torch/encoders.py", line 28, in call
return super().call(x)
^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/models/torch/encoders.py", line 223, in forward
return self._layers(x)
^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/modules/container.py", line 217, in forward
input = module(input)
^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/modules/batchnorm.py", line 171, in forward
return F.batch_norm(
^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/functional.py", line 2448, in batch_norm
_verify_batch_size(input.size())
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/functional.py", line 2416, in _verify_batch_size
raise ValueError("Expected more than 1 value per channel when training, got input size {}".format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 128])
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 162984) of binary: /root/.pyenv/versions/3.11.5/bin/python3.11
Traceback (most recent call last):
File "/root/.pyenv/versions/3.11.5/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

distributed_offline_training.py FAILED
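
For context, this first failure can be reproduced outside d3rlpy: nn.BatchNorm1d in training mode rejects a batch containing a single sample because it cannot compute batch statistics, and compute_output_size probes the encoder with exactly one dummy observation. A minimal sketch in plain PyTorch (the layer sizes are illustrative, not d3rlpy's actual modules):

import torch
import torch.nn as nn

# Toy encoder in the spirit of VectorEncoderFactory(..., use_batch_norm=True):
# a Linear layer followed by BatchNorm1d (sizes are illustrative only).
encoder = nn.Sequential(nn.Linear(3, 128), nn.BatchNorm1d(128), nn.ReLU())

x = torch.rand(1, 3)  # a single dummy observation, like the output-size probe uses

encoder.train()  # training mode is the default while models are being built
try:
    encoder(x)
except ValueError as e:
    # Expected more than 1 value per channel when training, got input size torch.Size([1, 128])
    print(e)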

@wenxuhaskell Hi, thanks for the issue! I think I've fixed this issue in the latest commit: 11adcee. If you pull the latest master, the issue should be resolved. Good catch!
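
For anyone pinned to an older version, one possible workaround (a hypothetical sketch, not necessarily what commit 11adcee does) is to probe the output size with the module temporarily in eval mode, so BatchNorm1d falls back to its running statistics instead of per-batch statistics:

import torch
import torch.nn as nn

def probe_output_size(encoder: nn.Module, observation_shape) -> int:
    """Hypothetical helper: one dummy forward pass in eval mode, so BatchNorm1d
    does not require more than one sample per channel."""
    was_training = encoder.training
    encoder.eval()
    try:
        with torch.no_grad():
            dummy = torch.rand(1, *observation_shape)
            return encoder(dummy).shape[1]
    finally:
        encoder.train(was_training)  # restore the original training/eval mode

encoder = nn.Sequential(nn.Linear(3, 128), nn.BatchNorm1d(128), nn.ReLU())
print(probe_output_size(encoder, (3,)))  # -> 128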

@takuseno The latest master causes a runtime error, but I am not sure whether it is a bug or an inconsistency in my environment (e.g., the PyTorch version).

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [32]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

The experiment code is still the same:

print(f"device: {device}")

    my_encoder_factory = d3rlpy.models.encoders.VectorEncoderFactory(hidden_units=[128,64,32], use_batch_norm=True)
    # setup algorithm
    cql = d3rlpy.algos.CQLConfig(
        actor_learning_rate=1e-3,
        critic_learning_rate=1e-3,
        alpha_learning_rate=1e-3,
        actor_encoder_factory=my_encoder_factory,
        critic_encoder_factory=my_encoder_factory
    ).create(device=device)

    # prepare dataset

The terminal output is pasted below:

root@:/home/code/xxx# torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=localhost:29400 distributed_offline_training.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Start running on rank=0.
device: cuda:0
2024-01-18 14:58.51 [info ] Signatures have been automatically determined. action_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]) distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1) observation_signature=Signature(dtype=[dtype('float32')], shape=[(3,)]) reward_signature=Signature(dtype=[dtype('float32')], shape=[(1,)])
2024-01-18 14:58.51 [info ] Action-space has been automatically determined. action_space=<ActionSpace.CONTINUOUS: 1> distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 14:58.51 [info ] Action size has been automatically determined. action_size=1 distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 14:58.51 [info ] dataset info dataset_info=DatasetInfo(observation_signature=Signature(dtype=[dtype('float32')], shape=[(3,)]), action_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]), reward_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]), action_space=<ActionSpace.CONTINUOUS: 1>, action_size=1) distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 14:58.51 [info ] Directory is created at d3rlpy_logs/CQL_20240118145851 distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 14:58.51 [debug ] Building models... distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 14:58.52 [debug ] Models have been built. distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 14:58.52 [info ] Parameters distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1) params={'observation_shape': [3], 'action_size': 1, 'config': {'type': 'cql', 'params': {'batch_size': 256, 'gamma': 0.99, 'observation_scaler': {'type': 'none', 'params': {}}, 'action_scaler': {'type': 'none', 'params': {}}, 'reward_scaler': {'type': 'none', 'params': {}}, 'actor_learning_rate': 0.001, 'critic_learning_rate': 0.001, 'temp_learning_rate': 0.0001, 'alpha_learning_rate': 0.001, 'actor_optim_factory': {'type': 'adam', 'params': {'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False}}, 'critic_optim_factory': {'type': 'adam', 'params': {'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False}}, 'temp_optim_factory': {'type': 'adam', 'params': {'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False}}, 'alpha_optim_factory': {'type': 'adam', 'params': {'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False}}, 'actor_encoder_factory': {'type': 'vector', 'params': {'hidden_units': [128, 64, 32], 'activation': 'relu', 'use_batch_norm': True, 'dropout_rate': None, 'exclude_last_activation': False, 'last_activation': None}}, 'critic_encoder_factory': {'type': 'vector', 'params': {'hidden_units': [128, 64, 32], 'activation': 'relu', 'use_batch_norm': True, 'dropout_rate': None, 'exclude_last_activation': False, 'last_activation': None}}, 'q_func_factory': {'type': 'mean', 'params': {'share_encoder': False}}, 'tau': 0.005, 'n_critics': 2, 'initial_temperature': 1.0, 'initial_alpha': 1.0, 'alpha_threshold': 10.0, 'conservative_weight': 5.0, 'n_action_samples': 10, 'soft_q_backup': False}}}
Epoch 1/10: 0%| | 0/1000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/code/xxx/distributed_offline_training.py", line 65, in
main()
File "/home/code/xxx/distributed_offline_training.py", line 51, in main
cql.fit(
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/base.py", line 400, in fit
results = list(
^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/base.py", line 527, in fitter
loss = self.update(batch)
^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/base.py", line 828, in update
loss = self._impl.update(torch_batch, self._grad_step)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/torch_utility.py", line 365, in wrapper
return f(self, *args, **kwargs) # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/base.py", line 66, in update
return self.inner_update(batch, grad_step)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/torch/ddpg_impl.py", line 119, in inner_update
metrics.update(self.update_actor(batch, action))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/torch/ddpg_impl.py", line 109, in update_actor
loss.actor_loss.backward()
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/autograd/init.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [32]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 267792) of binary: /root/.pyenv/versions/3.11.5/bin/python3.11
Traceback (most recent call last):
File "/root/.pyenv/versions/3.11.5/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

distributed_offline_training.py FAILED
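
For context, the RuntimeError above is PyTorch's generic version-counter check: a tensor that a still-pending backward pass needs was modified in place (typically by an optimizer step or another in-place update) before .backward() was called; the [32] tensor is plausibly a parameter or running statistic of the 32-unit batch-norm layer. Whether that is exactly what happens inside d3rlpy's actor/critic update order is an assumption, but the general pattern can be reproduced in isolation:

import torch
import torch.nn as nn

# Toy "critic" whose batch-norm parameters get saved for the backward pass
# (names and sizes are illustrative, not d3rlpy's actual models).
critic = nn.Sequential(nn.Linear(3, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Linear(32, 1))
critic_opt = torch.optim.SGD(critic.parameters(), lr=1e-3)

obs = torch.rand(256, 3)

# "Actor" loss that backpropagates through the critic, but is not backpropagated yet.
actor_loss = -critic(obs).mean()

# Update the critic first: optimizer.step() modifies its parameters in place,
# bumping the version counters of tensors the pending actor graph still references.
critic_loss = critic(obs).pow(2).mean()
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()

try:
    actor_loss.backward()
except RuntimeError as e:
    # one of the variables needed for gradient computation has been modified
    # by an inplace operation ...
    print(e)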

Updating to 2.4.0 fixed the bug for me~

@wqp89324 Thanks for the check! There is still a chance that you could get an error, depending on the dataset. Feel free to reopen this issue if there is any further discussion.