nikhilbarhate99 / Hierarchical-Actor-Critic-HAC-PyTorch

PyTorch implementation of Hierarchical Actor Critic (HAC) for OpenAI gym environments

action and state offsets?

drozzy opened this issue · comments

I am just curious: what do the action/state offset values mean?

https://github.com/nikhilbarhate99/Hierarchical-Actor-Critic-HAC-PyTorch/blob/master/train.py#L36

I can't seem to figure it out. How do you determine them, for example, for a new environment?

Similarly, how do you determine the clip low/high values for both actions and states? If you could explain those as well, I would appreciate it.

Thank you.

The action and state spaces of many environments are NOT normalised to (-1, 1), but we still need to somehow bound the output values of the neural network. A Tanh activation at the end of the network alone is not enough, because its (-1, 1) range does not match the unnormalised spaces.

So the actions given to the environment are modified accordingly:
action = ( network output (Tanh) * bounds ) + offset

For example, in mountain car continuous env:

the action space is between (-1, 1), and since its mean value [ (1 + (-1)) / 2 ] is 0, we do not require an offset; the bound = 1, since our network already outputs values in (-1, 1), so:

action = ( network output (Tanh) * bounds ) + offset
i.e. action = (network output * 1) + 0
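A minimal sketch of this rescaling in PyTorch (the actor architecture and variable names here are illustrative, not necessarily the exact ones used in this repo):

```python
import torch
import torch.nn as nn

# illustrative bounds/offset for a 1-D action space normalised to (-1, 1)
action_bounds = torch.tensor([1.0])   # half-range of the action space
action_offset = torch.tensor([0.0])   # midpoint of the action space

# toy actor: 2-D state in, 1-D action out, Tanh keeps the raw output in (-1, 1)
actor = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())

state = torch.randn(1, 2)
raw = actor(state)                              # in (-1, 1) because of Tanh
action = raw * action_bounds + action_offset    # rescaled to the env's action range
```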

In HAC, the higher-level policy also needs to output a goal state, so we bound it in a similar way
(here the output goal state is treated as the action of the higher-level policy).

But the state space of mountain car continuous env is defined as [position, velocity] between min value = [-1.2, -0.07] and max value = [0.6, 0.07],

here the position variable (-1.2, 0.6) is NOT normalised to (-1, 1); its half-range [ (0.6 - (-1.2)) / 2 ] is 0.9 and its mean value [ (0.6 + (-1.2)) / 2 ] is -0.3

action = ( network output (Tanh) * bounds ) + offset

for position variable:
action = (network output * 0.9) + (-0.3)
this bounds the value of the action to (-1.2, 0.6)

Similarly, the velocity variable (-0.07, 0.07) has a half-range of 0.07 and its mean value [ (0.07 + (-0.07)) / 2 ] is 0, so,

for velocity variable:
action = (network output * 0.07) + 0
this bounds the value of the action to (-0.07, 0.07)

So the net output (the goal state proposed by the higher level) is bounded between min value = [-1.2, -0.07] and max value = [0.6, 0.07], matching the state space.
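In other words, the bound is the half-range of each dimension and the offset is its midpoint, so both can be computed directly from the environment's spaces. A hedged sketch (assuming a standard Gym environment with Box spaces; not the exact code from train.py):

```python
import gym
import numpy as np

env = gym.make("MountainCarContinuous-v0")

low, high = env.observation_space.low, env.observation_space.high
state_bounds = (high - low) / 2.0   # half-range, e.g. [0.9, 0.07]
state_offset = (high + low) / 2.0   # midpoint,   e.g. [-0.3, 0.0]

# a goal proposed by the higher level (Tanh output in (-1, 1)) is rescaled the same way:
tanh_output = np.array([0.5, -0.5])
goal = tanh_output * state_bounds + state_offset   # lies inside the state space
```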

The clip high/low are simply the max and min values of the action space. We use these to clip the output after adding noise, to ensure that the action values do not exceed the environment bounds. They can be obtained easily from the documentation of the environment.
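For example, a noisy action can be clipped back into the environment's valid range like this (a sketch with illustrative values, not the exact code from this repo):

```python
import numpy as np

# clip values = min/max of the action space, e.g. (-1, 1) for MountainCarContinuous
action_clip_low, action_clip_high = np.array([-1.0]), np.array([1.0])

action = np.array([0.95])
noisy_action = action + np.random.normal(0, 0.1, size=action.shape)   # exploration noise
noisy_action = np.clip(noisy_action, action_clip_low, action_clip_high)  # stays in bounds
```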

Thanks, that helps a lot!

What about the exploration_action_noise and exploration_state_noise values?
Are they derived from action/state spaces somehow?

No, the exploration_action_noise and exploration_state_noise are hyperparameters that need to be tuned by experimentation.
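For reference, these values act as the standard deviation of the Gaussian noise added during exploration, at the action level and at the goal-state level respectively. A hedged sketch (the numbers are illustrative, not the tuned values from this repo):

```python
import numpy as np

exploration_action_noise = np.array([0.1])         # std of noise on primitive actions
exploration_state_noise = np.array([0.02, 0.01])   # std of noise on proposed goal states

action = np.array([0.3])
goal = np.array([-0.5, 0.01])

noisy_action = action + np.random.normal(0, exploration_action_noise)
noisy_goal = goal + np.random.normal(0, exploration_state_noise)
```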

Thanks.