philtabor / Actor-Critic-Methods-Paper-To-Code

why do you add the state_space and action_space?

QasimWani opened this issue · comments

I can't understand why you added the state input (relu) and the action space tensors together. I couldn't find a mention of this in the original paper on DDPG. Would appreciate some help!

state_action_value = F.relu(T.add(state_value, action_value))
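For context, here is a minimal sketch of roughly where a line like that sits in a DDPG-style critic forward pass. The layer names (`fc1`, `fc2`, `action_value`, `q`) and sizes are assumptions based on common implementations, not necessarily the exact code in this repo:

```python
import torch as T
import torch.nn as nn
import torch.nn.functional as F

class CriticSketch(nn.Module):
    def __init__(self, input_dims, n_actions, fc1_dims=400, fc2_dims=300):
        super().__init__()
        self.fc1 = nn.Linear(input_dims, fc1_dims)           # state pathway
        self.fc2 = nn.Linear(fc1_dims, fc2_dims)
        self.action_value = nn.Linear(n_actions, fc2_dims)   # action pathway
        self.q = nn.Linear(fc2_dims, 1)                      # scalar Q output

    def forward(self, state, action):
        state_value = F.relu(self.fc1(state))
        state_value = self.fc2(state_value)                   # [batch, fc2_dims]
        action_value = self.action_value(action)              # [batch, fc2_dims]
        # the line in question: combine both pathways by elementwise addition
        state_action_value = F.relu(T.add(state_value, action_value))
        return self.q(state_action_value)                     # [batch, 1]
```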

So the action value function reflects the value of all possible actions for a given state. In the tabular case, Q learning for example, we would have a table where each row corresponds to a distinct state, and each column corresponds to the value of each action for that state. So, it should have dimensionality [n_states, n_actions].

Here we are dealing with neural networks, so it's a little more abstract. They have a batch dimension, which corresponds to the states from our tabular case (i.e., batch_size -> n_states), and should output something with dimensions [batch_size, n_actions]. We have to take into account the relative value of each action, so how can we do that?

One option would be to concatenate the state and action values, but then you would end up with something that has dimensions [batch_size, n_actions + 1], so it doesn't really map to our tabular case. The physical interpretation of this quantity would also be questionable. What does it mean to append the value of a state to the value of an action? It doesn't really tell you the value of the two taken together; rather, it just tells you the value of each separately.

The addition operator, however, gives the correct dimensionality: we end up with shape [batch_size, n_actions], and it also has a straightforward physical interpretation. The sum of the action and state values represents the value each action adds for a given state. This makes it kind of like an advantage for each action.
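To make the shape argument concrete, here is a quick check (the batch and feature sizes are arbitrary, purely for illustration):

```python
import torch as T

batch_size, width = 64, 300                  # arbitrary sizes for illustration
state_value = T.randn(batch_size, width)     # output of the state pathway
action_value = T.randn(batch_size, width)    # output of the action pathway

added = T.add(state_value, action_value)     # elementwise sum
print(added.shape)                           # torch.Size([64, 300]) -- width unchanged

concatenated = T.cat([state_value, action_value], dim=1)
print(concatenated.shape)                    # torch.Size([64, 600]) -- width doubles
```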

In the end, I think both work out, but I found the addition operator to be more natural. I encourage you to experiment with concatenation to see how it compares.
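If you do want to try the concatenation route, a minimal sketch of a critic that appends the action to the first hidden layer's output might look like this (layer names and sizes are assumptions for illustration, not the repo's code):

```python
import torch as T
import torch.nn as nn
import torch.nn.functional as F

class ConcatCriticSketch(nn.Module):
    def __init__(self, input_dims, n_actions, fc1_dims=400, fc2_dims=300):
        super().__init__()
        self.fc1 = nn.Linear(input_dims, fc1_dims)
        # the action is appended to the hidden state features
        self.fc2 = nn.Linear(fc1_dims + n_actions, fc2_dims)
        self.q = nn.Linear(fc2_dims, 1)

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = T.cat([x, action], dim=1)       # [batch, fc1_dims + n_actions]
        x = F.relu(self.fc2(x))
        return self.q(x)                    # one Q value per sample
```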

Thanks for that; what you're saying makes sense. I'm using the concatenate option in the forward part of the Critic network. Also, based on what I understood, the reason we're adding is that we're deterministically mapping a (state, action) pair to a Q-value. Therefore, we must combine the state and action dimensions at the second fully connected layer (according to the paper).
Could you confirm if that makes sense?
Thanks and appreciate the quick response! :D

I just want to clarify something, because now that I'm rereading what I wrote I don't think it's clear. The output of the critic network should be a single scalar value for each state-action pair, not something with more than one dimension. The critic is evaluating the action the agent selected, and that's a scalar value.

From what you've said, I think your understanding is correct. We are indeed including the action at the second hidden layer of the network, per the paper.
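As a quick sanity check of the scalar-output point, reusing the hypothetical `CriticSketch` class from the earlier sketch with made-up dimensions:

```python
import torch as T

# hypothetical sizes; CriticSketch is the class from the sketch above
critic = CriticSketch(input_dims=8, n_actions=2)
state = T.randn(64, 8)
action = T.randn(64, 2)
print(critic(state, action).shape)   # torch.Size([64, 1]): one scalar Q per (state, action) pair
```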