Question about value network
yuke93 opened this issue · comments
Hi Yuanming,

Thanks for releasing the code for this wonderful project!
I have a question about the value network. In `net.py`, the `new_value` is predicted by observing `fake_output` and `new_states`. Let `s_t` denote `fake_input`; then `fake_output` is `s_{t+1}`. The `new_states` contain the action `a_t` that transfers `s_t` to `s_{t+1}`. Therefore, it seems the code is predicting `Q(s_t, a_{t-1})` and `Q(s_{t+1}, a_t)` rather than `Q(s_t, a_t)` and `Q(s_{t+1}, a_{t+1})`. If so, I am confused about how the policy gradients are calculated (e.g., Eqn. (7) in the paper). I might have gotten something wrong; I'd appreciate it if you could help me clarify this question. Thanks!
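To make the suspected misalignment concrete, here is a minimal sketch (with hypothetical names, not the actual `net.py` code) of the state–action pairing I believe the code computes versus the one `Q(s_t, a_t)` would require:

```python
# Hypothetical illustration of the indexing question above.
# states[t] plays the role of fake_input (s_t), states[t + 1] of
# fake_output (s_{t+1}), and actions[t] is the action a_t (carried
# in new_states) that transfers s_t to s_{t+1}.
states = ["s0", "s1", "s2", "s3"]   # trajectory of states
actions = ["a0", "a1", "a2"]        # a_t maps s_t -> s_{t+1}

# Pairing the code appears to compute: Q(s_{t+1}, a_t),
# i.e. the successor state paired with the action that produced it.
code_pairs = [(states[t + 1], actions[t]) for t in range(len(actions))]

# Pairing I would expect from Eqn. (7): Q(s_t, a_t),
# i.e. each state paired with the action taken in that state.
paper_pairs = [(states[t], actions[t]) for t in range(len(actions))]

print(code_pairs)   # [('s1', 'a0'), ('s2', 'a1'), ('s3', 'a2')]
print(paper_pairs)  # [('s0', 'a0'), ('s1', 'a1'), ('s2', 'a2')]
```

The two pairings differ by a one-step shift in the state index, which is exactly the discrepancy I am asking about.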
Yu Ke