danijar / dreamerv2

Mastering Atari with Discrete World Models

Home Page: https://danijar.com/dreamerv2

Lambda Target Equation

lewisboyd opened this issue · comments

Hi,

I have a question about how you calculate the lambda_target as seen in the equation below.

[image: λ-target equation from the paper]
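For readers without the image, the λ-target in question looks roughly like this (my transcription into the thread's notation, not the paper's exact symbols):

$$
V_t^{\lambda} = r_t + \gamma_t
\begin{cases}
(1-\lambda)\, v(s_{t+1}) + \lambda\, V_{t+1}^{\lambda} & \text{if } t < H \\
v(s_H) & \text{if } t = H
\end{cases}
$$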

I've been implementing it to work directly in the environment, rather than on the model states, to test how it works, and something occurred to me. On your final step, i.e. when t = H, are you not accounting for the reward twice, since the value network is already trained to fold the reward at a state into that state's value? Would it not be more valid to stop the calculation at H-1 and use the final state at step H only for bootstrapping, so that the last target becomes V(s_{H-1}) = r_{H-1} + y_{H-1} * V(s_H)?
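To make that concrete, here is a rough NumPy sketch of the variant I mean (a hypothetical helper, not the repo's code; `values` has one more entry than `rewards`, so `values[H]` = v(s_H) is used only as the bootstrap):

```python
import numpy as np

def lambda_returns(rewards, values, discounts, lam=0.95):
    """Return V^lambda targets for t = 0 .. H-1.

    rewards[t] and discounts[t] are r_t and gamma_t for t = 0 .. H-1;
    values[t] is v(s_t) for t = 0 .. H, so values[H] = v(s_H) only
    bootstraps the final step.
    """
    H = len(rewards)
    returns = np.zeros(H)
    # The recursion bottoms out at v(s_H), so the last target is
    # returns[H-1] = r_{H-1} + gamma_{H-1} * v(s_H); no reward is
    # added on top of the bootstrap value at step H.
    next_return = values[H]
    for t in reversed(range(H)):
        returns[t] = rewards[t] + discounts[t] * (
            (1 - lam) * values[t + 1] + lam * next_return)
        next_return = returns[t]
    return returns
```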

Thanks again,
Lewis

Yes, that's exactly what's happening. You can see that the equation says t < H not t <= H.

My concern is about the second line, which handles the t = H case: as written, the target for the last step could be rewritten as V_H = r_H + y_H * v(s_H), so the reward still appears there.

Sorry, I think the way I wrote that originally was confusing, since I wasn't distinguishing between the lambda target, V, and the value network, v.

Ah, I see. You're right that the equation isn't quite correct for the last time step. The implementation only uses the value at the last step, not the reward, as you suggested.
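In other words, the intended target would be (again my transcription, with the reward dropped from the last step):

$$
V_t^{\lambda} =
\begin{cases}
r_t + \gamma_t \left[ (1-\lambda)\, v(s_{t+1}) + \lambda\, V_{t+1}^{\lambda} \right] & \text{if } t < H \\
v(s_H) & \text{if } t = H
\end{cases}
$$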

Okay cool thanks for clarifying! :)