Farama-Foundation / MO-Gymnasium

Multi-objective Gymnasium environments for reinforcement learning

Home Page: http://mo-gymnasium.farama.org/


Analysis: MO-Hopper reward vector

Kallinteris-Andreas opened this issue

This analysis is theoretical and backed up by tests

The multi-objective Hopper's reward vector contains three elements:

1. $r_{forward}$
2. $c_{control}$
3. height (instead of $r_{healthy}$)
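For reference, a minimal sketch of inspecting that vector reward (the `mo-hopper-v4` environment id and the top-level `mo_gymnasium.make` usage are assumptions for illustration):

```python
import mo_gymnasium as mo_gym

# Environment id is an assumption for this sketch.
env = mo_gym.make("mo-hopper-v4")
obs, info = env.reset(seed=0)

# A single random step; the reward is a length-3 vector
# ([r_forward, c_control, height] per the list above) rather than a scalar.
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
print(reward.shape)  # expected: (3,)
```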

I assume element 3 was used as a proxy for $r_{healthy}$, because in environment version 4 (and earlier) $r_{healthy}$ was bugged (https://github.com/Farama-Foundation/Gymnasium/issues/526); otherwise the hopper could not learn to balance.

This has been fixed in version 5, so it may now work with $r_{healthy}$ instead of height.

This is important because it will indicate how more complex environments should be designed, like Ant and Humanoid (which have a healthy reward).

@LucasAlegre

But I see we have the exact same bug in the original environments: the reward is always given in the last time step, even if the agent is unhealthy.

We will release our v5 of the environments as soon as Gymnasium v1.0 is out. Thanks!

Do you want to keep the torso's height as a reward element?

> Do you want to keep the torso's height as a reward element?

Yes, the idea is that then you can have a range of policies that trade-off between jumping forward (x-axis) vs. jumping higher (z-axis).

Is your goal:

1) to learn a policy that maximizes the Gymnasium Hopper return, or
2) to learn a policy that maximizes a different return?

Because if it is the first, the current reward vector does not make sense.

When the weight assigned to the third reward component is greater than zero, it is indeed a different return. The Gymnasium return is recovered when the weight assigned to the third objective is zero. Our goal in MORL is to learn policies for any linear combination of these three rewards.
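As a concrete illustration of that last point, here is a sketch of linearly scalarizing the vector reward; the `LinearReward` wrapper and its `weight` argument follow the MO-Gymnasium README, while the component ordering/signs are assumed to match the list above. Setting the third weight to zero yields a Gymnasium-style scalar return.

```python
import numpy as np
import mo_gymnasium as mo_gym

env = mo_gym.make("mo-hopper-v4")  # env id assumed, as above

# Weight vector over [r_forward, c_control, height] (ordering assumed).
# With the third weight at 0, the scalarized reward ignores height,
# i.e. only the forward and control terms contribute.
weights = np.array([1.0, 1.0, 0.0])

# LinearReward scalarizes the vector reward as dot(weight, reward).
scalar_env = mo_gym.LinearReward(env, weight=weights)

obs, info = scalar_env.reset(seed=0)
obs, reward, terminated, truncated, info = scalar_env.step(scalar_env.action_space.sample())
print(float(reward))  # a single scalar reward per step
```

Sweeping different weight vectors (e.g. putting more mass on the height component) is what produces the range of trade-off policies between jumping forward and jumping higher mentioned above.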

Since #92 explains the mapping, I am closing this issue.