[Question] Observation problems in Pendulum-v1

Ian-Sy-Zhang opened this issue 3 months ago

Question

From the Gymnasium documentation we know that:
the 0th item in the Observation Space is 'x = cos(theta)'
the 1st item in the Observation Space is 'y = sin(angle)'

I didn't see anything in the documentation saying that 'theta' and 'angle' are two different things.
If theta is the same thing as angle, then x^2 + y^2 should be equal to 1.

import gymnasium as gym
import numpy as np

env = gym.make('Pendulum-v1')


incorrect_count = 0
for _ in range(100):
    state = env.observation_space.sample()

    cos_theta = state[0]
    sin_theta = state[1]

    sum_of_squares = cos_theta**2 + sin_theta**2

    print(f"Sum of squares: {sum_of_squares}")
    if np.isclose(sum_of_squares, 1.0, atol=0.1):
        print("Sample is correct.")
    else:
        print("Sample is incorrect.")
        incorrect_count += 1

print(incorrect_count)

The result shows that in 100 samples, 78 are incorrect.

So the questions are:

1. Does 'theta' mean the same thing as 'angle' in the documentation?
2. If the answer to question 1 is 'yes', then why is sin(theta)^2 + cos(theta)^2 != 1?
3. If theta and angle are the same, is there a problem in the sample function?

Mark Towers · Answer 1 · Wed Feb 28 2024 17:51:50 GMT+0800 (China Standard Time)

1. Yes, theta is the same as angle in the documentation.
2 and 3. You are generating observations with env.observation_space.sample(). All this produces is a point within the bounds of the observation space, not necessarily a valid observation for the environment. Therefore, it does not necessarily satisfy the trigonometric identity cos(theta)^2 + sin(theta)^2 = 1.
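
To illustrate the point, here is a minimal NumPy sketch that imitates drawing each dimension independently and uniformly within its bounds. It does not call Gymnasium: `sample_box` is a hypothetical stand-in, and the bounds are assumed from the Pendulum-v1 documentation ([-1, 1] for x and y, [-8, 8] for the angular velocity).

```python
import numpy as np

# Assumed bounds from the Pendulum-v1 documentation: [x, y, angular velocity].
LOW = np.array([-1.0, -1.0, -8.0], dtype=np.float32)
HIGH = np.array([1.0, 1.0, 8.0], dtype=np.float32)


def sample_box(rng, n):
    """Hypothetical stand-in for a bounded box sample: every dimension is drawn independently."""
    return rng.uniform(LOW, HIGH, size=(n, 3)).astype(np.float32)


rng = np.random.default_rng(0)
samples = sample_box(rng, 10_000)

# x and y are unrelated draws, so x^2 + y^2 lands anywhere in [0, 2].
sum_of_squares = samples[:, 0] ** 2 + samples[:, 1] ** 2
off_circle = ~np.isclose(sum_of_squares, 1.0, atol=0.1)
print(f"Fraction off the unit circle: {off_circle.mean():.2f}")
```

Because x and y are sampled without reference to a common angle, most samples fall off the unit circle, which matches the kind of result reported in the question.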

Correct code

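import gymnasium as gym
import numpy as np

env = gym.make("Pendulum-v1")
obs, _ = env.reset()
# Observations returned by the environment satisfy cos^2 + sin^2 = 1.
assert np.isclose(obs[0]**2 + obs[1]**2, 1)
for _ in range(100):
    # Sample a valid action and step the environment with that same action.
    action = env.action_space.sample()
    obs, _, _, _, _ = env.step(action)
    assert np.isclose(obs[0]**2 + obs[1]**2, 1)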
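
If a random but valid-looking observation is needed without stepping the environment, one option is to sample the underlying state and derive the observation from it. The sketch below assumes the documented layout [cos(theta), sin(theta), angular velocity] and a speed limit of 8. `random_valid_observation` is a hypothetical helper, not part of Gymnasium, and its output is not tied to the state of any actual environment.

```python
import numpy as np


def random_valid_observation(rng, max_speed=8.0):
    """Hypothetical helper: build [cos(theta), sin(theta), theta_dot] from a sampled state."""
    theta = rng.uniform(-np.pi, np.pi)              # sample the angle itself
    theta_dot = rng.uniform(-max_speed, max_speed)  # assumed speed limit of 8
    return np.array([np.cos(theta), np.sin(theta), theta_dot], dtype=np.float32)


rng = np.random.default_rng(0)
obs = random_valid_observation(rng)

# Both trig terms come from the same theta, so the identity holds.
assert np.isclose(obs[0] ** 2 + obs[1] ** 2, 1.0)
```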

Hong'aoZHU0611 · Answer 2 · Tue Apr 30 2024 15:58:47 GMT+0800 (China Standard Time)

May I ask what reward indicates good convergence? My A3C agent finds it hard to surpass -200 (for episodes of no more than 200 steps).

Mark Towers · Answer 3 · Tue Apr 30 2024 17:28:11 GMT+0800 (China Standard Time)

Pendulum is a difficult exploration problem, so you might need to explore the environment more.