Farama-Foundation / Gymnasium

An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)

Home Page:https://gymnasium.farama.org

[Question] Observation problems in Pendulum-v1

Ian-Sy-Zhang opened this issue · comments

Question

From the Gymnasium documentation we know that:
the 0th item in the observation space is 'x = cos(theta)'
the 1st item in the observation space is 'y = sin(angle)'

I didn't see anything in the documentation saying that 'theta' and 'angle' are two different things.
If theta is the same as angle, then x^2 + y^2 should equal 1.

import gymnasium as gym
import numpy as np

env = gym.make('Pendulum-v1')


incorrect_count = 0
for _ in range(100):
    state = env.observation_space.sample()

    cos_theta = state[0]
    sin_theta = state[1]

    sum_of_squares = cos_theta**2 + sin_theta**2

    print(f"Sum of squares: {sum_of_squares}")
    if np.isclose(sum_of_squares, 1.0, atol=0.1):
        print("Sample is correct.")
    else:
        print("Sample is incorrect.")
        incorrect_count += 1

print(incorrect_count)

The result shows that in 100 samples, 78 are incorrect.

So the questions are:

  1. Is 'theta' defined the same way as 'angle' in the documentation?
  2. If the answer to question 1 is 'yes', then why is sin(\theta)^2 + cos(\theta)^2 != 1?
  3. If theta and angle are the same, is there a problem in the sample function?

  1. Yes, theta is the same as angle in the documentation.
  2 and 3. You are generating observations with env.observation_space.sample(), but this only produces a point within the bounds of the observation space, not necessarily a valid observation for the environment. The sampled values therefore don't necessarily satisfy the trigonometric identity.
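
To illustrate the point above without depending on Gymnasium itself, here is a minimal pure-Python sketch. It assumes that the first two observation components are each bounded in [-1, 1] and that sampling from the space draws each component independently and uniformly within its bounds. The helper names (`sample_box`, `sample_from_angle`, `fraction_on_unit_circle`) are made up for this illustration and are not part of any library.

```python
import math
import random


def sample_box(rng):
    # Assumption: mimic sampling a bounded box by drawing x and y
    # independently and uniformly from [-1, 1].
    return rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)


def sample_from_angle(rng):
    # Draw one angle, then derive both components from it,
    # which is how a real pendulum state produces (cos, sin).
    theta = rng.uniform(-math.pi, math.pi)
    return math.cos(theta), math.sin(theta)


def fraction_on_unit_circle(sampler, n=10_000, tol=0.1, seed=0):
    # Fraction of samples whose x^2 + y^2 lies within `tol` of 1.
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = sampler(rng)
        if abs(x * x + y * y - 1.0) <= tol:
            hits += 1
    return hits / n


box_fraction = fraction_on_unit_circle(sample_box)
angle_fraction = fraction_on_unit_circle(sample_from_angle)

print(f"independent box sampling: {box_fraction:.1%} near the unit circle")
print(f"angle-derived sampling:   {angle_fraction:.1%} near the unit circle")
```

Only the angle-derived pairs are guaranteed to land on the unit circle. Independently drawn pairs fill the whole square, so most of them fail the check, which matches the behaviour reported in the question.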

Correct code

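import gymnasium as gym
import numpy as np

env = gym.make("Pendulum-v1")
obs, _ = env.reset()
# Observations returned by the environment satisfy cos^2 + sin^2 = 1
assert np.isclose(obs[0]**2 + obs[1]**2, 1)
for _ in range(100):
    # Sample a random valid action and use it to step the environment
    action = env.action_space.sample()

    obs, _, _, _, _ = env.step(action)
    assert np.isclose(obs[0]**2 + obs[1]**2, 1)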

May I ask which rewards give the best convergence? My A3C agent finds it hard to surpass -200 (with episodes of at most 200 steps).

Pendulum is a difficult exploration problem, so you might need to explore the environment more.
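
For context on the -200 figure mentioned above, the following sketch restates the per-step reward as given in the Pendulum-v1 documentation, r = -(theta^2 + 0.1 * theta_dot^2 + 0.001 * torque^2), with theta normalized to [-pi, pi]. It is a standalone re-implementation for illustration, not the environment's own code. The constants assume the documented bounds of 8 for angular velocity and 2 for torque, and the function names are chosen here.

```python
import math

MAX_SPEED = 8.0   # assumed bound on angular velocity, per the docs
MAX_TORQUE = 2.0  # assumed bound on applied torque, per the docs


def angle_normalize(theta):
    # Wrap an angle into the interval [-pi, pi).
    return ((theta + math.pi) % (2 * math.pi)) - math.pi


def pendulum_reward(theta, theta_dot, torque):
    # Negative cost: zero only when upright, still, and with no torque.
    return -(
        angle_normalize(theta) ** 2
        + 0.1 * theta_dot ** 2
        + 0.001 * torque ** 2
    )


best_step = pendulum_reward(0.0, 0.0, 0.0)
worst_step = pendulum_reward(math.pi, MAX_SPEED, MAX_TORQUE)
worst_return_200 = 200 * worst_step

print(f"best per-step reward:  {best_step:.4f}")
print(f"worst per-step reward: {worst_step:.4f}")
print(f"worst 200-step return: {worst_return_200:.2f}")
```

Because every step's reward is non-positive, an episode return can never exceed 0. How close an agent gets to 0 depends on how quickly it swings the pendulum upright from its random starting state.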