Howuhh / evolution_strategies_openai

Implementation of the OpenAI paper "Evolution Strategies as a Scalable Alternative to Reinforcement Learning".


Evolution Strategies OpenAI

This implementation is strictly for educational purposes and is not distributed (unlike in the paper), but it works.

Example

from training import run_experiment, render_policy

example_config = {
    "experiment_name": "test_BipedalWalker_v0",
    "plot_path": "plots/",
    "model_path": "models/", # optional
    "log_path": "logs/", # optional
    "init_model": "models/test_BipedalWalker_v5.0.pkl",  # optional
    "env": "BipedalWalker-v3",
    "n_sessions": 128,
    "env_steps": 1600, 
    "population_size": 256,
    "learning_rate": 0.06,
    "noise_std": 0.1,
    "noise_decay": 0.99, # optional
    "lr_decay": 1.0, # optional
    "decay_step": 20, # optional
    "eval_step": 10, 
    "hidden_sizes": (40, 40)
  }

policy = run_experiment(example_config, n_jobs=4, verbose=True)

# to render policy performance
render_policy(model_path, env_name, n_videos=10)

Implemented

  • OpenAI ES algorithm [Algorithm 1]; a minimal sketch is included below.
  • Z-normalization fitness shaping (not rank-based).
  • Parallelization with joblib.
  • Training for 6 OpenAI gym envs (3 solved).
  • Simple three-layer net as a policy example.
  • Learning rate & noise std decay.

[Algorithm 1: pseudocode figure from the paper]
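
For orientation, here is a minimal, self-contained sketch of the core update from Algorithm 1 with z-normalization fitness shaping and joblib parallelization. It does not mirror this repo's internal API: evaluate, es_step, and all hyperparameter values are illustrative placeholders.

import numpy as np
from joblib import Parallel, delayed

def evaluate(params):
    # placeholder fitness: in practice, run one episode with a policy
    # parameterized by `params` and return the total episode reward
    return -float(np.sum(params ** 2))

def es_step(theta, population_size=256, noise_std=0.1, learning_rate=0.06, n_jobs=4):
    # sample one Gaussian noise vector per population member
    eps = np.random.randn(population_size, theta.size)

    # evaluate the perturbed parameters in parallel with joblib
    rewards = Parallel(n_jobs=n_jobs)(
        delayed(evaluate)(theta + noise_std * e) for e in eps
    )
    rewards = np.asarray(rewards, dtype=np.float64)

    # z-normalization fitness shaping (instead of rank-based shaping)
    shaped = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # natural-gradient estimate and parameter update (Algorithm 1)
    grad = eps.T @ shaped / (population_size * noise_std)
    return theta + learning_rate * grad

theta = np.zeros(8)
for _ in range(100):
    theta = es_step(theta)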

Experiments

CartPole

Solved quickly and easily, especially if the population size is increased. However, it is important to control the learning rate and noise std and keep both on the small side: this task needs little exploration, and collecting plenty of reward feedback is enough for the natural-gradient estimate.
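
A hypothetical config along these lines (field names taken from the example above; the values are purely illustrative, not the exact ones used in these experiments):

cartpole_config = {
    "experiment_name": "test_CartPole",
    "plot_path": "plots/",
    "env": "CartPole-v1",
    "n_sessions": 64,
    "env_steps": 500,
    "population_size": 256,  # a larger population helps
    "learning_rate": 0.01,   # keep the learning rate small
    "noise_std": 0.05,       # little exploration is needed here
    "eval_step": 10,
    "hidden_sizes": (32, 32),
}

policy = run_experiment(cartpole_config, n_jobs=4, verbose=True)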

LunarLander

As in the previous task, the algorithm does well here. It is again important to set a small learning rate, but to slightly increase the noise std.

LunarLanderContinuous

The continuous env is solved much faster and better, probably thanks to its denser reward. Interestingly, the agent also learned to land faster: instead of firing the engines immediately, it turns them on only shortly before landing.

MountainCarContinuous

Can't solve it yet.

In the discrete version of the env, the main problem is the sparse reward, which is given only at the very end, after the car climbs the hill. Since an agent with random weights does not manage to do this within the 200-step limit, the natural-gradient estimate is zero and training gets stuck. A workaround: remove the 200-step limit and wait for the random agent to climb the mountain on its own, collecting the first reward :). However, this is not quite fair.
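
For illustration, one way to lift the limit with the standard gym TimeLimit wrapper (a general gym mechanism, not necessarily how it is handled in this repo):

import gym

# gym.make wraps MountainCar-v0 in a TimeLimit of 200 steps;
# unwrapping removes the wrappers, so a random policy can eventually reach the flag
env = gym.make("MountainCar-v0")
env_no_limit = env.unwrapped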

In the continuous env, the main problem is the lack of exploration. The agent quickly (faster than it learns to climb the hill) realizes that the best strategy is to stand still and collect a reward of 0, which is much higher than the reward obtained while moving.

A possible solution: novelty search. As a novelty function one could take the velocity, velocity * x_coord, or the x_coord at the end of the episode. Reward shaping may improve convergence for DQN/CEM methods, but here it does not produce better results.
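
A toy sketch of the idea as a novelty bonus mixed into the fitness (a simplification of full archive-based novelty search; the function names and the mixing scheme are illustrative only):

import numpy as np

def novelty(final_observation):
    # MountainCarContinuous observations are [x_coord, velocity];
    # use the final position as the novelty signal
    # (velocity or velocity * x_coord are alternatives)
    x_coord, velocity = final_observation
    return float(x_coord)

def shaped_fitness(total_reward, final_observation, novelty_weight=1.0):
    # mix the env reward with the novelty bonus; the weight is a
    # hyperparameter and this mixing is only an illustration
    return total_reward + novelty_weight * novelty(final_observation)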

BipedalWalker

Not solved yet. More iterations are needed.

References

Evolution Strategies as a Scalable Alternative to Reinforcement Learning (Tim Salimans, Jonathan Ho, Xi Chen, Ilya Sutskever), arXiv:1703.03864.
