Extended Environments
Empirically estimate how self-reflective a reinforcement learning agent is. This proof-of-concept library contains 25 so-called "extended environments" and infrastructure for running a reinforcement learning agent against them. Performing well on average across the space of all extended environments seems to require that an agent self-reflect (see Theory below); therefore it should be possible to empirically estimate an agent's self-reflection ability by running it across a suite of benchmark extended environments such as those in this library.
Note: As this library is first of its kind, we have made no attempt to optimize it; it is meant as a proof of concept. Rather than strenuously optimize the environments in the library, we have instead designed environments of theoretical interest. Measurements obtained from this library should not be used to make real-world policy decisions related to self-reflection.
Theory
In an ordinary obstacle course, things happen based on what you do: step on a button and spikes appear, for example. Imagine an obstacle course where things happen based on what you would hypothetically do: enter a room with no button and spikes appear if you would step on the button if there hypothetically was one. Such an environment would be impossible to stage for a human participant, because it is impossible to determine what a human would hypothetically do in some counterfactual scenario. But if we have the source-code of an AI participant, then we can determine what that participant would do in hypothetical scenarios, and so we can put AI participants into such obstacle courses.
An extended environment is a reinforcement learning environment which is able to simulate the agent and use the results when determining which rewards and observations to send to the agent. Although this is a departure from traditional RL environments (which are not able to simulate agents), a traditional RL agent requires no modification in order to interact with an extended environment. Thus, extended environments can be used to benchmark RL agents in ways that traditional RL environments cannot.
If an agent does not self-reflect about its own actions, then an extended environment might be difficult for the agent to figure out. Therefore, our thesis is that self-reflection is needed for an agent to achieve good performance on average over the space of all extended environments. This would imply that by measuring how an agent performs across a battery of such environments, it is possible to empirically estimate how self-reflective an agent is.
Installation
Note: The library has been built and tested using Python 3.6; we recommend that version or later.
Install using pip
As with other Python packages, we recommend installing ExtendedEnvironments in a virtualenv or a conda environment.
To install, `cd` into the cloned repository and do a local pip install:

```shell
cd ExtendedEnvironments
pip install -e .
```
Optionally, if you wish to replicate the experiments in our paper, see `extended_rl/experiments/InstallingExperimentPrerequisites.txt` for how to install Stable Baselines3 (needed for the DQN/A2C/PPO agents measured in our paper).
Documentation
See `example.py` for an example where we define a simple agent-class and then estimate the self-reflectiveness of instances of that class.
selfrefl_measure
The library's main function is:

```python
from extended_rl import selfrefl_measure

selfrefl_measure(A, num_steps)
```

...where:

- `A` is an agent-class (see below)
- `num_steps` is the number of steps to run the agent in each environment

This function returns the average reward-per-turn after running instances of `A` in 25 extended environments and their opposites, running for `num_steps` steps in each environment. (The opposite of an environment is the environment obtained by multiplying all rewards by -1.)
For finer-grained details about the average reward-per-turn on each environment, call `selfrefl_benchmark` instead (it has the same signature as `selfrefl_measure` but returns a dictionary telling what average reward-per-turn is achieved in each environment).
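To illustrate the notion of an opposite environment, here is a minimal sketch. Both classes below (`OppositeEnv` and the toy environment `AlwaysRewardOne`) are our own illustrations, not part of the library's API: an opposite environment forwards everything unchanged except that rewards are negated.

```python
class OppositeEnv:
    """Wrap an environment, negating every reward it emits.

    Illustrative sketch only; the library constructs opposites internally.
    """
    def __init__(self, env):
        self.env = env
        self.n_actions = env.n_actions
        self.n_obs = env.n_obs

    def start(self):
        return self.env.start()

    def step(self, action):
        reward, obs = self.env.step(action)
        return (-reward, obs)  # same observation, reward multiplied by -1


class AlwaysRewardOne:
    """Toy environment: one observation, reward +1 every step."""
    n_actions, n_obs = 2, 1

    def start(self):
        return 0

    def step(self, action):
        return (1, 0)
```

For instance, wrapping `AlwaysRewardOne` yields an environment that emits reward -1 every step.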
Agents
An agent-class is a Python class of the following form:
```python
class A:
    def __init__(self, **kwargs):
        ...

    def act(self, obs):
        ...
        return action

    def train(self, o_prev, a, r, o_next):
        ...
```
...where, for the `act` method:

- The intuition is that the method tells how the agent will act in response to a given observation.
- `obs` is an observation (a natural number below `self.n_obs`)
  - (`self.n_obs` will be set automatically when instances of `A` are placed in environments)
- `action` is an action (a natural number below `self.n_actions`)
  - (`self.n_actions` will be set automatically when instances of `A` are placed in environments)

...and, for the `train` method:

- The intuition is that the agent has taken action `a` in response to observation `o_prev`, received reward `r` for doing so, and this has caused the new observation to be `o_next`; the agent is to modify itself accordingly.
- `o_prev` and `o_next` are observations (natural numbers below `self.n_obs`)
- `a` is an action (a natural number below `self.n_actions`)
- `r` is a reward (a number)
For example, here is an agent-class whose instances take the first available action which has not previously yielded a punishment for the observation in question (or action 0 if there is no such action).
```python
class SimpleAgent:
    def __init__(self, **kwargs):
        self.was_action_punished = {
            (o, a): False
            for o in range(self.n_obs)
            for a in range(self.n_actions)
        }

    def act(self, obs):
        for action in range(self.n_actions):
            if not self.was_action_punished[obs, action]:
                return action
        return 0

    def train(self, o_prev, a, r, o_next):
        if r < 0:
            self.was_action_punished[o_prev, a] = True
```
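To show how the `act`/`train` protocol fits together, here is a self-contained sketch of an agent/environment interaction loop. The driver function and both toy classes are our own illustrations (the library's actual driver may differ, and in the library `n_obs`/`n_actions` are set automatically rather than by hand):

```python
def run(agent_class, env_class, num_steps):
    """Sketch of an interaction loop: act, step, train, repeat.

    Returns the average reward-per-turn. Illustration only.
    """
    env = env_class(agent_class)
    agent = agent_class()
    # Hand-set here for the sketch; the library does this automatically.
    agent.n_obs = env_class.n_obs
    agent.n_actions = env_class.n_actions

    total_reward = 0
    obs = env.start()
    for _ in range(num_steps):
        action = agent.act(obs)
        reward, next_obs = env.step(action)
        agent.train(obs, action, reward, next_obs)
        total_reward += reward
        obs = next_obs
    return total_reward / num_steps


class EchoEnv:
    """Toy (non-extended) environment: rewards action 1, punishes action 0."""
    n_actions, n_obs = 2, 1

    def __init__(self, A):
        pass

    def start(self):
        return 0

    def step(self, action):
        return (1 if action == 1 else -1, 0)


class FixedAgent:
    """Toy agent that always takes action 1 and never learns."""
    def __init__(self, **kwargs):
        pass

    def act(self, obs):
        return 1

    def train(self, o_prev, a, r, o_next):
        pass
```

Running `run(FixedAgent, EchoEnv, 10)` returns an average reward-per-turn of 1.0, since the agent always takes the rewarded action.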
Environments
An environment is a class of the following form:
```python
class E:
    n_actions, n_obs = M, N

    def __init__(self, A):
        ...

    def start(self):
        ...
        return obs

    def step(self, action):
        ...
        return (reward, obs)
```
...where:

- `M` is a positive integer representing how many actions an agent may choose from
- `N` is a positive integer representing how many observations are possible
- `A` is an agent-class which can be used to instantiate copies of the agent; these copies can be used to simulate the agent in order to inspect its behavior in hypothetical circumstances
- `obs` is an observation (a natural number below `self.n_obs`)
- `action` is an action (a natural number below `self.n_actions`)
- `reward` is a reward (a number)
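To make the simulation idea concrete, here is a minimal extended environment of our own devising (an illustration, not one of the library's 25). Each step it instantiates a fresh copy of the agent, asks that copy what it would do in response to the current observation, and rewards the real agent for choosing the same action its untrained copy would choose:

```python
class MatchYourselfEnv:
    """Hypothetical extended environment (our illustration).

    Rewards the agent +1 when its action matches the action a freshly
    instantiated copy of itself would take, and -1 otherwise.
    """
    n_actions, n_obs = 2, 1

    def __init__(self, A):
        self.A = A  # keep the agent-class so we can simulate copies

    def start(self):
        return 0

    def step(self, action):
        sim = self.A()  # simulate a fresh, untrained copy of the agent
        # Hand-set here for the sketch; the library does this automatically.
        sim.n_obs, sim.n_actions = self.n_obs, self.n_actions
        hypothetical = sim.act(0)
        reward = 1 if action == hypothetical else -1
        return (reward, 0)
```

A traditional environment could not compute `hypothetical`, because it has no access to the agent-class `A`; this access is exactly what makes the environment "extended".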
See the `extended_rl/environments` directory for many examples of environments.