RL communicating numbers

Overall idea

The idea is to frame the problem of learning the enumeration procedure as a communication problem between two agents. Each agent observes its own scene and cannot see the other's. The agents must communicate about the numerosity of their scenes and cooperatively solve a task that requires computing the exact number of objects in each scene.

Environment

gym-like environment: env.obs(), env.step(action), env.reset(), env.reward()

Simplest observation: a binary grid in which each object is a cell set to 1:

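import random
import numpy as np

max_objects = 9                              # upper bound on objects per scene
n_objects = random.randint(1, max_objects)   # number of objects in this scene
dim = 4                                      # the scene is a dim x dim grid

obs = np.zeros((dim, dim))
# Set n_objects distinct, randomly chosen cells to 1.
obs.ravel()[np.random.choice(obs.size, n_objects, replace=False)] = 1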

obs:
array([[0., 0., 0., 1.],
       [1., 0., 0., 1.],
       [0., 0., 0., 0.],
       [1., 0., 0., 0.]])

Once both agents have said 'stop', each agent's scene-channel (the physical environment) is replaced by the external representation of the other agent.
The external representation has the same dimensions as the scene-channel and is initialized with 0s.
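
The switch from scene-channel to external representation could be sketched as follows. This is only a minimal sketch: the class name TwoAgentScenes, the per-agent arguments to obs() and step(), and the string action 'stop' are assumptions made for illustration, and reward() and all other actions are left out.

```python
import numpy as np


class TwoAgentScenes:
    """Hypothetical two-agent environment; all names here are illustrative."""

    def __init__(self, dim=4, max_objects=9, seed=None):
        self.dim = dim
        self.max_objects = max_objects
        self.rng = np.random.default_rng(seed)
        self.reset()

    def _sample_scene(self):
        # Binary grid with between 1 and max_objects cells set to 1.
        n_objects = int(self.rng.integers(1, self.max_objects + 1))
        scene = np.zeros((self.dim, self.dim))
        cells = self.rng.choice(scene.size, n_objects, replace=False)
        scene.ravel()[cells] = 1
        return scene

    def reset(self):
        self.scenes = [self._sample_scene() for _ in range(2)]
        # External representations share the scene's shape and start at 0.
        self.external = [np.zeros((self.dim, self.dim)) for _ in range(2)]
        self.stopped = [False, False]

    def obs(self, agent):
        # Once both agents have stopped, show the OTHER agent's external repr.
        if all(self.stopped):
            return self.external[1 - agent]
        return self.scenes[agent]

    def step(self, agent, action):
        # Only 'stop' is handled in this sketch; other actions are omitted.
        if action == "stop":
            self.stopped[agent] = True
        return self.obs(agent)
```

With this layout, an agent keeps seeing its own scene until the second 'stop' arrives, at which point obs() returns the other agent's external representation.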

Actions

move left/right/up/down in the scene-channel or the external representation, depending on the current phase of the episode
touch/draw: when the external representation is on, the binary at the current position switches from 0 to 1
stop
choose
answer

The choose action is taken at the same time step as the answer action.
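
One possible encoding of this action set is sketched below. The enum Action, the helper apply_action, and the choice to clip the cursor at the grid border (rather than wrap around) are assumptions for illustration; stop, choose and answer are listed but carry no grid effect here.

```python
from enum import Enum

import numpy as np


class Action(Enum):
    LEFT = 0
    RIGHT = 1
    UP = 2
    DOWN = 3
    TOUCH_DRAW = 4
    STOP = 5
    CHOOSE = 6
    ANSWER = 7


# (row, column) offsets for the four movement actions.
MOVES = {
    Action.LEFT: (0, -1),
    Action.RIGHT: (0, 1),
    Action.UP: (-1, 0),
    Action.DOWN: (1, 0),
}


def apply_action(action, pos, external, external_on):
    """Return the new cursor position; draw in place if the external repr. is on."""
    row, col = pos
    if action in MOVES:
        d_row, d_col = MOVES[action]
        # Assumption: the cursor is clipped at the border, no wrap-around.
        row = min(max(row + d_row, 0), external.shape[0] - 1)
        col = min(max(col + d_col, 0), external.shape[1] - 1)
    elif action is Action.TOUCH_DRAW and external_on:
        # Drawing switches the binary at the current position from 0 to 1.
        external[row, col] = 1
    return (row, col)
```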

(?) Start out with a simple action space: touch, stop, choose and answer. Would the agents then start subitizing instead of touching each of the objects? Answer space: two units representing larger/smaller or equal/different.

NN-Architecture

ConvLSTM or CNN. Input and output dimensions depend on the observation and action spaces. Are there any restrictions on the output? One option is to allow only one action at a time (softmax). Otherwise, each output unit can freely take a value between 0 and 1, e.g. with a sigmoid activation function.
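
The difference between the two output options can be illustrated with plain NumPy. The logits below are made-up values standing in for the network's final layer, and the 0.5 threshold for the sigmoid case is an assumption.

```python
import numpy as np


def softmax(logits):
    # Subtract the maximum for numerical stability.
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()


def sigmoid(logits):
    return 1.0 / (1.0 + np.exp(-logits))


# Made-up logits for four hypothetical actions.
logits = np.array([2.0, 0.5, -1.0, 0.0])

# Softmax: probabilities sum to 1, so exactly one action is picked.
p_exclusive = softmax(logits)
single_action = int(np.argmax(p_exclusive))

# Sigmoid: each unit is independent, so several actions can be active at once.
p_independent = sigmoid(logits)
active_units = p_independent > 0.5
```

With softmax the agent commits to a single action per step; with sigmoid units, actions such as choose and answer could fire in the same time step.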

RL-algorithm

Start with an RL algorithm that is simple to implement, e.g. deep Q-learning (DQL) adapted to multiple agents?
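
As a reference point, the update at the core of Q-learning is sketched below in tabular form, with one independent table per agent. This is a simplification and not deep Q-learning itself: DQL would replace the table with a network. The function name q_update, the state and action counts, and the values of alpha and gamma are assumptions.

```python
import numpy as np


def q_update(q_table, state, action, reward, next_state, done,
             alpha=0.1, gamma=0.99):
    """One tabular Q-learning step, applied in place."""
    # Terminal transitions do not bootstrap from the next state.
    if done:
        target = reward
    else:
        target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (target - q_table[state, action])
    return q_table


# Assumed sizes, for illustration only.
n_states, n_actions = 5, 4

# Independent learners: each agent keeps its own table.
q_tables = [np.zeros((n_states, n_actions)) for _ in range(2)]

# Both agents receive the same cooperative reward at the end of an episode.
for q in q_tables:
    q_update(q, state=0, action=1, reward=1.0, next_state=2, done=True)
```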

Task reward

To start: the environment contains either 0 or 1 objects (presence/absence):

Final reward: given when the answer action is correct.

What does it mean if an agent fails to answer correctly? Either one of the two agents produced a wrong representation, or the representations were fine and the inference from them was wrong.

Punish the action 'choose' when it comes before 'stop', or use the same output node for both. Prefer the second option in the beginning.
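
The first option (an explicit punishment) could look roughly like this for the equal/different answer space. The function names, the string labels, and the reward and penalty values are all assumptions, not settled design choices.

```python
def correct_answer(n_a, n_b):
    # Ground truth for the equal/different version of the task.
    return "equal" if n_a == n_b else "different"


def final_reward(answer, n_a, n_b, chose_before_stop,
                 reward=1.0, penalty=-1.0):
    """Reward at the end of an episode (values are placeholders)."""
    # Punish 'choose' issued before 'stop', regardless of the answer.
    if chose_before_stop:
        return penalty
    # Otherwise reward only a correct answer.
    return reward if answer == correct_answer(n_a, n_b) else 0.0
```

For the presence/absence stage, n_a and n_b would each be 0 or 1.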

Later: the environment can contain multiple objects

Final reward: given when the answer action is correct.

Intermediate rewards: touching objects (?), no number words.
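
If touching objects is rewarded, one way to avoid paying twice for the same object is to track which cells were already touched. The helper touch_reward, the boolean touched mask, and the bonus value are assumptions for illustration.

```python
import numpy as np


def touch_reward(scene, touched, pos, bonus=0.1):
    """Give a small bonus the first time an object cell is touched."""
    row, col = pos
    if scene[row, col] == 1 and not touched[row, col]:
        touched[row, col] = True  # remember the touch (in place)
        return bonus
    # Empty cells and repeated touches earn nothing.
    return 0.0


scene = np.zeros((4, 4))
scene[1, 0] = 1
touched = np.zeros_like(scene, dtype=bool)

first = touch_reward(scene, touched, (1, 0))
repeat = touch_reward(scene, touched, (1, 0))
empty = touch_reward(scene, touched, (2, 2))
```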

flavio2018 / counting-agents