openai / maddpg

Code for the MADDPG algorithm from the paper "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments"

Home Page: https://arxiv.org/pdf/1706.02275.pdf

It seems that the training is decentralized?

pengzhenghao opened this issue

I have looked through train.py and found that you guys give each agent its own trainer:

def get_trainers(env, num_adversaries, obs_shape_n, arglist):
    trainers = []
    model = mlp_model
    trainer = MADDPGAgentTrainer
    for i in range(num_adversaries):
        trainers.append(trainer(
            "agent_%d" % i, model, obs_shape_n, env.action_space, i, arglist,
            local_q_func=(arglist.adv_policy=='ddpg')))
    for i in range(num_adversaries, env.n):
        trainers.append(trainer(
            "agent_%d" % i, model, obs_shape_n, env.action_space, i, arglist,
            local_q_func=(arglist.good_policy=='ddpg')))
    return trainers

Maybe my understanding is wrong, but centralized training, as I understand it, means using a single shared model to learn the Q-function. So I can't understand why each agent is assigned its own trainer (with its own model inside the trainer), since you end up with many models to train rather than only one.
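What I expected (a single shared model) would look roughly like the following TF 1.x sketch. This is purely illustrative and not code from this repository; the helper name shared_q and the shapes are made up. Every agent's Q-value is built under the same variable scope, so they all read and update one set of weights.

import tensorflow as tf
import tensorflow.contrib.layers as layers

def shared_q(inputs, scope="shared_q", reuse=None, num_units=64):
    # All agents build their critic under the SAME scope, so the parameters
    # are created once and then reused on every subsequent call.
    with tf.variable_scope(scope, reuse=reuse):
        out = layers.fully_connected(inputs, num_outputs=num_units, activation_fn=tf.nn.relu)
        return layers.fully_connected(out, num_outputs=1, activation_fn=None)

x = tf.placeholder(tf.float32, [None, 20])
q_agent_0 = shared_q(x)              # creates shared_q/fully_connected/... variables
q_agent_1 = shared_q(x, reuse=True)  # reuses exactly the same variables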

And I found that even though you use reuse=True in tf.variable_scope, each agent's trainer builds a model with variable names like "agent_0/fully_connected/weights". That means the weights and biases of the models are not shared at all: agent_0 has its own model, agent_1 has its own model, and so on.
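By contrast, here is a rough sketch of what the per-agent scopes produce (again illustrative, not the repo's exact code; build_q and the shapes are made up): distinct outer scopes give each agent its own set of variables, and reuse only applies within a scope, not across agents.

import tensorflow as tf
import tensorflow.contrib.layers as layers

def build_q(scope_name, inputs, num_units=64):
    # Each agent builds its critic under its OWN outer scope, so variables
    # such as agent_0/fully_connected/weights are independent per agent.
    with tf.variable_scope(scope_name):
        out = layers.fully_connected(inputs, num_outputs=num_units, activation_fn=tf.nn.relu)
        return layers.fully_connected(out, num_outputs=1, activation_fn=None)

x = tf.placeholder(tf.float32, [None, 20])
q0 = build_q("agent_0", x)  # variables under agent_0/...
q1 = build_q("agent_1", x)  # separate variables under agent_1/...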

So how can you say that the training of this multi-agent system is centralized?

Looking forward to your reply! Thanks!

Hi,
It is true that each agent learns its own policy and Q-function. The training is centralized in the sense that the inputs to each agent's Q-function depend on the actions and observations of all the agents. For the training to be considered fully 'decentralized', each agent's policy and value function would usually have to depend only on that agent's own observations and actions. This usage is consistent with other papers in the literature (see e.g. the similar work on COMA, https://arxiv.org/pdf/1705.08926.pdf).
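Concretely, the difference is in what is fed to each agent's critic. Below is a minimal illustrative sketch (not the repository's code; the function names and shapes are made up) of the two kinds of critic input.

import numpy as np

def centralized_q_input(obs_n, act_n):
    # MADDPG-style centralized critic: sees every agent's observation and action.
    return np.concatenate(obs_n + act_n)

def decentralized_q_input(obs_n, act_n, i):
    # Fully decentralized (plain DDPG) critic: sees only agent i's own data.
    return np.concatenate([obs_n[i], act_n[i]])

obs_n = [np.zeros(4)] * 3  # 3 agents, observation dim 4
act_n = [np.zeros(2)] * 3  # action dim 2
print(centralized_q_input(obs_n, act_n).shape)      # (18,) = 3*4 + 3*2
print(decentralized_q_input(obs_n, act_n, 0).shape)  # (6,)  = 4 + 2

Execution remains decentralized either way, because each agent's policy still takes only that agent's own observation. (As far as I can tell, the local_q_func flag in get_trainers above is what switches an agent's critic to the purely local, DDPG-style input.)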
It is true that each agent learns its own policy and Q-function. The training is centralized in the sense that the inputs to each Q function depend on the actions and observations of all the agents. Usually, for the training to be considered fully 'decentralized', each agent's policy and value are only functions of that agent's observation and actions. This is consistent with other papers in the literature (see e.g. the similar work on COMA, https://arxiv.org/pdf/1705.08926.pdf)