LeCAR-Lab / CoVO-MPC

Official implementation of the paper "CoVO-MPC: Theoretical Analysis of Sampling-based MPC and Optimal Covariance Design", accepted to L4DC 2024. CoVO-MPC is an optimal sampling-based MPC algorithm.
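
For context, a minimal sketch of the sampling-based MPC loop the paper analyzes: a nominal control sequence is perturbed with Gaussian noise whose covariance is the design object of CoVO-MPC, the perturbed sequences are rolled out, and the nominal sequence is updated by a cost-weighted average. All names, the toy dynamics/cost interfaces, and the softmax weighting are illustrative assumptions, not the repository's API.

    import jax
    import jax.numpy as jnp

    def sampling_mpc_step(key, x0, u_nominal, sigma, dynamics, cost,
                          n_samples=256, temperature=1.0):
        """One sampling-based MPC update (MPPI-style sketch). CoVO-MPC's
        contribution is how the covariance `sigma` is chosen; here it is an input."""
        horizon, act_dim = u_nominal.shape
        noise = jax.random.multivariate_normal(
            key, jnp.zeros(act_dim), sigma, shape=(n_samples, horizon))
        u_samples = u_nominal[None] + noise                    # (N, H, act_dim)

        def rollout(u_seq):
            def step(x, u):
                x_next = dynamics(x, u)
                return x_next, cost(x_next, u)
            _, stage_costs = jax.lax.scan(step, x0, u_seq)
            return stage_costs.sum()

        total_costs = jax.vmap(rollout)(u_samples)             # (N,)
        weights = jax.nn.softmax(-total_costs / temperature)   # low cost -> high weight
        return jnp.einsum("n,nha->ha", weights, u_samples)     # new nominal sequence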

Home Page: https://lecar-lab.github.io/CoVO-MPC/


👨‍👩‍👧‍👦 MA Adaptive Project Record

jc-bao opened this issue

Research Problem

  • Adapt to other policies (like cooperative lifting)
  • Share environment information (partial observation, physical states, interactions between the two drones)

Week 1

Investigate the research problem.

  • Key results:
    • Engineering: Implement dual-quadrotor rigid-link transportation
    • Engineering: Train the environment with a centralized policy (see the sketch after this list)
  • Progress
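
For the centralized-policy item above, a rough sketch of what "centralized" could mean for the two-quadrotor rigid-link env: a single policy sees both drones (plus the payload) and outputs a joint action that is split per drone. All names and shapes are hypothetical, not the actual env interface.

    import jax.numpy as jnp

    def centralized_obs(obs_quad1, obs_quad2, obs_object):
        # One policy observes both drones' states and the shared payload state.
        return jnp.concatenate([obs_quad1, obs_quad2, obs_object], axis=-1)

    def split_joint_action(joint_action, act_dim_per_quad=2):
        # The centralized policy outputs one joint action; split it back per drone.
        return (joint_action[..., :act_dim_per_quad],
                joint_action[..., act_dim_per_quad:])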


Literature Review

Learning Vision-based Pursuit-Evasion Robot Policies

Basics

  • Addresses the task of learning strategic robot behavior, particularly pursuit-evasion interactions under real-world constraints.
  • Supervised learning: a fully-observable robot policy provides supervision for a partially-observable one.
  • The quality of the supervision for the partially-observable pursuer policy depends on two factors: the balance between diversity and optimality in the evader's behavior, and the strength of the modeling assumptions in the fully-observable policy.

Details

  • Fully-Observable Policy: future trajectory → latent intent, combined with the relative state to produce $\pi^*$
  • Partially-Observable Policy: estimate & action history → imitated latent intent, combined with the estimate to produce $\pi^p$ (see the sketch below)
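
A schematic of the two-policy structure above as a rough Flax sketch: the fully-observable (teacher) policy encodes the evader's future trajectory into a latent intent z and combines it with the relative state; the partially-observable (student) policy imitates z from its state estimate and action history. Layer sizes, module names, and the 2-D action dimension are illustrative assumptions.

    import jax.numpy as jnp
    import flax.linen as nn

    class FullyObservablePolicy(nn.Module):
        """Teacher: future evader trajectory -> latent intent z; combined with
        the relative state it produces pi* (sizes are placeholders)."""
        latent_dim: int = 8

        @nn.compact
        def __call__(self, future_traj, rel_state):
            z = nn.Dense(self.latent_dim)(nn.relu(nn.Dense(64)(future_traj.reshape(-1))))
            h = jnp.concatenate([rel_state, z])
            action = nn.Dense(2)(nn.relu(nn.Dense(64)(h)))
            return action, z

    class PartiallyObservablePolicy(nn.Module):
        """Student: state estimate + action history -> imitated latent intent
        z_hat; combined with the estimate it produces pi^p."""
        latent_dim: int = 8

        @nn.compact
        def __call__(self, est_rel_state, action_history):
            z_hat = nn.Dense(self.latent_dim)(nn.relu(nn.Dense(64)(action_history.reshape(-1))))
            h = jnp.concatenate([est_rel_state, z_hat])
            action = nn.Dense(2)(nn.relu(nn.Dense(64)(h)))
            return action, z_hat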

Learning Vision-based Pursuit-Evasion Robot Policies

Evader Policy

  • Random policy
    Motion primitives (MP) are the Cartesian product of regularly discretized linear and angular velocities (see the sketch after this list)
  • MARL policy
    $\pi^*(x^{rel},z_t)$
    The evader trains against a pre-trained, fully-observable pursuer policy (using a curriculum in which the pursuer speed is increased at fixed iterations)
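
A minimal sketch of the random evader's motion-primitive set described above (the discretization ranges and counts are made up for illustration):

    import itertools
    import numpy as np

    # Regularly discretized linear and angular velocities (example ranges).
    linear_vels = np.linspace(0.0, 1.0, 5)       # m/s
    angular_vels = np.linspace(-1.0, 1.0, 5)     # rad/s

    # Motion primitives MP = Cartesian product of the two discretizations.
    motion_primitives = np.array(list(itertools.product(linear_vels, angular_vels)))  # (25, 2)

    def random_evader_action(rng: np.random.Generator):
        # Random evader policy: pick one primitive uniformly at random.
        return motion_primitives[rng.integers(len(motion_primitives))]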

RMA Method Applied to Dualquad2d Env

  • TODO: Replace the observation with the actions of the other agent (see the sketch below); write eval functions
  • Current result: based on a simple implementation of RMA
    (result image attached)
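
For the first TODO above, one possible shape for the student observation: drop the privileged env params and instead stack the other agent's recent actions, so an RMA-style adaptation module can infer the interaction online. All names here are hypothetical.

    import jax.numpy as jnp

    def get_student_obs(own_obs, other_agent_action_history, history_len=10):
        # Replace the privileged params (cf. get_obs_paramsonly below) with the
        # other agent's last `history_len` actions, flattened.
        recent = other_agent_action_history[-history_len:].reshape(-1)
        return jnp.concatenate([own_obs, recent], axis=-1)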

RMA Method Applied to Dualquad2d Env

Render results update.
Run command: python train.py --env dualquad2d --RMA
(rendered animation attached)

RMA Method Applied to Dualquad2d Env

Record of some revisions:

  • Added render_fn to render the dualquad2d env; revised env.reset and controller.update_params
  • get_obs_paramsonly() TODO: how to include the future state? In the form of a future trajectory? (see the future-state note below)
    # Needs: from functools import partial; import jax; import jax.numpy as jnp; import chex
    @partial(jax.jit, static_argnums=(0,))
    def get_obs_paramsonly(self, state: EnvStateDual2D, params: EnvParamsDual2D) -> chex.Array:
        ### TO BE REVISED
        # Normalized privileged env params exposed as the observation.
        obs_elements = [
            jnp.array(
                [
                    # mass (cut for now)
                    # (params.m - params.m_mean) / params.m_std,
                    # action_scale
                    (params.action_scale - params.action_scale_mean) / params.action_scale_std,
                    # 1st-order alpha (cut for now)
                    # (params.alpha_bodyrate - params.alpha_bodyrate_mean) / params.alpha_bodyrate_std,
                    # object mass
                    (params.mo - params.mo_mean) / params.mo_std,
                    # rope length
                    (params.l - params.l_mean) / params.l_std,
                ]
            )
        ]  # tmp: 3 active params
        obs = jnp.concatenate(obs_elements, axis=-1)
        return obs
    

For now I have cut some of the params; this is to be revised.

  • About the future state
    What kind of predefined policy should generate it?
    Current: random policy for testing, with obviously bad results
    Possible references: random, MARL, or game-theoretic policies (see the sketch below)
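
One way to answer the future-state question above: append a short window of the predefined policy's future trajectory (however it is generated: random, MARL, or game-theoretic) to the observation. A sketch under that assumption, with hypothetical names.

    import jax.numpy as jnp

    def get_obs_with_future_traj(own_obs, future_traj, horizon=5):
        # `future_traj` holds the next waypoints produced by the predefined policy;
        # take the first `horizon` of them and flatten them into the observation.
        future = future_traj[:horizon].reshape(-1)
        return jnp.concatenate([own_obs, future], axis=-1)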