zhangyx96 / value-difference-model

value-difference-model

Prepare the environment

git clone https://github.com/IrisLi17/value-difference-model
cd value-difference-model
conda create -n <your_name> python=3.5
conda activate <your_name>
pip install -r requirements.txt

Train the model

python run_model_based_rl.py trpo -env <env_name>

<env_name> must be one of half-cheetah, swimmer, snake, ant, humanoid.

half-cheetah, swimmer, and snake take a few hours to converge. ant takes roughly 3 days to converge and occasionally hits a segmentation fault on my machine.
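For example, to train on the half-cheetah environment:

python run_model_based_rl.py trpo -env half-cheetah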

The logs are saved to data/local/<env_name>/<env_name>_DATETIME_0001 by default. progress.csv contains real_current_validation_cost, which is the negative of the reward accumulated so far.
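A minimal sketch for plotting the reward curve from progress.csv, assuming pandas and matplotlib are installed and that each row corresponds to one training iteration (an assumption); replace the run directory with your actual run folder:

import os
import pandas as pd
import matplotlib.pyplot as plt

# Replace with your actual run directory (DATETIME is a placeholder).
run_dir = "data/local/half-cheetah/half-cheetah_DATETIME_0001"
progress = pd.read_csv(os.path.join(run_dir, "progress.csv"))

# real_current_validation_cost is the negative of the reward, so negate it.
plt.plot(-progress["real_current_validation_cost"])
plt.xlabel("iteration")
plt.ylabel("reward (negated validation cost)")
plt.show()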

You can use TensorBoard to monitor additional intermediate results:

tensorboard --logdir <tf_logging_dir> --port <port_number>

If you are training on a remote server, you will also need SSH port forwarding to view TensorBoard on your local machine.
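For example, a standard SSH local port forward (<user> and <server> are placeholders for your own credentials):

ssh -L <port_number>:localhost:<port_number> <user>@<server>

Then open http://localhost:<port_number> in your local browser.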

Change training configuration

To switch between the original dynamics loss definition and the two proposed losses, modify sandbox/thanard/me-trpo/params/params-<env>.json.

dynamics_opt_params/use_value and dynamics_opt_params/dvds_weighting are the most relevant keys; a programmatic sketch follows the list below.

original loss: use_value=False, dvds_weighting=False.

$L^{(1)}$: use_value=False, dvds_weighting=True.

$L^{(0)}$: use_value=True, dvds_weighting=False.
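A minimal sketch for setting these flags from Python, assuming the params file is plain JSON and that both keys sit under a top-level dynamics_opt_params dictionary (inferred from the slash notation above); the path uses half-cheetah as an example:

import json

path = "sandbox/thanard/me-trpo/params/params-half-cheetah.json"
with open(path) as f:
    params = json.load(f)

# original loss: use_value=False, dvds_weighting=False
# L^(1):         use_value=False, dvds_weighting=True
# L^(0):         use_value=True,  dvds_weighting=False
params["dynamics_opt_params"]["use_value"] = False
params["dynamics_opt_params"]["dvds_weighting"] = True

with open(path, "w") as f:
    json.dump(params, f, indent=2)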

Visualize trained policy

Currently it cannot run on our server.

TODO: you will need to manually modify line 612 in model_based_rl.py to specify the path of the saved model. See my comment there.

Afterwards, run:

python run_model_based_rl.py trpo -env <env_name> -perform

License

MIT License

