lehduong / Job-Scheduling-with-Reinforcement-Learning

Learning in noisy MDPs (governed by stochastic, exogenous input processes) with an input-dependent baseline


Learning to Assign Credit in Input-driven Environment (LACIE) reduces the variance of advantage estimation in noisy MDPs by using a hindsight distribution over the input process.

Input-driven MDP

Input-driven MDPs are Markov decision processes governed not only by the agent's actions but also by stochastic, exogenous input processes [1]. These environments have inherently high variance, which makes it hard to learn an optimal policy.
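
For intuition, here is a toy example (my own illustration, not code from this repository) of an input-driven environment: a load balancer assigns incoming jobs to servers, but the job-size sequence is an exogenous input process, so the same policy can receive very different returns depending on the inputs it happens to face.

```python
import numpy as np

class ToyInputDrivenEnv:
    """Toy input-driven MDP: rewards depend on the agent's routing decisions
    AND on an exogenous job-size sequence the agent does not control."""

    def __init__(self, num_servers=2, horizon=50, seed=0):
        self.num_servers = num_servers
        self.horizon = horizon
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t = 0
        self.loads = np.zeros(self.num_servers)
        # Exogenous input process: job sizes are sampled independently of the policy.
        self.jobs = self.rng.exponential(scale=1.0, size=self.horizon)
        return np.concatenate([self.loads, [self.jobs[self.t]]])

    def step(self, action):
        self.loads[action] += self.jobs[self.t]           # route the job to a server
        reward = -self.loads.max()                        # penalize the busiest server
        self.loads = np.maximum(self.loads - 1.0, 0.0)    # servers drain one unit per step
        self.t += 1
        done = self.t >= self.horizon
        next_job = self.jobs[self.t] if not done else 0.0
        return np.concatenate([self.loads, [next_job]]), reward, done, {}
```

Two rollouts with the same policy but different job sequences can earn very different returns; that spread is exactly the variance an input-dependent baseline tries to remove.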

This repository implements:

  • The input-dependent baseline proposed in [1] (see the sketch after this list).

  • LACIE, an algorithm that learns to weight the advantages of each rollout in hindsight with respect to future input sequences.
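
As a rough illustration of the first point, the sketch below conditions the value baseline on the exogenous input sequence of a rollout, so the part of the return explained by the inputs is subtracted out of the advantage. This is only a minimal sketch of the idea in [1]; the class and argument names are mine and do not correspond to this repository's actual modules.

```python
import torch
import torch.nn as nn

class InputDependentBaseline(nn.Module):
    """Value baseline conditioned on both the state and the exogenous
    input sequence, so advantages are computed relative to what was
    achievable given that particular input realization."""

    def __init__(self, state_dim, input_dim, hidden_dim=64):
        super().__init__()
        # Summarize the rollout's input sequence with an LSTM.
        self.input_encoder = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.value_head = nn.Sequential(
            nn.Linear(state_dim + hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state, input_sequence):
        # state: (batch, state_dim); input_sequence: (batch, seq_len, input_dim)
        _, (h, _) = self.input_encoder(input_sequence)
        value = self.value_head(torch.cat([state, h[-1]], dim=-1))
        return value.squeeze(-1)

# Advantage with an input-dependent baseline:
#   A(s_t, a_t) = G_t - V(s_t, z), where z encodes the rollout's input sequence.
```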

Install Dependencies

  1. Install PyTorch
pip install torch torchvision
  2. Install TensorFlow 2
pip install tensorflow==2.2

or

pip install tensorflow-gpu==2.2
  3. Install OpenAI Baselines (the TensorFlow 2 branch)
git clone https://github.com/openai/baselines.git -b tf2 && \
cd baselines && \
pip install -e .

Note: I haven't tested the code with TensorFlow 1 yet, but it should work as well.

  4. Install the Park platform. I modified the platform slightly to make it compatible with OpenAI Baselines.
git clone https://github.com/lehduong/park &&\
cd park && \
pip install -e .

Run experiments

See the scripts directory for example commands.

Results:

Reward of A2C+LACIE (yellow) vs. A2C (blue) during training: [reward plot]

Value loss of A2C+LACIE (yellow) vs. A2C (blue) during training: [value-loss plot]

Reference

[1] Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, and Mohammad Alizadeh. Variance Reduction for Reinforcement Learning in Input-Driven Environments. ICLR 2019.

Acknowledgement

The starter code is based on ikostrikov's repository.

License: Apache License 2.0

