tjards / Q_learning_particle

(In development.) Implementation of Neural Dyna Q-Learning. See the abstract below.


Embracing the Value of Hallucinations: Neural Dyna Q-Learning for Easy Control Problems

Status: This is still just a research proposal. It seems to be working, though. Comments need to be cleaned up and a report produced.

As described in Ref. 1, Dyna-style reinforcement learning (RL) strategies greatly improve sample efficiency by learning from "simulated" experience. When a precise first-principles model is not available, these simulations are often implemented using data-driven models. In such cases, even small modelling errors can result in poor learning outcomes. These misleading simulated experiences can be considered "hallucinations" and, for certain applications, can significantly degrade performance. We propose that the pitfalls of such hallucinations can be avoided in automatic control problems through a novel structuring of the learning process.
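
To make the Dyna idea concrete, here is a minimal tabular sketch (illustrative only, not the exact implementation in this repository): after every real transition, the agent also performs a few planning updates using transitions replayed from a learned model, which is exactly where an imperfect model injects hallucinated value.

```python
import random
from collections import defaultdict

# Minimal tabular Dyna-Q sketch; states and actions are toy placeholders.
ALPHA, GAMMA, N_PLANNING = 0.1, 0.95, 10

Q = defaultdict(float)   # Q[(state, action)]
model = {}               # model[(state, action)] = (reward, next_state)

def dyna_q_step(state, action, reward, next_state, actions):
    # 1) Direct RL update from the real transition.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

    # 2) Record the transition in the (here: tabular) model.
    model[(state, action)] = (reward, next_state)

    # 3) Planning: replay transitions sampled from the model.
    #    If the model is wrong, these updates propagate "hallucinated" value.
    for _ in range(N_PLANNING):
        s, a = random.choice(list(model.keys()))
        r, s_next = model[(s, a)]
        best = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += ALPHA * (r + GAMMA * best - Q[(s, a)])
```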

By constraining our RL problem within the well-established principles of proportional-integral-derivative (PID) control, we show that our proposed approach is sufficiently robust to absorb modelling errors in a Dyna Q-Learning framework. In short, we reduce the RL problem to filtering out clearly poor gain choices, which remains feasible even when the model is imperfect. We implement a simple neural network to model a dynamic agent in (oh gosh) real time and use this model as the basis for Dyna planning. An off-policy Dyna Q-Learning approach (similar to the on-policy Learning Automata approach of Ref. 2) is then used to tune the gain parameters of the agent's controller.
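
Roughly, the structuring looks like this (a sketch with placeholder gain grids, dimensions, and controller details, not the exact values used here): the RL "action" is a choice of PID gains, and the PID law does the low-level control work.

```python
import itertools
import numpy as np

# Illustrative discretization of the PID gain space; the RL action is a (kp, ki, kd) triple.
KP_GRID = np.linspace(0.5, 5.0, 5)
KI_GRID = np.linspace(0.0, 1.0, 5)
KD_GRID = np.linspace(0.0, 2.0, 5)
ACTIONS = list(itertools.product(KP_GRID, KI_GRID, KD_GRID))

class PID:
    """Basic PID law; the learning problem is reduced to selecting its gains."""
    def __init__(self, kp, ki, kd, dt=0.02):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_error = 0.0, 0.0

    def command(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def reward_from_trial(errors):
    # One scalar reward per trial: small accumulated tracking error is good.
    return -float(np.sum(np.square(errors)))
```

Because the action space is a small grid of gain triples, a tabular Q-function suffices, and the learned model only needs to be good enough to rank gain choices rather than reproduce trajectories exactly.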

References

  1. Taher Jafferjee, Ehsan Imani, Erin J. Talvitie, Martha White, and Michael Bowling. Hallucinating Value: A Pitfall of Dyna-style Planning with Imperfect Environment Models
  2. Peter T. Jardine, Sidney N. Givigi, and Shahram Yousefi. Leveraging Data Engineering to Improve Unmanned Aerial Vehicle Control Design

Citing

The code is open source but, if you reference this work in your own research, please cite me. I have provided an example BibTeX citation below:

@techreport{Jardine-2021,
  title       = {Embracing the Value of Hallucinations: Neural Dyna Q-Learning for Easy Control Problems},
  author      = {Jardine, P.T.},
  year        = {2021},
  institution = {Royal Military College of Canada, Kingston, Ontario},
  type        = {Research Proposal}
}

Alternatively, feel free to cite any of my related papers (published in more formal venues) listed in Google Scholar.

Preliminary Results: Neural Dyna Q-Learning

Here is a plot of the selected parameters converging. The vertical red line denotes when the Neural Dyna Q-Learning kicks in. Note that learning accelerates to the right of the red line; this illustrates that the Dyna Q-Learning is effective. The learning rate for Q-Learning is 0.005.

Development notes: Q-Learning

Here is an animation of the agent learning:

Note that the cost decreases with time. The noise is generated by the random motion of the target. In practice, what is important is that the variance of this noise decreases with time (because the agent is getting better at tracking the target as it moves).

As we reduce the exploration rate, the rewards grow with time (because the agent is exploiting the good parameters more often):
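
For reference, the exploration schedule can be sketched as a decaying epsilon-greedy policy (a common choice; the actual schedule and constants in this code may differ):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random gain set with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

# Illustrative decay schedule: explore a lot early on, exploit the good gains later.
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.99
for trial in range(500):
    # ... run one trial with epsilon_greedy(...), update Q, record the reward ...
    epsilon = max(eps_min, epsilon * eps_decay)
```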

Development notes: Neural Network

Here is the neural network getting better at predicting the agent dynamics (i.e., the prediction error decreases). Each plot represents the progress of successive minibatches.
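
As a rough sketch of what such a one-step dynamics model and its minibatch training loop can look like (the use of PyTorch, the layer sizes, and the state/action dimensions are illustrative assumptions, not necessarily what this repository uses):

```python
import torch
import torch.nn as nn

# Sketch of a one-step dynamics model f(x_t, u_t) -> x_{t+1} trained on minibatches.
STATE_DIM, ACTION_DIM, HIDDEN = 4, 2, 64

model = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, STATE_DIM),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_minibatch(states, actions, next_states):
    """One gradient step on a minibatch of recorded transitions."""
    inputs = torch.cat([states, actions], dim=1)
    pred = model(inputs)
    loss = loss_fn(pred, next_states)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Placeholder data standing in for logged transitions.
s  = torch.randn(32, STATE_DIM)
u  = torch.randn(32, ACTION_DIM)
s1 = torch.randn(32, STATE_DIM)
print(train_minibatch(s, u, s1))
```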

Here is a comparison of the neural network tracking ("ghost") versus the actual agent ("states") given the same inputs. State updates are provided between successive RL trials (i.e., every 2 seconds).
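
In other words, the "ghost" is the model rolled forward open-loop on its own predictions, snapped back to the true state at the start of each trial. A sketch, reusing the illustrative one-step model above:

```python
import torch

DT, RESYNC_PERIOD = 0.02, 2.0           # illustrative time step; 2 s matches the resync above
STEPS_PER_RESYNC = int(RESYNC_PERIOD / DT)

def ghost_rollout(model, true_states, controls):
    """Roll the learned one-step model forward open-loop ("ghost"),
    resetting it to the true state at the start of each RL trial.
    true_states: (T, STATE_DIM) tensor, controls: (T, ACTION_DIM) tensor."""
    ghost = true_states[0].clone()
    trajectory = []
    for k in range(len(controls)):
        if k % STEPS_PER_RESYNC == 0:
            ghost = true_states[k].clone()                 # state update between trials
        ghost = model(torch.cat([ghost, controls[k]]))     # predict the next state
        trajectory.append(ghost)
    return torch.stack(trajectory)
```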

Notice that it works reasonably well, but is subject to larger modelling errors during aggressive manoeuvres.

Next steps

  • Compare Dyna (neural model) vs. Dyna (perfect model) vs. non-Dyna
  • More detailed formulations, documentation, etc. coming soon.
